Data Engineering

Bruce Haydon
2 min read · May 9, 2022


Data engineering covers two areas: building the pipelines that transport or transform data, and the software engineering that surrounds that data. A data engineer focuses on software engineering best practices for the movement and transformation of data.

Data: Batch, Streaming & Events

In building cloud native applications, there are three paradigms to consider in the processing of data:

Batch Job: a batch job is a process or software-driven procedure, repeated on a periodic basis, that processes finite blocks of data. Batching is probably the most common paradigm and the easiest to implement.
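As a minimal sketch of the idea (the function and record names here are illustrative, not from the text): a batch job receives a finite block of records and processes all of them in one pass. In production the job would be triggered on a schedule (e.g. by cron); here it is simply called once.

```python
from typing import List

def process_batch(records: List[dict]) -> int:
    """Process one finite block of data and return how many records were handled."""
    handled = 0
    for record in records:
        # Placeholder transformation: mark each record as processed.
        record["processed"] = True
        handled += 1
    return handled

# One finite block of data, processed in a single run.
batch = [{"id": 1}, {"id": 2}, {"id": 3}]
print(process_batch(batch))  # → 3
```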

Streaming Job: processing a stream of data is more complicated and resource-intensive because the data is constantly being updated, and there are engineering challenges in handling the processing effectively so that no data is lost. A good example is anything that processes a time series of data, such as stock prices or market indices.
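One way to picture a streaming job, as a hedged sketch (the names are illustrative): consume a time series of stock prices one record at a time, emitting an updated result per record without ever holding the whole stream in memory.

```python
from typing import Iterable, Iterator

def running_average(prices: Iterable[float]) -> Iterator[float]:
    """Emit the running average after each incoming price."""
    total, count = 0.0, 0
    for price in prices:
        total += price
        count += 1
        yield total / count  # one updated result per incoming record

stream = [100.0, 102.0, 101.0]
print(list(running_average(stream)))  # [100.0, 101.0, 101.0]
```

Because the generator only keeps a running total and a count, it works equally well on an unbounded stream, which is the core constraint of streaming jobs.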

One hybrid approach is to stream data into aggregated buckets over time, and then batch-process them later. For instance, collecting a stream of data into a different bucket every 10 minutes, so that at any time each bucket contains 10 minutes' worth of stream data.
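The hybrid approach above can be sketched as follows (the names and sample records are illustrative assumptions): each streamed record is assigned to a 10-minute bucket keyed by its timestamp, so that finished buckets can later be handed to a batch job.

```python
from collections import defaultdict

BUCKET_SECONDS = 10 * 60  # 10-minute buckets, as in the example above

def bucket_key(timestamp: float) -> int:
    """Return the start time (epoch seconds) of the bucket a record falls into."""
    return int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS

# Route each streamed (timestamp, value) record into its bucket.
buckets = defaultdict(list)
for ts, value in [(0, "a"), (599, "b"), (600, "c")]:
    buckets[bucket_key(ts)].append(value)

print(dict(buckets))  # {0: ['a', 'b'], 600: ['c']}
```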

Event-Driven Job: events have evolved into the preferred mechanism for cloud native applications, given that many interactions with users or other systems can be broken down and classified into different events. For example, let's say you upload a monthly ledger of financial information to an online storage location, and that storage location (e.g. Amazon S3) responds to the upload event by processing the data through a piece of code.
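A hedged sketch of that event-driven pattern: a handler function invoked when a file lands in object storage. The event shape below loosely mirrors an S3 notification, but the field names and handler signature are illustrative assumptions, not a definitive implementation.

```python
def handle_upload(event: dict) -> str:
    """React to an object-storage upload event by identifying what to process."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # Real code would fetch the object here and run the ledger through
    # whatever processing logic the application requires.
    return f"processing {key} from {bucket}"

# A simplified example event, shaped like an S3 upload notification.
event = {"Records": [{"s3": {"bucket": {"name": "ledgers"},
                             "object": {"key": "2022-05.csv"}}}]}
print(handle_upload(event))  # processing 2022-05.csv from ledgers
```

The key difference from the batch and streaming paradigms is that nothing runs on a schedule or a loop: the code executes only when an event arrives.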

DRAFT Chap VI — Bruce Haydon ©2021, 2022
