Data Lakes
The term “data lake” emerged with the advent of big data and cloud computing. The lake metaphor is an apt one: a lake is composed of a variety of elements (water, sediment, fish, and so on) and serves a variety of purposes (boating, fishing, drinking water, swimming). A data lake, likewise, is the central repository into which all data rivers flow.
What makes it different from a typical data warehouse or cube is that a data lake holds many disparate forms of data, whereas a data warehouse tends to be organized around a schema, with data neatly arranged in tables. Put another way, a warehouse imposes structure when data is written (schema-on-write), while a lake defers structure until the data is read (schema-on-read).
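The schema-on-read idea can be sketched as follows. The file layout, field names, and defaults here are illustrative assumptions, not anything from a particular product: raw records land in the lake as-is, and a consumer imposes types and defaults only at read time.

```python
import json
import tempfile
from pathlib import Path

# Simulate a tiny "data lake": raw, heterogeneous JSON records stored as-is,
# with no schema enforced at write time.
lake = Path(tempfile.mkdtemp()) / "events.jsonl"
lake.write_text(
    "\n".join([
        json.dumps({"user": "ana", "amount": "12.50", "country": "DE"}),
        json.dumps({"user": "bob", "amount": 7}),                  # missing country
        json.dumps({"user": "cy", "amount": "3.25", "extra": 1}),  # extra field
    ])
)

def read_with_schema(path):
    """Schema-on-read: impose types and defaults only when consuming."""
    for line in path.read_text().splitlines():
        raw = json.loads(line)
        yield {
            "user": str(raw["user"]),
            "amount": float(raw["amount"]),       # coerce to a number
            "country": raw.get("country", "??"),  # default for a missing field
        }

rows = list(read_with_schema(lake))
print(rows[1])  # {'user': 'bob', 'amount': 7.0, 'country': '??'}
```

A warehouse would have rejected the second and third records at load time; the lake accepts them and leaves reconciliation to each reader.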
A data lake has other capabilities as well, such as a high-speed transfer mechanism for both incoming and outgoing traffic. For instance, if network clients want to do real-time stream processing, you need enough bandwidth to handle both the incoming and outgoing streams. Even batch analytics requires a fairly robust pipeline into the data repository, given the volume of data it typically consumes.
With machine learning, and in particular when tackling a deep learning problem, the data sets are so large that it is often not feasible to move them back and forth between a network node and the cloud, or sometimes even between nodes within the cloud itself. The advantage of a data lake is that your machine learning system can access the data directly, process it, train the model (in the case of a supervised learning algorithm), and then insert the trained model back into the data lake as a component of it. The main benefit is that instead of moving the data around during model building and refinement, you leave it in one place and work with it where it is, performing analytics and machine learning on both structured and unstructured data.
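A minimal sketch of this “bring the compute to the data” pattern follows, using a local directory to stand in for the lake. The paths, field names, and the deliberately trivial least-squares “model” are illustrative assumptions; the point is the shape of the workflow: read in place, train, write the model artifact back.

```python
import csv
import json
import tempfile
from pathlib import Path

# Stand-in for a data lake: a directory the training job can read in place.
lake = Path(tempfile.mkdtemp())
train_file = lake / "raw" / "housing.csv"
train_file.parent.mkdir(parents=True)
with train_file.open("w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["sqft", "price"])
    w.writerows([[1000, 200000], [1500, 300000], [2000, 400000]])

# "Train" directly against the data where it lives: a least-squares fit of
# price = slope * sqft, kept trivial so the access pattern is the point.
num = den = 0.0
with train_file.open() as f:
    for row in csv.DictReader(f):
        x, y = float(row["sqft"]), float(row["price"])
        num += x * y
        den += x * x
model = {"slope": num / den}

# Insert the trained model back into the lake as just another artifact.
model_file = lake / "models" / "housing_model.json"
model_file.parent.mkdir(parents=True)
model_file.write_text(json.dumps(model))
print(model)  # {'slope': 200.0}
```

Only the small model artifact is ever written; the training data itself never leaves the lake, which is the economy the paragraph above describes.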
DRAFT Chap VI — Bruce Haydon ©2021, 2022