Foundations of Data Engineering
The article will be structured into three parts: the first will introduce a couple of definitions; the second part will dive deep into the main difference between a data warehouse and a data lake; and the last part will explain the data mart in an affordable language.
Definitions
Data Lake People focus a lot on the unstructured aspect of the data lake. It should be good for storing and processing unstructured data, but the lake is much more than that. This is where you put data whose value you have not yet proven. If I’m at work and think up a new set of external data that I can have by scraping the web, the data lake is where I’d like to land that data. Then I can merge this data with content from the corporate data warehouse to prove their value to different areas of the business. Proving the value of data is an essential role of data lakes.
This article focused on the profitability of the data lake, which makes sense mainly because you haven’t yet proven the value of the data in it. Once the value is proven, you can start working with your ETL developers to see if you can build a data pipeline to get the valuable new dataset to land in your data warehouse.
This is also the point at which you must decide on the level of data governance you want to apply. Please note that not all data types will be included in the data warehouse. So, if the data is absolutely unstructured, such as images, audio, or videos, it will almost definitely be more cost-effective to keep it in the lake until its worth has been proven. This, in my opinion, is where things get really fuzzy.
Data Warehouse is where you store your existing important corporate data that is generated by business processes and consumed in reports. It should be governed in a reasonable manner, considering the confidentiality of the data and the importance of accuracy for downstream applications. In general, you want the material to be in a format that is simple to consume for business users. For this, star schemas and OLAP cubes are ideal. The emphasis here is on simplicity, ease of use, and future adaptability to new sources. Business procedures should be prioritised. What are you attempting to quantify? These are the facts that are used in fact tables. What criteria are you using to organise these facts? These are dimensions that are used in dimension tables and are listed here. Keep it simple and stupid and resist the impulse to normalise. It should look nothing like an operational database seen in a web application or transactional system.
Database is some system or software designed to store, manage, access, or work with any amount of data in an organised, structured, and productive way. Databases can do things like enforce a specific standard, validate data as it comes in or goes out, link multiple pieces of data into a chain of importance, and do other useful things to ensure your data isn’t the equivalent of that one USB stick you have with dozens of nested directories named “ideas” and “amazing information.” This is the software used to create data lakes and data warehouses. The common categories are relational, object-oriented, document-based, and distributed tastes like HDFS on Hadoop. I’ve heard a lot of people use AWS S3 as only standard object storage as a sort of back-end for data lakes and other similar things. Other important factors to consider here include whether your database is row-based, column-based, in-memory, distributed, and so on.
Data Mart: A subset of a data warehouse that is focused on a single department, area of business, or subgroups of users. It enables those users to easily retrieve and analyse information specifically related to their needs and business functions without having to sift through or worry about the rest of the data on which the data warehouse operates.
Data Warehouse vs Data Lake.
Data warehouses first appeared decades ago as a way for organisations to store and organise their data for analytics, that is, to ask questions (queries) about the data in order to reveal answers, trends, and other insights. Data warehouses also manage data marts for various work groups. Data marts frequently entail transferring a copy of the source data from the data warehouse to the various data marts.
Every day, data warehouses and their associated data marts handle thousands, if not millions, of queries. The most important queries range from reporting order trends to discovering common demographic information about customers to forecasting likely business trends.
Each data source within a traditional data warehouse has its own schema, or table row-column definition, which governs how the data is organised. As a result, the data’s attributes must be known ahead of time. Data lakes, on the other hand, store newer semi-structured data types in their native format without requiring a schema to be defined ahead of time.
Traditional data warehouse tables cannot accommodate newer, semi-structured data types from sources such as weblogs, clickstreams, mobile applications, or social media networks. This type of data must be transformed into a more structured format before it can be stored and studied in a data warehouse.
Traditional data lakes, at the very least, can store these mixed data types. But that’s only the beginning. To evaluate that data, you’ll need highly technical data analytics and data science experts, which are in short supply. If you are able to hire these experts, they may end up deriving beneficial insights from the data for an inordinate amount of time. If you rely on a traditional data warehouse or a traditional data lake, you will rarely gain all possible insights.
What is a data mart?
In this post, I’ll explain the concept of a data mart in layman’s terms. Data marts are an important component of a data warehouse, and before reading this section, you will need to clear the details from your memory so you can begin again. A data mart is a subset of all the data that stakeholders can access (stakeholders are internal users of the warehouse). The purpose of this post is to explain why I decided to subset and how I decided what goes into each.
Assume Ismael is five years old and has a large collection of toys. These characteristics are important to the analogy, but it’s also worth noting that he has a three-year-old sister, Zyra. Toys are frequently requested and received by children.It’s great that they can express their desires. To put it another way, it’s a toy. Toys can be found everywhere. Ismael will play with toys as soon as he receives them, but he will eventually end up in a toy basket.
Toys are information. Stakeholders have not meant the data that data queries no longer mean salary, so they have always asked for the data here. The toy bin, on the other hand, is a common terminated data lake in all forms and all forms. Have you ever gone through a toy bin looking for something? You will almost certainly be stabbed by a G.I. Joe’s plastic gun thing. Barbie’s feet have pushed so deeply under my nail bed that I’ve bled.
You also won’t find what you’re looking for; there is a bench included with the kit, but no large plastic hammer. Or else you’ll think it’s the strange whack-a-mole game hammer and keep looking. As a result, Ismael did not receive as many of his toys as the supplier had hoped. Consider how many times you’ve seen an aunt and they say, “Oh, does Ismael like the drum set I bought him?” and you immediately realise that the little trash can you passed by because it was missing the lid is actually an upside-down bin. Similarly, if I ask a stakeholder, “Are you making use of the new data that I obtained for you?” I know I did it wrong, and they mumble about not having time to find it or sort it out. After the first request, the toy was discarded.
When we merely supply stakeholders with a data lake, they may not find what they are searching for, or they may discover only a portion of what they are seeking for, or they may be harmed by misreported data. Is there really any point to having a lot of data if individuals can’t put it to good use? As a result, we must split the toys. The simplest approach to minimise the size of the toy box by two is to segregate Zyra and Ismael’s toys. This makes the information easier to navigate, boosting your chances of finding what you’re searching for without injuring yourself.
But Zyra still wishes she had what her brother has, and Ismael despises it when Zyra has what he doesn’t. Worst of all, it prevented them from playing together. Ismael and Zyra have their own divisions in this example; Ismael is in the marketing department, while Zyra is in the finance department. When we divide the data into a finance data warehouse and a marketing data warehouse, everything may appear exactly the same as it did at the start.
What happens, though, when the marketing department discovers that the finance has cost data that they don’t? What happens when finance discovers that marketing has consumer information that they don’t? If the data is available, there is no need to keep it hidden from anyone. But how can we maintain equality while keeping our toy bins separate? We can go out and get the two toys the kids want, but it will be costly. What if we buy the wrong one and then buy the wrong one again? Have you ever handed the incorrect toy to a child? We don’t want to duplicate the identical tables in two data stores, just as we don’t want to buy twice as many toys. This is both time and money-consuming in terms of engineering; do you want to update the table? I had a good time doing it in two places. It also doubles the likelihood of mistakes or inconsistencies. If marketing wishes to modify the customer conversion definition but does not update it for the finance data warehouse, you will be unable to explain why individuals report two distinct conversion numbers.
We want Ismael and Zyra to get the most out of their toys, but we don’t want to overspend or let them down, and we don’t want a cluttered toy box. We want all stakeholders who can utilise the dataset to have access to it, but we don’t need to increase the cost, nor do we choose a data lake that no one can use. So, what should we do? We’ve organised the toys into categories. The dummy toolbox goes into a corner with all the pieces, and the strange whack-a-mole game goes to another corner with his hammer. Ismael and Zyra could play together or independently of any toy collection in their space. It takes longer to set up all of the areas rather than just one crate per child, but it allows for more sharing and lowers the expense of changing toys.
Furthermore, because all of the pieces fit perfectly, it enables everyone to get more out of their toys. We can have a combined stack for a universal toy, such as a Barbie that can be played in a dollhouse or a Jeep GI, but it will be smaller than the original toy box. and will have precise objectives. A scheduling table is a representation of a conceivable dataset in a basic data warehouse; all departments require date information. This is what I mean by analogy.
I say data marts should be by subject area, not by department.
A lot of this is based on my own experience, but that experience began by reading The Data Warehouse Toolkit. Kimball refers to the departmental approach (Ismael and Zyra having separate bins) as Independent Data Mart Architecture, writing that
Multiple uncoordinated extracts from the same operational sources and redundant storage of analytic data are inefficient and wasteful in the long run. Without any enterprise perspective, this independent approach results in myriad standalone point solutions that perpetuate incompatible views of the organization’s performance, resulting in unnecessary organisational debate and reconciliation. We strongly discourage the independent data mart approach.
It’s safe to say I agree, and Ismael and Zyra would too.