In the previous blog, we discussed the five stages an organization goes through when embarking on a data journey in search of the value of its data. In this follow-up blog, we provide a brief overview of the typical components that make up the data analytics platform needed for that journey.
We will refrain from going into the deep technical stuff, but it must be said that discussing the building blocks of a platform… is a bit of a techy subject.
A warehouse to store data
You cannot process what you do not store or manage. So, as a starter, data is stored in what is called a “data warehouse”: a database management system optimized to store large amounts of structured data.
Using ACID (atomicity, consistency, isolation, and durability) transactions, a data warehouse ensures the consistency and correctness of the data while multiple parties concurrently read or write it. It typically supports columnar data storage, indexing, caching and processing of data. A data warehouse is commonly used for analytics and reporting on structured data when performance, data consistency and large data volumes matter.
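To make the ACID guarantee concrete, here is a minimal sketch in Python. It uses the built-in sqlite3 module as a stand-in for a warehouse connection; the sales table and its values are made up for illustration, and real warehouses expose the same transactional semantics through their own drivers.

```python
import sqlite3

# sqlite3 stands in for a warehouse connection; real warehouses offer the
# same commit/rollback semantics through their own drivers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO sales VALUES (1, 99.95)")
        conn.execute("INSERT INTO sales VALUES (2, 149.50)")
        # If any statement in this block fails, neither insert is persisted
        # (atomicity), and concurrent readers never see a half-done state.
except sqlite3.Error as exc:
    print(f"Transaction rolled back: {exc}")

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # -> 2
```

The point is the transaction block: either both inserts become visible to other readers, or neither does.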
A lake to store other data
Besides structured data, an organization has access to unstructured data such as social media data, images, video, audio, chat, survey responses, surveillance data, geo-spatial data, weather data, etc. Furthermore, an organization has semi-structured data at its disposal, such as emails, binary executables, TCP/IP packets, zipped files and web pages.
A so-called “data lake” can be seen as a specialized file system that can store vast amounts of unstructured and semi-structured data, and is optimized for use by analytical processes.
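As an illustration, the sketch below drops a semi-structured chat event into a lake-style folder layout. A local directory stands in for object storage such as S3 or Azure Data Lake, and the “raw zone” path convention and event payload are assumptions made for the example.

```python
import json
from datetime import date
from pathlib import Path

# A local folder stands in for object storage; the raw-zone layout and
# date-based partitioning below are illustrative, not prescriptive.
lake_root = Path("datalake/raw/chat_logs")
partition = lake_root / f"ingest_date={date.today().isoformat()}"
partition.mkdir(parents=True, exist_ok=True)

# Semi-structured payload dropped into the lake as-is: no upfront schema.
event = {"user": "alice", "message": "Where is my order?", "channel": "web"}
(partition / "event-0001.json").write_text(json.dumps(event))

print(sorted(p.name for p in partition.iterdir()))
```

Note that, unlike in a warehouse, nothing forces a schema at write time; structure is only imposed later, when an analytical process reads the files.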
A house by the lake for the best of both worlds
A lake house is a layer directly on top of the data lake which can store structured, semi-structured and unstructured data. It combines the best features of a data warehouse and a data lake into a single unified platform.
A lake house offers fast and flexible onboarding of data and the ability to execute complex SQL queries, perform analytics, run machine learning models, and process real-time data. Like a data warehouse, it offers ACID transactions. Compared to the traditional data warehouse, a lake house is more cost-efficient.
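As a sketch of how this works in practice, the example below uses the open-source deltalake Python package (delta-rs), one of several lake house table formats; the table location and data are hypothetical. Each write becomes a new ACID-committed version of the table, stored as plain files on the lake.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake pandas

path = "lakehouse/sales"  # hypothetical table location on the lake

# Two writes become two ACID-committed versions of a table that lives
# directly on lake storage.
write_deltalake(path, pd.DataFrame({"order_id": [1, 2], "amount": [99.95, 149.50]}))
write_deltalake(path, pd.DataFrame({"order_id": [3], "amount": [20.00]}), mode="append")

table = DeltaTable(path)
print(table.version())    # -> 1 (versions start at 0)
print(table.to_pandas())  # reads the latest consistent snapshot
```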
A cluster to process data
Now that we have the data stored, let’s process what we have.
For that, we need an analytics cluster: a scalable distributed computing environment that consists of connected compute nodes that collaboratively process datasets. The nodes are orchestrated by a master node, which triggers worker nodes that execute tasks in parallel. Data is processed in memory, which enables lightning-fast analysis of large amounts of data. Clusters are used for data transformations, aggregations, and machine learning tasks.
The Apache Spark cluster is the best-known analytics cluster. Data transformations, aggregations, and machine learning tasks are typically coded and executed via notebooks such as Jupyter notebooks and Databricks notebooks. Coding can be done in languages like SQL, Python, Scala and R, which can all be used in parallel.
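For a flavor of what such notebook code looks like, here is a minimal PySpark sketch; the local[*] master simulates a cluster on a single machine, and the orders data is invented for the example. On a real cluster, the same code is distributed by the driver across the worker nodes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# local[*] runs Spark on all local cores; on a real cluster the same code
# is coordinated by the master node and executed by the workers in parallel.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

orders = spark.createDataFrame(
    [("north", 99.95), ("south", 149.50), ("north", 20.00)],
    ["region", "amount"],
)

# A transformation plus an aggregation, computed in memory across partitions.
totals = orders.groupBy("region").agg(F.sum("amount").alias("total"))
totals.show()

spark.stop()
```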
Dashboards, reports, self-service
Data, and the insights derived from it, need to be made available to the user audience in order to make well-informed decisions.
The reporting layer is typically the layer that gives users access to, and insight into, their data. Traditionally, data insights, reports and dashboards are prepared by data analysts and offered via a reporting portal. This dependency can become a bottleneck that hampers organizational agility in the decision-making process. Ideally, a self-service capability empowers users to explore data on their own terms. With intuitive, user-oriented tools, stakeholders can define their own questions, create their own visualizations and uncover insights that directly steer their decision-making.
Key to enabling such a self-service capability is the so-called “semantic layer.” This layer translates the rather technical view of how the data is stored into a model that reflects that same data in the organization’s own language. Using multiple models within the semantic layer provides different users with the optimal perspective for their specific reporting use case(s). It typically also supports fast development of standard reports.
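To make the idea tangible, here is a simplified sketch of a semantic layer as a set of renaming views. In practice this lives in a BI or modeling tool rather than hand-written code, and the table, column and view names below are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Technical storage: cryptic names as they might exist in the warehouse.
conn.execute("CREATE TABLE f_sls (cust_nm TEXT, ord_amt REAL)")
conn.execute("INSERT INTO f_sls VALUES ('Acme', 99.95), ('Globex', 149.50)")

# The "model": technical names mapped to the organization's own language.
semantic_model = {
    "f_sls": ("sales", {"cust_nm": "customer", "ord_amt": "order_amount"}),
}

# Expose each mapping as a view, so self-service users query business terms.
for table, (view, columns) in semantic_model.items():
    cols = ", ".join(f"{src} AS {dst}" for src, dst in columns.items())
    conn.execute(f"CREATE VIEW {view} AS SELECT {cols} FROM {table}")

print(conn.execute("SELECT customer, order_amount FROM sales").fetchall())
```

Users never touch f_sls or cust_nm; they ask questions of “sales” and “customer,” which is exactly the translation the semantic layer provides.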
Implement the right solution for your company
M2-D2 can help your organization in each stage of the data journey, with hands-on consulting and properly cost-optimized solutions for your use case. So your organization always stays on track while enjoying the adventure!