Big data is changing the way today’s organisations do business. They generate massive amounts of data, which needs to be collected and managed. To give some idea of how much data is out there, the World Economic Forum estimates that by 2025, a staggering 463 exabytes of data will be created globally every day (1). Abundant organisational data is meaningless unless it is made accessible and useful. This is where data engineering comes in. It is the discipline that harvests raw bits and bytes and transforms them into real information that optimises performance and helps secure a competitive edge. Data engineering is vital for successful digital transformation.
Data engineering traditionally involves designing and building systems that convert massive amounts of raw data from numerous, often disparate, sources and move it into a single warehouse. There the data is available as uniform, usable information, promoting self-service amongst end users and empowering data analysts to harness the power of centralised data.
It therefore involves the collection, storage, and manipulation of data in ways that make it possible for businesses to function.
If performed expertly, data engineering delivers numerous benefits, including:
Better decision-making
Quicker access to data
Competitive advantage
Quality data, as errors and inaccuracies are prevented
Improved efficiency
Cost-effectiveness
Increased data security
As the data engineering field evolves, exciting new trends emerge. Tech-savvy organisations are racing to leverage data engineering tools and expertise to democratise their data and get their hands on it more rapidly.
In the past, organisations would rely on one business unit to distribute data to everyone else. Although this was simpler to manage, the data was far less useful, and the lack of data sharing often resulted in silos.
Businesses today increasingly embrace data democratisation, which is the process of making data understandable and available for everyone in the business, technical or non-technical. It enables the average business person to access, gather and analyse information without expert help.
This shift has resulted in a plethora of new tools, expertise, and trends, some of which are discussed below.
Some modern data warehouse solutions, including Snowflake, allow data providers to seamlessly share data with users by making it available as a feed. This does away with the need for pipelines, as live data is shared in real time without having to move the data. In this scenario, providers do not have to create APIs or FTP servers to share data, and consumers do not need to create data pipelines to import it. This is especially useful for activities such as data monetisation or company mergers, as well as for sectors such as supply chain. Microsoft’s new unified SaaS offering, Fabric, offers this functionality on an even richer scale.
Organisations that use data lakes to store large sets of structured and semi-structured data are now tending to create traditional data warehouses on top of them, thus generating more value. Known as a data lakehouse, this single platform combines the benefits of data lakes and warehouses. It is able to store unstructured data while providing the functionality of a data warehouse, to create a strategic data storage/management system. In addition to providing a data structure optimised for reporting, the data lakehouse provides a governance and administration layer and captures specific domain-related business rules.
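The layering described above can be illustrated with a minimal Python sketch. It is an illustration only, not how any particular lakehouse product works: the sample records, the field names and the rule that every order must carry a region are all made up for the example. The raw records stay untouched in the "lake", while a structured, typed view with a simple business rule is built on top of them for reporting.

```python
import json
from datetime import date

# Lake layer: raw, semi-structured records kept exactly as they arrived (made-up data).
raw_events = [
    '{"order_id": 1, "region": "ZA", "amount": "199.90", "date": "2023-08-01"}',
    '{"order_id": 2, "region": "za", "amount": 50, "date": "2023-08-02", "note": "gift"}',
    '{"order_id": 3, "amount": "12.00", "date": "2023-08-02"}',
]


def to_warehouse_row(raw: str):
    """Apply a fixed schema and a business rule; return None for rejected rows."""
    record = json.loads(raw)
    if "region" not in record:  # hypothetical rule: every order needs a region
        return None
    return {
        "order_id": int(record["order_id"]),
        "region": record["region"].upper(),
        "amount": float(record["amount"]),
        "order_date": date.fromisoformat(record["date"]),
    }


# Warehouse layer: typed, uniform rows ready for reporting.
orders = [row for row in map(to_warehouse_row, raw_events) if row is not None]

# A simple report over the structured layer: total order value per region.
total_by_region = {}
for row in orders:
    total_by_region[row["region"]] = total_by_region.get(row["region"], 0.0) + row["amount"]
```

The raw list is never modified, so other structured views with different rules could be built over the same records later.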
Data analytics architectures which include data lakes and warehouses often become too complex to maintain. Other challenges include bottlenecks and a lack of domain knowledge on the part of data teams, who may not understand the HR, finance or logistics domains, for example. This tends to make it difficult to meet user requirements on time and within budget.
Data mesh architectures have therefore evolved to ensure alignment with the business and deliver data products more rapidly. Using the data mesh approach, data engineers set up the infrastructure and metadata to govern and catalogue the stored data sets. The domain teams, comprising experts from specific domains within the organisation, then use this infrastructure to create their own data products. The data engineers are freed to focus on building and supporting the infrastructure needed for all the domains to produce and share their data.
The data mesh architecture ultimately results in decentralised ownership with centralised governance, as well as decentralised storage and centralised infrastructure. The approach is particularly suitable for businesses with a data-driven culture, intent on digital transformation.
Microsoft Fabric is ideally positioned for this approach. Its OneLake storage architecture centralises infrastructure and governance while enabling the decentralisation of domain data storage.
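The division of responsibilities in a data mesh can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the API of Microsoft Fabric, OneLake or any other product: the DataProduct and Catalogue classes, their fields and the sample entries are all hypothetical. The data engineers own the catalogue and its metadata rules (centralised governance), while each domain team registers and owns its own data products (decentralised ownership and storage).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """A data set published by a domain team (hypothetical structure)."""
    name: str
    domain: str       # owning domain, e.g. "finance"
    owner: str        # accountable contact in the domain team
    location: str     # where the domain team stores the data
    description: str = ""


@dataclass
class Catalogue:
    """Central catalogue maintained by the data engineers."""
    required_fields: tuple = ("name", "domain", "owner", "location")
    _products: dict = field(default_factory=dict)

    def register(self, product: DataProduct) -> None:
        # Centralised governance: every product must carry the same metadata.
        missing = [f for f in self.required_fields if not getattr(product, f)]
        if missing:
            raise ValueError(f"missing metadata: {', '.join(missing)}")
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"{product.domain}/{product.name} is already registered")
        self._products[key] = product

    def find(self, domain: str) -> list:
        # Any team can discover what a given domain has published.
        return sorted(p.name for (d, _), p in self._products.items() if d == domain)


# Domain teams publish their own products through the shared catalogue.
catalogue = Catalogue()
catalogue.register(DataProduct("monthly_payroll", "hr", "hr-data@example.com", "lake/hr/payroll"))
catalogue.register(DataProduct("general_ledger", "finance", "fin-data@example.com", "lake/finance/ledger"))
```

Calling catalogue.find("hr") lists the products the HR domain has published, and a product registered without an owner is rejected by the shared governance rule.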
Another trend is toward low-code data integration tools. These have gained popularity as they enable the rapid development of data pipelines. Low-code tools are well suited to meeting the increasing demand for rapid delivery amid today’s critical shortage of developer skills. They enable application development using drag-and-drop functionality and visual guidance. More people, including those with no coding experience or knowledge, can contribute. Business analysts and domain experts can create ETL processes and data models while data engineers can focus on complex data pipelines and provide support.
Fivetran and Matillion are among the more common low-code ETL tools. Fivetran offers a multitude of connectors, as well as data models that work with tools such as the open-source data build tool (dbt). Fivetran data models are packaged SQL scripts for popular data source connectors and analytics use cases that can be run in dbt to generate new reports quickly without data engineering overhead.
Python is one of the most popular programming languages today and many find Excel essential to organise, manipulate and analyse data. Until now, however, the two have not worked together easily. In August 2023 Microsoft announced that Python in Excel was now in preview. The new product will make it possible to integrate Python and Excel analytics within the same Excel grid for uninterrupted workflow.
Microsoft Fabric was one of the biggest announcements at Microsoft Build 2023. According to the company, its new offering is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organisations need. It has generated huge hype in the data and analytics space, mainly because it empowers non-technical users to create their own data products. These users benefit from low-code/no-code and the SaaS experience, as well as the fact that Fabric offers all aspects of data warehousing in one product, including analytics development.
The OneLake data lake comes automatically with Fabric. Microsoft claims it improves collaboration and provides a single source of truth for all the organisation’s analytical data, allowing for ease of governance and security controls.
Keyrus’s data engineering experts understand the importance of staying abreast of the major trends and new technologies in the data engineering arena. The consultancy partners with many leading technology providers and has a proven track record of designing sophisticated data engineering solutions for its customers.
Keyrus was recently selected by Vector Logistics as its partner for an ambitious cloud-powered modern analytics solution. The supply chain specialist faced several business challenges, including an outdated business analytics system. Multiple platforms resulted in duplication of effort, resources, and costs, while existing models were either no longer relevant or unable to offer self-service for business users.
Keyrus designed and implemented a seamless, integrated, fit-for-purpose analytics platform to enable self-service and drive a single source of truth. The cloud-powered solution involved a number of Microsoft Azure and Power BI technologies. Keyrus’s data engineers leveraged tabular models and Power BI to enable large-scale models to drive self-service capability.
The solution ensures that sales and route data, pulled from disparate source systems, is initially staged and massaged before being made available to the tabular model. This model consolidates the data from the SQL staging layer and SAP and presents it in the correct star schema format to the Power BI reporting layer. The architecture, therefore, consists of source, staging, semantic, and presentation/reporting layers.
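The flow through these layers can be illustrated with a small, self-contained Python sketch. It is only an analogy: it uses an in-memory SQLite database in place of the SQL staging layer, tabular model and Power BI used in the actual solution, and the table names, columns and sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging layer: rows land here as extracted from the source systems.
    CREATE TABLE stg_sales (customer TEXT, route TEXT, sale_date TEXT, amount REAL);

    -- Semantic layer: a star schema with one fact table and two dimensions.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dim_route (route_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        route_id INTEGER REFERENCES dim_route(route_id),
        sale_date TEXT,
        amount REAL
    );
""")

# Made-up sample rows, including inconsistent casing and stray whitespace.
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)", [
    ("Acme ", "north", "2023-08-01", 120.0),
    ("acme", "North", "2023-08-02", 80.0),
    ("Bravo", "South", "2023-08-01", 50.0),
])

# "Massage" the staged data: trim and normalise the text keys.
conn.execute("UPDATE stg_sales SET customer = upper(trim(customer)), route = upper(trim(route))")

# Load the dimensions first, then the fact table keyed on the dimension ids.
conn.execute("INSERT INTO dim_customer (name) SELECT DISTINCT customer FROM stg_sales")
conn.execute("INSERT INTO dim_route (name) SELECT DISTINCT route FROM stg_sales")
conn.execute("""
    INSERT INTO fact_sales
    SELECT c.customer_id, r.route_id, s.sale_date, s.amount
    FROM stg_sales s
    JOIN dim_customer c ON c.name = s.customer
    JOIN dim_route r ON r.name = s.route
""")

# Presentation layer: the kind of query a report would run.
sales_by_route = conn.execute("""
    SELECT r.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_route r ON r.route_id = f.route_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
```

Because the cleaning happens in staging, the two differently spelled "Acme" rows resolve to a single customer in the dimension table, and the report query only ever touches the star schema.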
The major immediate benefits were a 20% improvement in visibility, a 3.5-hour reduction in manpower requirements each week, self-service, significantly reduced costs, and a much lighter administration load.
In addition, the frequency of reports increased from once a week to daily, and the information is now available immediately via the Power BI mobile app. Vector Logistics now has a consolidated view of its internal and external sales as well as planned versus actual transport-related data, one of the major requirements of this project.
As the solution is completely cloud-based, the organisation no longer needs to maintain on-premises servers, and automation has replaced tedious manual workarounds in Excel spreadsheets.
Through this and other projects, Keyrus showcases its expertise in tackling complex problems and engineering straightforward, effective, and scalable solutions for its customers.
Keyrus is your trusted partner for building sustainable, high-performing data architectures, utilising the latest data engineering tools and strategies. We are ready to help you design an actionable data strategy that will deliver your business objectives and drive commercial success. Contact us at sales@keyrus.co.za.
(1) Source: https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/