What is data engineering?

There’s a lot of confusion about what data engineering is and what data engineers do. While aspects of data engineering have been present ever since businesses began analyzing and reporting data, it gained significant attention with the emergence of data science in the 2010s.

Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists—data engineers—to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists.
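
To make this concrete, here is a minimal, hypothetical sketch of the kind of pipeline a data engineer sets up and operates: extract raw records, transform them into a usable shape, and load them somewhere analysts can query. The file name, table schema, and field names are invented for illustration:

```python
import csv
import sqlite3

def extract(path):
    """Read raw order records from a CSV export (hypothetical schema)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Clean records so they are usable downstream."""
    for row in rows:
        # Skip malformed rows rather than letting them poison later analysis.
        if not row.get("order_id") or not row.get("amount"):
            continue
        yield (row["order_id"], row["customer_id"], float(row["amount"]))

def load(records, db_path="warehouse.db"):
    """Load cleaned records into a table that analysts can query with SQL."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("raw_orders.csv")))
```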

From: Data Engineering and Its Main Concepts by AlexSoft

Evolution of the Data Engineer

Understanding data engineering today and tomorrow requires context on how the field evolved.

The early days

Data engineering can trace its origins to the 1970s, while data warehousing emerged in the late 1980s. Bill Inmon, often hailed as the pioneer of data warehousing, introduced the term between 1989 and 1990. Simultaneously, IBM's Barry Devlin and Paul Murphy conceptualized the "business data warehouse." Additionally, IBM engineers were behind the development of the relational database and Structured Query Language (SQL), with Oracle playing a crucial role in bringing the technology to the forefront.

As data infrastructures expanded, businesses required tools and data pipelines for reporting and business intelligence, commonly referred to as "BI." To assist enterprises in accurately shaping their business logic within the data warehouse, both Ralph Kimball and Bill Inmon formulated their distinct data-modeling methods and strategies.
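
To give a flavor of what such data modeling looks like, here is a tiny, hypothetical star schema in the dimensional style associated with Kimball: a central fact table of business events joined to descriptive dimension tables. The schema and data are invented for illustration, using Python's built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension table: descriptive context for the facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table: one row per measurable business event, keyed to its dimensions.
CREATE TABLE fact_sales (product_id INTEGER, sold_on TEXT, amount REAL);
""")
con.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
con.execute("INSERT INTO fact_sales VALUES (1, '2024-01-15', 19.99)")

# A typical BI-style query: aggregate the facts, sliced by a dimension attribute.
for row in con.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""):
    print(row)  # ('Hardware', 19.99)
```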

Data warehouse and BI engineering were precursors to today's data engineering, and their techniques are still widely used today.

Around the mid-1990s, the internet's rapid rise in popularity gave birth to a new wave of web companies like Google, Yahoo, and Amazon. The dot-com surge resulted in an influx of data activity across web applications and backend systems. Many of these systems were costly, bulky, and came with hefty licenses. It's probable that the vendors of these backend systems hadn't anticipated the sheer volume of data that would be generated by web applications.

The early 2000s and the birth of modern data engineering

In the aftermath of the dot-com bubble burst in the early 2000s, only a handful of companies emerged unscathed. Among these survivors, companies like Google and Amazon evolved into global tech powerhouses. Still, for a time, they continued using conventional relational databases and data warehouses, stretching these systems to their breaking point. A new approach was needed, one that could handle data growth while remaining cost-effective, scalable, available, and reliable.

Affordable commodity computing hardware became widely available to the masses. Groundbreaking advancements enabled distributed computation and storage across extensive computing clusters on an unprecedented scale. These developments led to the fragmentation and decentralization of what were once monolithic services. The “big data” era had begun.

In 2003, Google published a paper detailing the Google File System, a filesystem designed for reliable and efficient data access using vast clusters of affordable commodity hardware. Not long after, in 2004, Google published a paper on MapReduce, a highly scalable data-processing model tailored for commodity hardware. While the concept of big data had predecessors in data warehouses and in data management for experimental physics endeavors, Google's revelations acted as a pivotal moment for data technologies. This "big bang" effect paved the way for new open-source big data tools, laying the foundation for modern data engineering practices as we know them today.
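
The MapReduce model itself is easy to sketch: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Google's contribution was running these steps reliably across thousands of commodity machines; the toy, single-process word count below only illustrates the programming model:

```python
from collections import defaultdict

def map_step(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: aggregate the values collected for one key."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = (pair for doc in documents for pair in map_step(doc))
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```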

The 2000s and 2010s: Big data and streaming data

Open-source big data tools swiftly evolved, extending their reach from Silicon Valley to global tech firms. These tools leveled the playing field, allowing any enterprise to leverage the same data utilities employed by leading tech giants. Subsequently, the shift from batch computing to event streaming marked another transformative moment, creating a new age of big "real-time" data.
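
The batch-versus-streaming distinction can be sketched in a few lines: a batch job recomputes a result over a complete dataset on a schedule, while a streaming job updates the result incrementally as each event arrives. A toy illustration, with made-up events:

```python
# Batch: recompute the total over the whole dataset on each scheduled run.
def batch_total(events):
    return sum(e["amount"] for e in events)

# Streaming: maintain a running total, updated per event as it arrives.
class StreamingTotal:
    def __init__(self):
        self.total = 0.0

    def on_event(self, event):
        self.total += event["amount"]
        return self.total

events = [{"amount": 10.0}, {"amount": 2.5}, {"amount": 7.5}]
print(batch_total(events))       # 20.0, computed once over all events

stream = StreamingTotal()
for e in events:                 # in reality, events arrive continuously
    print(stream.on_event(e))    # 10.0, 12.5, 20.0: updated in real time
```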

Engineers could choose from a huge array of new technologies arriving on the data engineering scene. Today, data is moving faster than ever and growing ever larger.

As "big data" gained traction and businesses eagerly adopted the trend, it wasn't without repercussions. Big data captured the imagination of a business trying to make sense of the ever-growing volumes of data. Also big data vendors marketed selling big data tools and services while over-promising their potential. This overemphasis led many businesses to deploy big data solutions for relatively minor data challenges. Instances where big data platforms processed merely a few gigabytes of data weren't rare. The buzz around "big data" eventually waned. However, big data didn't vanish; rather, it underwent a simplification.

The 2020s and engineering of the data lifecycle

The role of data engineering is rapidly transforming. It's now a discipline centered around integrating diverse technologies to achieve business objectives.

Over recent years, thanks to significant abstraction and simplification, data engineers are no longer tied down by outdated big data frameworks. While they retain expertise in fundamental data programming and employ it when necessary, their role has shifted toward more critical aspects of the value chain: security, data management, data operations, data architecture, orchestration, and overall data lifecycle management.

There is a noticeable shift in the attitudes of data engineers and companies. New projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality. Beyond engineering data pipelines, data engineers now concern themselves with privacy, anonymization, data garbage collection, and compliance with regulations.
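
As a small illustration of the privacy side of this work, here is a hypothetical sketch that pseudonymizes a personally identifying field with a salted hash before the record moves further down the pipeline. The field names and salting scheme are assumptions, and this is nowhere near a complete compliance solution:

```python
import hashlib

# The salt would come from a secret store in practice; hardcoded for the sketch.
SALT = b"replace-with-secret-salt"

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible token."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def scrub(record: dict) -> dict:
    """Drop or pseudonymize PII fields before the record leaves the pipeline."""
    clean = dict(record)
    clean["email"] = pseudonymize(record["email"])  # stable join key, no raw PII
    clean.pop("full_name", None)                    # not needed downstream: drop it
    return clean

print(scrub({"email": "ada@example.com", "full_name": "Ada Lovelace", "order_id": "42"}))
```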

Data management, including data quality and governance, was common for large enterprises in the pre-big-data era, but it wasn't as widely adopted in smaller companies. Now technologists and entrepreneurs have shifted focus back to these enterprise concerns, with an emphasis on decentralization and agility.

Data Engineering and Data Science

What is the difference between the two? Data engineering is not a subdiscipline of data science, but rather a separate discipline in its own right. However, the two complement each other.

[Figure: data engineering is upstream, and data science is downstream]

As the figure shows, they are different. Data engineers, working upstream, provide the inputs; data scientists, working downstream, convert the data that reaches them into something useful.

Many data scientists love to build and tune machine learning models, but the reality is that roughly 70% to 80% of their time is spent working on the bottom three layers of the following image:

[Figure: The data science hierarchy of needs]

Most data scientists only spend a fraction of their time on analytics and machine learning. Any business needs to focus on building a solid data foundation before moving on to work with artificial intelligence and machine learning.

Data scientists aren't usually trained to engineer production-grade data systems. But sometimes they end up doing this work anyway, because they lack the support and resources of a data engineer. Ideally, data scientists should spend most of their time focused on the top layers of the pyramid: analytics, experimentation, and machine learning. When data engineers focus on the bottom layers of the hierarchy, they build a solid foundation for data scientists to succeed.

We think data engineering is of equal importance to data science; data engineers play an extremely important role alongside data scientists in making a business successful.

Data Maturity

The level of data complexity within a company depends on the company's data maturity. This impacts the day-to-day jobs and career progression of both data engineers and data scientists.

Data maturity is the progression toward more effective data utilization and capabilities, and toward integration of data across the organization. Big, well-established companies can be outcompeted by early-stage startups, because what matters is how well data is leveraged as a competitive advantage.

A company that has just started working with data is, by definition, in the early stages of its data maturity. The company may have fuzzy, loosely defined goals, or perhaps no goals at all. The data team is small, often with a single-digit headcount. At this stage, it's typical for the data engineers and/or data scientists to be generalists playing several roles.

Companies at this stage may have some success with artificial intelligence, but it is rare. Without a solid data foundation, they most likely won't have the data to train reliable models, nor the knowledge to deploy those models into business decisions.

A data-driven company, however, is far more autonomous and empowers its people and its business. There is a solid foundation for introducing and taking advantage of new data sources, generated data is available in real time to everyone in the organization, collaboration is more efficient, and the company can move forward at a higher pace.

Understanding Data Engineering Types

  1. The first type of data engineering is SQL-focused: data is stored in tables of rows and columns, much like a spreadsheet, and the typical processing is done with a relational database (a minimal sketch follows this list).
  2. The second type of data engineering is big-data-focused: analyzing huge amounts of data that often require cleanup, interpolation, extrapolation, time windows, and so on. Processing tools for this type include frameworks like MapReduce, Spark, and Flink.
  3. The third type of data engineering is multi-dimensional engineering of relations. The depth of relationships can be massive, and difficult to query and contextualize in a typical relational database. The tool used for processing this data is a graph database.
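
To make the first type concrete, here is a minimal sketch using Python's built-in sqlite3 module: the data lives in a table of rows and columns, and the processing is a declarative SQL query. The table and data are made up for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory relational database
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# Typical type-1 processing: a declarative SQL aggregation over rows and columns.
for region, total in con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)  # north 165.5 / south 80.0
```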

Conclusion

Data engineering is a profession that is becoming more and more popular, even though it has existed for quite some time. Data engineering is all about the movement, transformation, and management of data. The growth in data processing has led to major advancements in data technology and data jobs. Presently, data engineers are among the most sought-after professionals in the tech sector, with demand increasing significantly every year.

Though the data engineer title isn’t always as attractive as a data scientist title, the work that a data engineer does actually makes up a significant portion of the work carried out by a data scientist.

Understandably, topics like machine learning and artificial intelligence are always going to win a popularity contest. However, a good chunk of the work that sits behind these concepts stems from data engineering work, such as data mining, data manipulation and cleaning, text manipulation, aggregation, joining data to other data sets, and even building data streaming pipelines and machine learning pipelines. All of these skills are essential to a data-driven company and are necessary when building machine learning models. Data engineering skills are less common among data scientists, whose backgrounds are often mathematical. This is why data engineers and data scientists together are a powerful, complementary combination.
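
A tiny, hypothetical example of the cleaning-and-joining step that precedes any modeling, using pandas with made-up datasets:

```python
import pandas as pd

# Hypothetical raw inputs: user events and a reference table of user attributes.
events = pd.DataFrame({
    "user_id": [1, 2, 2, None, 3],
    "action":  ["click", "buy", "click", "click", "buy"],
})
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["NO", "SE", "NO"],
})

# Cleaning: drop rows missing the join key, then join to enrich with user data.
clean = events.dropna(subset=["user_id"]).astype({"user_id": "int64"})
enriched = clean.merge(users, on="user_id", how="left")

# Aggregation: the kind of feature a model downstream would actually consume.
features = enriched.groupby(["country", "action"]).size().rename("count")
print(features)
```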

Article sources: Fundamentals of Data Engineering by Joe Reis and Matt Housley
Wikipedia
Data engineering today