Get To Know — Data Platform Engineer Role at LINE MAN Wongnai

jirawech.s
Life@LINE MAN Wongnai
6 min read · Nov 8, 2023


Data has become an important asset for modern businesses. The more data you can collect, the more insight your business can draw from it.

I am Jirawech Siwawut, a Data Engineering Manager at LMWN. Today I would like to talk about how I came to my current position and give you some idea of the Data Platform Engineer role. This article shows how the data-driven magic happens behind the scenes, and introduces the heroes who build and maintain the infrastructure that makes data-driven decision-making possible.

A bit of history

In 2019, I started my career at LINE Thailand as a Data Scientist. I worked with the LINE MAN team to improve its on-demand application, which covers many services such as food delivery, taxi, and messenger.

My role was similar to that of Data Scientists at any other company: getting to know your data, exploring it to find insights, then turning those insights into business opportunities, or even building an ML model to predict something. This gave me a chance to learn how valuable data can be in a modern business. Without good data, it is difficult for Data Scientists to do their job.

Working as a Data Scientist involves lots of steps and processes such as data preparation, data exploration, and data mining. Here are some common issues I constantly encounter in my day-to-day work:

  • The amount of data is lower than usual. Is something wrong with the data pipeline?
  • Model training is taking forever. Do we have a higher-performance CPU or GPU machine?
  • How do I serve an ML model in real time? Do we have a ready-to-use tool?
  • This query is so slow. It takes forever to run.
  • I cannot run this query; the service seems to be down.

All of the above issues are also major hurdles for other people in the company who need data for business decisions and operations, e.g. BI, Data Analysts, business teams, etc. Someone needs to create reliable and scalable solutions to serve the needs of data users in all these aspects.

Introducing Data Platform Engineer

“Our role is to create a data ecosystem that is reliable, scalable and easy to use for everyone.”

You might have become familiar with the term Data Engineer over the past few years. The term originated with the rise of Big Data and Machine Learning, together with many other roles such as Data Scientist, Machine Learning Engineer, and Data Analyst. Most companies describe each role slightly differently depending on their needs and requirements. Here are some short descriptions.

  • Data Engineer — Focuses on bringing data from sources into centralized data storage, e.g. a data warehouse or data lake, through reliable, quality-controlled processes
  • Data Analyst — Focuses on analyzing data to provide business insights, reports, and visualizations, to improve services, and to create new business opportunities from data
  • Data Scientist — Focuses on a broader set of skills for extracting value from data: not only data analysis but also predictive modeling, machine learning, and the development of data-driven solutions

The position and function of each role in the whole data pipeline can be summarized as follows:

As you can see in the picture, data infrastructure is an important foundation for the whole data pipeline. You will need someone to build and maintain it. This is the job of the Data Platform Engineer.

Data Platform Engineers need to decide which data lake or data warehouse products the whole company should use, which data-processing framework should power the data pipelines, what ML platforms and services to build so the Data Scientist team can serve its models, and which visualization and analytics tools best fit the Data Analyst and BI teams.

If your company uses commercial data platform solutions or managed services from cloud providers such as Databricks, GCP, AWS, or Azure, there may be no need for a Data Platform Engineer role: most of the tedious work and complex components are taken care of by the providers.

This is not the case for LMWN. Since we run on our own private cloud, we need to build and maintain most of our data infrastructure ourselves. This is the main responsibility of a Data Platform Engineer at LMWN.

Turning Point

In the beginning, our data architecture was quite simple because we did not have much data. We used PostgreSQL as our data lake for analysis, Pentaho as our data ingestion tool, and crontab for scheduling. PostgreSQL turned out to be quite troublesome to manage; we went through lots of changes and improvements to make things work, e.g. table partitioning, database indexing, and database replication. Beyond that, our first-generation data infrastructure had other issues:

  • PostgreSQL is hard to scale up or scale out. It is not designed for complex queries over big data.
  • Pentaho is standalone and runs as a single process.
  • Crontab offers no dependency management, retries, or monitoring, so it is not suitable for scheduling or task management at scale.

Our second generation of data architecture is not far from the modern architecture found at many companies these days (a small example follows the list).

  • We migrated from a traditional data warehouse to a data lake built on Hadoop, Hive, and Spark.
  • For data ingestion and processing, we migrated from Pentaho to Apache Spark, running jobs daily or even hourly.
  • The query layer changed from PostgreSQL to Presto/Trino.
  • Crontab was replaced by Apache Airflow, which orchestrates hundreds of data pipelines and thousands of jobs every day.
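
To make the second bullet concrete, here is a minimal sketch of the kind of daily Spark batch job that replaced our Pentaho transformations. All hosts, credentials, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_orders_ingest")
    .enableHiveSupport()  # write into the Hive-metastore-backed lake
    .getOrCreate()
)

# Pull the source table over JDBC (the PostgreSQL driver jar must be
# on the Spark classpath); connection details are hypothetical.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db.example.internal:5432/orders")
    .option("dbtable", "public.orders")
    .option("user", "ingest")
    .option("password", "secret")
    .load()
)

# Partitioning by date lets Hive and Trino prune partitions at query time.
(orders.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("lake.fact_orders"))
```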

There are a couple of tools I want to discuss in more depth. The components below are examples of tools that greatly improved our overall data infrastructure.

Apache Airflow — Pipeline Orchestrator

Using just crontab and shell scripts for data pipelines is very hard to manage. It worked for a few pipelines at the beginning, but things started to fail as the number of pipelines and their complexity increased.

At LMWN, we have been using Apache Airflow as our main pipeline orchestrator for many years. More than a thousand tasks are scheduled, run, and monitored every day to process data across the company. It is much easier to re-run, control, and monitor our data pipelines from the Airflow web UI than from the command line. Airflow is also widely used by many internal teams, including Data Engineers, Data Scientists, BI, and Analytics Engineers.
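
As an illustration, here is a minimal sketch of what one of these scheduled pipelines might look like as an Airflow DAG. The DAG id, task names, and extract/load callables are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical extract/load steps, standing in for real pipeline logic.
def extract_orders():
    print("pulling yesterday's orders from the source database")

def load_to_lake():
    print("writing the extracted records to the data lake")

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # one run per day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_lake", python_callable=load_to_lake)

    extract >> load  # load runs only after extract succeeds
```

The retry settings in default_args are a big part of the appeal: a transient failure re-runs itself instead of paging a human, and anything that still fails shows up in the web UI.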

The role of a Data Platform Engineer is to maintain the Airflow server, improve or upgrade it, and monitor and scale it as needed. If our pipeline system were unreliable and prone to failure every day, how much would that cost the whole company?

As a Data Platform Engineer, there are several key metrics that we need to monitor every day (a monitoring sketch follows the list):

  • Number of workers
  • Memory and CPU resources for each worker
  • Scheduler heartbeats
  • Concurrent tasks
  • Failed tasks
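
For the last of these, failed tasks, one lightweight option is to poll Airflow's stable REST API (available since Airflow 2.0). A minimal sketch, assuming a hypothetical host and credentials and that the basic-auth API backend is enabled:

```python
import requests

AIRFLOW_URL = "http://airflow.example.internal:8080"  # hypothetical host

# "~" is the documented wildcard for dag_id, so this lists recent
# failed DAG runs across every DAG in one call.
resp = requests.get(
    f"{AIRFLOW_URL}/api/v1/dags/~/dagRuns",
    params={"state": "failed", "limit": 20},
    auth=("monitor", "secret"),  # hypothetical credentials
    timeout=10,
)
resp.raise_for_status()

for run in resp.json()["dag_runs"]:
    print(run["dag_id"], run["execution_date"], run["state"])
```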

Trino — Query Engine

Sometimes our users cannot run their queries in PostgreSQL due to the size of the data. Trino was introduced to our infrastructure to solve this problem. Trino distributes query execution across multiple workers, allowing it to scale horizontally and letting us run any query regardless of the size of the data.
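
From a user's point of view, querying the lake through Trino looks just like querying any SQL database. A minimal sketch using the trino Python client; the host, catalog, and table names are hypothetical:

```python
import trino

# Hypothetical connection details for illustration.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

# The coordinator splits this scan across many workers, so the same
# query works whether the table holds thousands of rows or billions.
cur.execute("""
    SELECT order_date, count(*) AS orders
    FROM fact_orders
    GROUP BY order_date
    ORDER BY order_date DESC
    LIMIT 7
""")
for row in cur.fetchall():
    print(row)
```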

As a Data Platform Engineer, there are several key metrics that we need to monitor every day (a monitoring sketch follows the list):

  • Query execution time
  • Query queue time
  • Resource usage, both CPU and memory
  • Network bandwidth
  • Concurrent queries
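
Conveniently, Trino exposes much of this through its built-in system catalog, so the cluster can be monitored with plain SQL. A small sketch, again with a hypothetical host:

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical host
    port=8080,
    user="monitor",
)
cur = conn.cursor()

# system.runtime.queries lists every query the cluster currently knows
# about, so grouping by state gives a quick view of concurrency and failures.
cur.execute("""
    SELECT state, count(*) AS num_queries
    FROM system.runtime.queries
    GROUP BY state
""")
for state, num_queries in cur.fetchall():
    print(state, num_queries)
```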

Summary

A Data Platform Engineer is a combination of SRE and DevOps for the data team. Our duty is to create a data ecosystem that is reliable, scalable, and easy to use for everyone. We need your help to build a better data platform together.

Join us

  • If you are a Software Engineer who wants to explore new challenges in data platform development.
  • If you are a Data Engineer or Data Scientist who wants to expand your knowledge and create a better data ecosystem.
  • If you are a DevOps engineer, SRE, or Architect who wants to help us improve our data architecture and infrastructure together.

For our Data Platform Engineer role, click here to apply.
Or check out our Data and Analytics opening positions at https://careers.lmwn.com.
