Apache Beam: an easy guide

Umar Khan
Published in Analytics Vidhya · 5 min read · Mar 10, 2021


Big Data frameworks are all the rage these days, and for good reason: the amount of data out there that needs to be processed and analyzed is tremendous. So much so that it is often impossible to work with some datasets on our personal computers.

This is where Big Data comes in. The idea is to use multiple computers to process a given dataset. This generally entails spinning up a number of virtual machines on a cloud service like AWS or Azure and then using specific frameworks like Spark or Dask to tackle your data problem. In this way we can harness dozens, sometimes even hundreds or thousands of individual computers (or “nodes”) to tackle our data.

So how does one actually go about using multiple machines in this way? As I mentioned, this typically requires the use of specific frameworks. The most famous among these are MapReduce and Spark. These frameworks are different from doing analysis or ML with traditional Python libraries. They are almost different languages, or at least extensions of the languages we typically use. They also work differently, and we have to adapt to them to make them work. Then comes the not-so-easy step of deploying our code to an actual cluster.

Recently, at my internship, I was trying to analyze a relatively large dataset of news articles. I found that trying to run simple vectorization on it would make my Colab instance collapse. I realized it was time to up my game and enter the world of Big Data.

First I experimented with Spark. Then I heard about something called Dask. Finally, someone recommended Apache Beam. The upside to Apache Beam is that, when used in conjunction with Google’s Dataflow, it takes the cluster-setup aspect out of the equation almost entirely. *Almost* entirely. We still have to pass in some configuration, but we are spared the need to mess with Docker images, manually set up worker nodes, and so on.

Unfortunately, the documentation for Beam is sparse, and what documentation does exist can be hard to figure out. So, having gotten Beam to work for myself, I thought I would write up a post to help others who may find themselves on the same road as me.

There are two primary components to Beam: PCollections and PTransforms. PCollections are the data structures Beam uses, and PTransforms are the transformations applied to those PCollections. In a typical Beam script, we read data into a PCollection, apply transforms to that data, and write the output of each transform into another PCollection.

Let’s look at some sample code:
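Here is a minimal sketch of the kind of pipeline I am describing. The file paths and the final WriteToText step are placeholders, and WordExtractingDoFn is the custom class we will define further down:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Read a Parquet file into a PCollection. Beam infers the schema for us.
    # The path here is just a placeholder.
    data = p | beam.io.parquetio.ReadFromParquet('gs://my-bucket/articles.parquet')

    # Apply our custom DoFn (defined later in this post) to every row.
    scored = data | beam.ParDo(WordExtractingDoFn())

    # Write the results somewhere we can inspect them (placeholder output path).
    scored | beam.io.WriteToText('gs://my-bucket/output/scored')
```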

We begin by importing the beam module. Then, we instantiate a Pipeline object from that module (see the line starting with “with beam.Pipeline”). This pipeline object is where we build the actual pipeline we want to run on our cluster. The main operator in pipelines, as you can see, is the “ | “ operator. It basically tells Beam to take the input on its left and feed it into the transform on its right. The output of that transform is then stored in the variable on the left of the assignment.

So, for instance, in the code above, in the very first line of the pipeline (i.e. after with beam…) we see the line data = p | beam.io.parquetio.ReadFromParquet. Here, we are taking the pipeline object itself and feeding it into the ReadFromParquet transform. This transform reads a Parquet file and converts it into a PCollection. The very first step in a pipeline typically involves feeding a pipeline object into a transform in this fashion. The PCollection generated by this transform is stored as “data”.

In the very next line, we take this “data” variable and pipe it into another transform, specifically a ParDo transform. This is a special type of transform unique to Beam. Although Beam has numerous native transforms such as sum, group, etc., the ParDo transform is a flexible way to apply your own custom transformations to data. ParDo will essentially apply the specified function to every row of the PCollection.

Note that we pass into ParDo something called WordExtractingDoFn. This is a custom function we built for Beam to use. Well, in actuality it’s an object. The way custom functions work in Beam is that we must extend the DoFn class and define a method on that subclass called “process”; that is where the logic of our function is written. Here’s a full example:
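Below is a sketch of what that looks like. The tokenize helper, the term list, the ‘text’ column, and the exact scoring logic are simplified stand-ins for the version I actually used:

```python
import apache_beam as beam

def tokenize(text):
    # Stand-in tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Simplified term list used for the scoring below.
HATE_TERMS = {'hate', 'hates', 'hated'}

class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # 'element' is one row of the PCollection (here, a dict read from
        # Parquet; this sketch assumes the file has a 'text' column).
        tokens = tokenize(element['text'])

        # Build bigrams (and, in the real pipeline, other n-grams) from the tokens.
        bigrams = list(zip(tokens, tokens[1:]))

        # Custom term counting and scoring.
        hate_count = sum(1 for token in tokens if token in HATE_TERMS)

        # Add new "columns" to the row to hold our output.
        element['token_count'] = len(tokens)
        element['bigram_count'] = len(bigrams)
        element['hates_score'] = hate_count / max(len(tokens), 1)

        # Yield the enriched row; it becomes part of the output PCollection.
        yield element
```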

Here you see the method in practice. We define a new class called WordExtractingDoFn, which extends the beam.DoFn class built into Beam. We then define our “process” method, which holds the logic we want to apply to the current row, passed into the method as “element”. In this case, we are using the tokenize function (defined elsewhere, which is fine as long as it is called inside the process method; only then will it actually be applied by Beam). We generate tokens, then various n-grams from those tokens, then we do some custom term counting and scoring. We add new “columns” to the row to hold our output (see element[‘hates_score’], etc.). Then we return the new row (as element), and this row is added to the output PCollection.

One important concept to consider in Beam is “schemas”. Schemas are basically the names of the columns in our PCollection. Oftentimes, we have to specify these schemas by hand. That’s not the case here, because when you read in a Parquet file, Beam will infer a schema. It is not the case with CSV files, however, and there you must specify a schema if you wish to refer to specific fields (i.e. columns) in the row that is passed into your DoFn’s process method. Here is an example of how to create a schema:
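This is a sketch with made-up column names; swap in whatever fields your data actually has:

```python
import pyarrow

# Hand-built schema; the field names and types here are just examples.
article_schema = pyarrow.schema([
    ('id', pyarrow.int64()),
    ('title', pyarrow.string()),
    ('text', pyarrow.string()),
    ('hates_score', pyarrow.float64()),
])

# The schema can then be handed to a transform that needs one, for example:
# records | beam.io.parquetio.WriteToParquet('gs://my-bucket/out', schema=article_schema)
```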

Note that the schema is built using the pyarrow library’s schema method.

So that was a quick primer on how to get started with Apache Beam. Stay tuned for the next post where I will walk you through deploying the pipeline to the cloud!


Just an attorney who wandered into data science and never wanted to leave.