Multiple column output from row wise operations.

Umar Khan
Analytics Vidhya
Published in
4 min readFeb 23, 2021

--

I want to talk about a scenario that occurs fairly often when doing any significant amount of data engineering and analysis in Pandas.

We all know that one of the big advantages of Pandas is the ability to broadcast functions to entire data arrays and easily transform large amounts of data. In addition to built in functions like mean(),sum()etc, pandas allows you to apply a custom function to an entire array in just a few lines of code. This is done by means of the apply() method to DataFrames.

We have all generated new pandas columns by applying a function to an existing column. The scenario I found myself in more than once now is wanting to generate more than one column from a function. This way, I can have multiple calculations or manipulations done in one pass.

Let me give a specific example. Recently, I found myself working on a project involving evaluating “news” articles to determine their accuracy, political bent and whether or not they contained hate speech. Focusing on the hate speech, I used a CountVectorizer to tabulate the number of known hateful words and phrases in each article in my dataset.

If you recall, CountVectorizer generates a matrix of terms and the count of their occurences in each body of text fed into it (which you can then concatenate back onto the original dataset). I wanted to then do a few things. I wanted to multiply the number of hits for each phrase by a known probability of it being associated with hate speech and add these up to create a “hate score”. I wanted a total count of hits as well, and also a dictionary containing each phrase found and the number of times it was found.

My final dataset would then have three new columns: “hate_score”, “hate_counts” and “hate_terms”. The best way to do this was to use the apply function on the dataframe and have it go row by row. The best way I found to do this was to use a lambda function and a custom function in the apply method.

This is what the apply function looked like:

df_vectors = df_vectors.apply(lambda row:tally_counts_doc(row),axis=1,result_type=’expand’)

Here is how this works. If you pass a lambda function into apply(), it will apply it to the rows one by one. What gets passed into the lambda function is the entire row, and it gets passed in as a pandas Series. The index of this series is the column names from the original table and the values are the values in the row in each of those columns. You can then proceed to slice, read and modify this series. Note that we had to specify the “axis” argument in the apply method as “1”. This is to ensure it goes through row by row. If we wanted to iterate through columns instead, we could specify axis as “0” and this would return the column as a series.

I have found this to be a powerful feature. Especially in conjunction with custom functions that operate on the row-Series. So for instance in the code above, the lambda function passes the row ( as a Series) into the “tally_counts_doc”. Lets take a look at this function:

def tally_counts_doc(row):

row2 = row[row > 0]

score = 0

hits = {}

for index,val in row2.items():

hits[index]=val

hit = dict_hateweights[index] * val

score += hit

row[‘hate_score’] = score

row[‘hate_hits’] = hits

return row

This function, as we would expect, takes as input a row (again, in the form of a series). Recall that this dataframe is made up of the original articles and the matrix of phrase count vectors. Most of the columns are thus 0 or null, after all we got this section of the DF from a sparse matrix. We are only interested in those columns where the value is more than 0. Thanks to the lambda function, this is easy since we can simply get the entire row as a series and then simply filter it with basic Series filtering syntax (row2 = row[row > 0]).

Once we get the row as a series though, recall that we wanted to do multiple operations on it and get multiple outputs, each of which we wanted to add as a second column. Thanks to the use the lambda function in the apply method, this is as easy as simply adding a new item to the row-Series. A new row to the Series that represents the original row, if you will, i.e.:

row[‘hate_score’] = score

row[‘hate_hits’] = hits

Once the function has performed its operations on the row and added “columns” to it (really rows to the series representing the original row), it can simply return the new row back to the lambda function, which then passes it back to the original dataframe. Another note here; in addition to the axis argument in the apply method, you must also specify the keyword argument “result_type” as “expanded”. This gives Pandas a heads up that its about to get a row with new columns and need to add them to the original dataframe.

Easy. In this way you can run multiple operations on a row in one go instead of repeatedly loading it into memory. Hopefully this should make for faster and cleaner code.

I hope you found this useful!

--

--

Umar Khan
Analytics Vidhya

Just an attorney who wandered into data science and never wanted to leave.