1

Problem: Add a new column to a DataFrame and populate with the values of a column from another DataFrame, depending on a condition, in one line of code similar to list comprehensions.

Example code:

I create a DataFrame called df with some pupil information

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 
        'Maricopa', 'Yuma'])

Then a second DataFrame called df_extra which has a string representation of the year:

extra_data = {'year': [2012, 2013, 2014],
       'yr_string': ['twenty twelve','twenty thirteen','twenty fourteen']}
df_extra = pd.DataFrame(extra_data)

Now how to add the values yr_string as a new column to df where the numerical years match in one line of code?

I can easily do this with a couple of for loops, but would really like to know if this is possible to do in one line, similar to list comprehensions?

I have searched questions already on here, but there is nothing discussing adding a new column to an existing DataFrame from another DataFrame based on a condition in one line.

Engineero
  • 12,340
  • 5
  • 53
  • 75
Andy
  • 919
  • 2
  • 9
  • 22

1 Answers1

2

You can merge the dataframe on the year column.

df.merge(df_extra, how='left', on=['year'])
#     name  reports  year        yr_string
# 0  Jason        4  2012    twenty twelve
# 1  Molly       24  2012    twenty twelve
# 2   Tina       31  2013  twenty thirteen
# 3   Jake        2  2014  twenty fourteen
# 4    Amy        3  2014  twenty fourteen

Basically this says "pull the data from df_extra into df anywhere that the year column matches in df". Note this will return a copy, not modify the dataframe in place.

List comprehensions are still Python loops (that might not be totally technically accurate). With the pandas.merge() method, you get to take advantage of the vectorized, optimized backend code that Pandas uses to operate on its dataframes. Should be faster.

Engineero
  • 12,340
  • 5
  • 53
  • 75
  • How would I modify inplace? Assign this merge to the original df or set copy=False? According to the documentation copy=False is not guaranteed to avoid making a copy – Andy Apr 23 '19 at 20:06
  • 1
    I don't think you can do a merge in place in Pandas. You would just assign the output either to a new dataframe or back to `df` and go from there. – Engineero Apr 23 '19 at 21:54
  • Yes have been playing around with this. When I try and assign back to df it is not including the merge. There is an argument 'copy' which doesn't seem to do anything useful. I've created a question about this, as I want to avoid ending up with multiple DataFrames in memory – Andy Apr 23 '19 at 21:57
  • 1
    You can just do something like `merged_df = df.merge(...)` on one line and then `del df` and `del df_extra` after to free up memory. If they're big enough that you can't store that much in memory, then yeah, that's a whole other question :P – Engineero Apr 23 '19 at 22:08