
I have a pandas dataframe where 'Column1' and 'Column2' contain a list of words in every row. I need to create a new column with the number of words that appear in both Column1's list and Column2's list for every row. For example, in a specific row I could have ['apple', 'banana'] in Column1 and ['banana', 'orange'] in Column2, and I need to add a third, new column containing the number 1, since only one word (banana) is in both lists.

I tried to do it like this:

        for index, row in df.iterrows():
            value = len(list(set(row['Column1']) & set(row['Column2'])))
            row['new_column'] = value

But the new column did not appear in the dataframe. I tried a second approach, creating the column first and setting it to 0 and then updating the values like this:

        df['new_column'] = 0
        for index, row in df.iterrows():
            value = len(list(set(row['Column1']) & set(row['Column2'])))
            df.at[index,'new_column'] = value

But this didn't work either: the column was not updated. I tried a third approach using .apply like this:

df['new_column'] = df.apply(lambda x: len(list(set(x['Column1']) & set(x['Column2'])))

And then I got this error:

KeyError: 'Column1'

I don't know why none of this is working, and I don't know any other way to try it. How can I make this work? Thank you!

user14484895
  • How are you changing the calculation formula for different rows? – Rahul Khanna Apr 15 '21 at 10:44
  • Maybe you don't have to iterate. You have to show us a [dummy dataframe](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and your calculation for us to be able to help you. Until then the question is unclear. – anky Apr 15 '21 at 10:45
  • In most cases, there is no need to iterate over the dataframe row by row, as you can call [`.apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html). – albert Apr 15 '21 at 10:46

2 Answers


Your third approach looks good, but there are still two problems:

  1. a syntax error: you forgot one closing parenthesis at the end;
  2. to apply the function to each row, you need to pass axis=1.
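To see where the KeyError in the question comes from, here is a minimal sketch on a small made-up frame (the data and the variable names `error` and `counts` are only for illustration). With the default axis=0, apply hands each column to the function as a Series indexed by the row labels, so looking up 'Column1' inside it fails:

```python
import pandas as pd

# Made-up two-row frame, only for illustration
df = pd.DataFrame({'Column1': [['apple', 'banana'], ['kiwi']],
                   'Column2': [['banana', 'orange'], ['plum']]})

# Default axis=0: x is a whole column, indexed by the row labels 0 and 1,
# so x['Column1'] is not a valid lookup and raises KeyError.
try:
    df.apply(lambda x: len(set(x['Column1']) & set(x['Column2'])))
    error = None
except KeyError as exc:
    error = exc

# axis=1: x is a single row, indexed by the column names.
counts = df.apply(lambda x: len(set(x['Column1']) & set(x['Column2'])), axis=1)

print(repr(error))
print(counts.tolist())
```

The second call returns one count per row, which is what the new column needs.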

Here's some mock data:

import pandas as pd

df = pd.DataFrame({'Column1': [['this', 'is', 'it'],
                               ['apple', 'orange', 'banana']], 
                   'Column2': [['is', 'it', 'so'], 
                               ['orange', 'grape', 'fruit']]})

So this should do the trick:

df['new_column'] = df.apply(lambda x: len(list(set(x['Column1']) & set(x['Column2']))),
                            axis=1)

    Column1                  Column2                 new_column
0   [this, is, it]           [is, it, so]            2
1   [apple, orange, banana]  [orange, grape, fruit]  1

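
As an alternative to apply (not part of the approach above, just another option that may be worth trying), the two columns can be paired up with zip and the counts built in a plain list comprehension. A minimal sketch on the same mock data:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['this', 'is', 'it'],
                               ['apple', 'orange', 'banana']],
                   'Column2': [['is', 'it', 'so'],
                               ['orange', 'grape', 'fruit']]})

# Walk both columns in parallel and count the words the two lists share.
df['new_column'] = [len(set(a) & set(b))
                    for a, b in zip(df['Column1'], df['Column2'])]

print(df['new_column'].tolist())  # [2, 1]
```

This avoids the axis argument altogether, since the comprehension only ever sees one pair of lists at a time.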
Arne

It's a bit confusing, but the problem with your last attempt is that the function was applied along the wrong axis. By default, apply passes each column to the function, not each row.

Adding the parameter axis=1 (or axis='columns') should fix it:

df['new_column'] = df.apply(lambda x: len(list(set(x['Column1']) & set(x['Column2']))), axis=1)

You may see it in the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
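
Regarding the two loop-based attempts in the question, here is a minimal sketch on a small made-up frame (the data is only for illustration). In the first loop, each `row` produced by iterrows is a separate Series, so adding a key to it does not touch the dataframe. The second loop writes through df.at, and on this small example it does fill the column, so if it did not in your case the cause may lie somewhere else in your code:

```python
import pandas as pd

# Made-up two-row frame, only for illustration
df = pd.DataFrame({'Column1': [['apple', 'banana'], ['kiwi']],
                   'Column2': [['banana', 'orange'], ['plum']]})

# First attempt: row is a separate Series, so the assignment is lost.
for index, row in df.iterrows():
    row['new_column'] = len(set(row['Column1']) & set(row['Column2']))
column_created = 'new_column' in df.columns
print(column_created)  # False

# Second attempt: create the column, then write each value with df.at.
df['new_column'] = 0
for index, row in df.iterrows():
    df.at[index, 'new_column'] = len(set(row['Column1']) & set(row['Column2']))
print(df['new_column'].tolist())  # [1, 0]
```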

Nimrod Carmel