
I have a pandas data frame like the one below:

A   B      C
2  [4,3,9] 1
6  [4,8]   2
3  [3,9,4] 3

My goal is to loop through the data frame and compare the values in column B. If two rows contain the same elements in column B (in any order), then column C should be updated to the same number, as shown below:

A   B      C
2  [4,3,9] 1
6  [4,8]   2
3  [3,9,4] 1

I tried with the code below:

for i, j in df.iterrows():
  if len(df['B'][i]  ==len(df['B'][j] & collections.Counter(df['B'][i]==collections.Counter(df['B'][j])
     df['C'][j]==df['C'][i]
  else:
     df['C'][j]==df['C'][j]

I got the error message `unhashable type: 'list'`.

Does anyone know what causes this error, and is there a better way to do this? Thank you for your help!

OneCricketeer
Wendy D.
  • What are you expecting `df['C'][j]==df['C'][j]` to do? That's always True. And your if statement has mismatched parentheses... In general, looping over a dataframe is often the wrong approach – OneCricketeer Mar 02 '20 at 02:38

3 Answers


Not quite sure about the efficiency of the code, but it gets the job done:

# Maps the C value of each distinct row seen so far to its B list.
uniqueRows = {}

for index, row in df.iterrows():
    duplicateFound = False
    for c_value, uniqueRow in uniqueRows.items():
        # A match was already found for this row, so skip the rest.
        if duplicateFound:
            continue
        # Compare lengths first to avoid unnecessary set operations.
        if len(row['B']) == len(uniqueRow):
            if len(list(set(row['B']) - set(uniqueRow))) == 0:
                print(c_value)
                df.at[index, 'C'] = c_value
                duplicateFound = True

    # No match: remember this row as a new distinct entry.
    if not duplicateFound:
        uniqueRows[row['C']] = row['B']

print(df)
print(uniqueRows)

This code first loops over your dataframe. It keeps a duplicateFound boolean for each row that is used later.

For each row it loops over the uniqueRows dict and first checks whether a duplicate has already been found. If so, it skips the remaining comparisons, because they are no longer needed.

Otherwise it compares the lengths of the two lists to skip some comparisons, and if the lengths match it evaluates `list(set(row['B']) - set(uniqueRow))`. This returns a list of the elements of row['B'] that are missing from uniqueRow, and an empty list when there are no differences.

So if the list is empty, it sets the value in the C column at this position using the pandas `DataFrame.at` accessor (this has to be used when updating a dataframe while iterating over it). It then sets the duplicateFound variable to True to prevent further comparisons. If no duplicate was found, duplicateFound is still False, which triggers the addition to the uniqueRows dict at the end of the for loop before moving on to the next row.
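One caveat with the set difference above: sets discard repeated elements and the subtraction only checks one direction, so two lists of equal length such as [4, 4, 3] and [4, 3, 9] would be treated as duplicates. Below is a minimal sketch of a stricter comparison, assuming order should be ignored but repeated elements should count. The helper names `same_elements` and `set_difference_check` are hypothetical and only serve to contrast the two checks.

```python
from collections import Counter


def same_elements(left, right):
    """Return True when both lists hold the same items with the same counts."""
    return Counter(left) == Counter(right)


def set_difference_check(left, right):
    """The length-plus-set-difference test, written as a function."""
    return len(left) == len(right) and len(set(left) - set(right)) == 0


print(same_elements([4, 3, 9], [3, 9, 4]))         # True
print(set_difference_check([4, 4, 3], [4, 3, 9]))  # True (false positive)
print(same_elements([4, 4, 3], [4, 3, 9]))         # False
```

If the lists in column B never contain repeated values, both checks should agree, and the original comparison is sufficient.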

In case you have any comments or improvements to my code feel free to discuss and hope this code helps you with your project!

Timo Frionnet

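Create a temporary column by applying sorted to each entry in the B column and joining the result into a string; group by the temporary column to find the matching rows, then drop the temporary column.

# Convert each element to str first: str.join raises a TypeError on integers.
df1['B_temp'] = df1.B.apply(lambda x: ','.join(map(str, sorted(x))))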

df1['C'] = df1.groupby('B_temp').C.transform('min')

df1 = df1.drop('B_temp', axis = 1)

df1

    A      B        C
0   2   [4, 3, 9]   1
1   6   [4, 8]      2
2   3   [3, 9, 4]   1
sammywemmy

Because lists are not hashable, convert the lists to sorted tuples and take the first C value of each group with GroupBy.transform and 'first':

df['C'] = df.groupby(df.B.apply(lambda x: tuple(sorted(x)))).C.transform('first')
print (df)
   A          B  C
0  2  [4, 3, 9]  1
1  6     [4, 8]  2
2  3  [3, 9, 4]  1

Detail:

print (df.B.apply(lambda x: tuple(sorted(x))))
0    (3, 4, 9)
1       (4, 8)
2    (3, 4, 9)
Name: B, dtype: object
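
As a rough illustration of why the conversion matters, the plain-Python sketch below (no pandas required) shows that a list cannot be hashed while a sorted tuple can, and then mimics the grouping by keeping the first C value seen for each key. The helper `canonical_key` and the `rows` literal are illustrative assumptions that mirror the question's data; they are not part of the answer's code.

```python
def canonical_key(values):
    """Order-insensitive, hashable key for a list of sortable items."""
    return tuple(sorted(values))


# Lists are mutable, so Python refuses to hash them.
try:
    hash([4, 3, 9])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# (A, B, C) rows copied from the question.
rows = [(2, [4, 3, 9], 1), (6, [4, 8], 2), (3, [3, 9, 4], 3)]

first_c = {}  # canonical key -> first C value seen for that key
new_c = []
for _a, b, c in rows:
    key = canonical_key(b)
    first_c.setdefault(key, c)  # only stores c the first time key appears
    new_c.append(first_c[key])

print(list_is_hashable)  # False
print(new_c)             # [1, 2, 1]
```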
jezrael