Comparing & matching comma separated strings from different dataframes Python pandas

Question

I want to compare column values from two different data frames, if both values match, insert a new Column in the data frame and add a value in the new column.

DF One:

      col name    col item  
 0     jo          abc    
 1     bo          efg   
 2     mo          xyz

DF Two:

      col name   col item          col code
 0     jo         abc , xyz          123
 1     bo         efg , xyz , zyx    456
 2     mo         abc,  xyz          789

I'm trying to compare the col_name and col_item values in DF One to the values in DF Two. If the name and item from DF One match values from DF Two, pull the corresponding col code value from DF Two into DF One. Each code number is set to a name and item. The results should look like this

DF One: End Result

       col_name   col_item     new_col
   0    jo         abc           123
   1    bo         efg           789
   2    mo         xyz           456

Here is the code I've been trying. My logic is if name and item from DF Two match name and item from DF One, pull col code value from DF Two into DF One

if df_two[' col name '] == df_one['col name '] & df_two [' col item '] == df_one[' col item ']:
    df_one['new col'] = df_two['col code']

df_one.head(5).to_dict

{'Date': {0: '8/24/2021',
   1: '8/17/2021',
   2: '8/19/2021',
   3: '8/19/2021',
   4: '8/19/2021'},
  'ID Number': {0: 123213,
   1: 4564564',
   2: '789789',
   3: '735764',
   4: '1010101'},
  'col name': {0: 'mo',
   1: 'bo',
   2: 'jo,LLP',
   3: 'do, LLP',
   4: 'to, LLP'},
  'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
  'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
  'col item': {0: ' abc - bc - zxy ',
   1: ' cba - abc – zx ',
   2: ' hij ',
   3: ' lmn - op – xyz',
   4: ' lmn - ac – mno'},
  'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}

df_two.head(5).to_dict

{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
  'col name': {0: 'mo',
   1: 'bo',
   2: 'jo,LLP',
   3: 'do, LLP',
   4: 'to, LLP'},
  'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
  'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba’,
   1: cba - abc – zx, lmn - op - xyz',
   2: 'hij , qrx',
   3: ' lmn - op – xyz, abc’,
   4: 'lmn - ac – mno’}}

Should the end result be only 2 rows ? `bo efg` in `df_one` is not found in `df_two` if you are to match both columns — SeaBean, Sep 09 '21 at 20:07
There should be 3 rows. Looks like I messed up the `col_item` values when I was making the example data frames. I will update the post. — tibaH_lluN, Sep 09 '21 at 20:20
Updated my solution output according to your sample data update. Now show 3 rows. — SeaBean, Sep 09 '21 at 20:26
Please also check the expected values of `new_col` for the last 2 rows. Are they swapped ? `bo efg` should match 789, right ? — SeaBean, Sep 09 '21 at 20:26
Correct. Yes they were switched. The order shouldn't matter just need the `col_code` values to be returned for the name and item that match in each DF. The codes are assigned to each individual name and item. — tibaH_lluN, Sep 09 '21 at 20:41

SeaBean · Accepted Answer · 2021-09-11T12:34:52.910

1

You can try to split the string in col_item in df_two by str.split() into lists, then use .explode() to create rows with each item in the list in one row. Then, we .merge() with df_one on the 2 columns. Finally, rename the column col_code to new_col by .rename(), as follows:

df2a = df_two.assign(col_item=df_two['col_item'].str.split(r'\s*,\s*')).explode('col_item')

(df_one.merge(df2a, on=['col_name', 'col_item'])
       .rename({'col_code': 'new_col'}, axis=1)
)

Result:

  col_name col_item  new_col
0       jo      abc      123
1       bo      efg      789
2       mo      xyz      456

Edit

Here is a version without using .assign():

df2a = df_two.copy()
df2a['col_item'] = df2a['col_item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col_item')

(df_one.merge(df2a, on=['col_name', 'col_item'])
       .rename({'col_code': 'new_col'}, axis=1)
)

Edit 2

As revealed from your dataset dumps by to_dict(), your strings in column col item in source dataframes df_one and df_two actually contain some leading and/or trailing spaces. This is the reason why some strings from the 2 dataframes cannot be successfully matched.

E.g. in df_one, you have string ' abc - bc - zxy ' (with spaces at both ends), while from df_two, you have string ' abc - bc - zxy, lmn - ac – mno, cba' (the first part before comma has only one space before it but no space before the comma). Hence, we can't match these strings without stripping the extra white spaces at both ends.

Here's the revised codes added with codes to strip the extra white spaces at both ends for column col item for both dataframes:

df2a = df_two.copy()
df2a['col item'] = df2a['col item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col item')
df2a['col item'] = df2a['col item'].str.strip()   # strip extra white spaces at both ends 

df1a = df_one.copy()
df1a['col item'] = df1a['col item'].str.strip()   # strip extra white spaces at both ends 

df_out = (df1a.merge(df2a, on=['col name', 'col item'])
              .rename({'col code': 'new_col'}, axis=1)
         )

Data Input

(With rectification of typos in your to_dict() data dumps)

d1 = {'Date': {0: '8/24/2021', 1: '8/17/2021', 2: '8/19/2021', 3: '8/19/2021', 4: '8/19/2021'},
  'ID Number': {0: '123213', 1: '4564564', 2: '789789', 3: '735764', 4: '1010101'},
  'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'},
  'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
  'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
  'col item': {0: ' abc - bc - zxy ', 
               1: ' cba - abc – zx ', 
               2: ' hij ', 
               3: ' lmn - op – xyz', 
               4: ' lmn - ac – mno'},
  'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}
df_one = pd.DataFrame(d1)

        Date ID Number col name  Amount  Tax          col item BBNumber
0  8/24/2021    123213       mo  900.00  0.0   abc - bc - zxy        30
1  8/17/2021   4564564       bo  105.00  0.0   cba - abc – zx        30
2  8/19/2021    789789   jo,LLP    1.02  0.0              hij        30
3  8/19/2021    735764  do, LLP  132.00  0.0    lmn - op – xyz       30
4  8/19/2021   1010101  to, LLP    8.00  0.0    lmn - ac – mno       30

d2 = {'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
  'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'}, 
  'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
  'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba', 
               1: 'cba - abc – zx, lmn - op - xyz', 
               2: 'hij , qrx', 
               3: ' lmn - op – xyz, abc', 
               4: 'lmn - ac – mno'}}
df_two = pd.DataFrame(d2)

   Unnamed: 0 col name  col code                              col item
0           0       mo       123   abc - bc - zxy, lmn - ac – mno, cba
1           1       bo       456        cba - abc – zx, lmn - op - xyz
2           2   jo,LLP       789                             hij , qrx
3           3  do, LLP       987                   lmn - op – xyz, abc
4           4  to, LLP       654                        lmn - ac – mno

Note that there is already a column Unnamed: 0 in your source dataframe. It is NOT introduced by running the solution codes.

Output

print(df_out)


        Date ID Number col name  Amount  Tax        col item BBNumber  Unnamed: 0  new_col
0  8/24/2021    123213       mo  900.00  0.0  abc - bc - zxy       30           0      123
1  8/17/2021   4564564       bo  105.00  0.0  cba - abc – zx       30           1      456
2  8/19/2021    789789   jo,LLP    1.02  0.0             hij       30           2      789
3  8/19/2021    735764  do, LLP  132.00  0.0  lmn - op – xyz       30           3      987
4  8/19/2021   1010101  to, LLP    8.00  0.0  lmn - ac – mno       30           4      654

You can also extract only some of the result columns, for example:

df_out2 = df_out[['col name', 'col item', 'new_col']]

Result:

print(df_out2)

  col name        col item  new_col
0       mo  abc - bc - zxy      123
1       bo  cba - abc – zx      456
2   jo,LLP             hij      789
3  do, LLP  lmn - op – xyz      987
4  to, LLP  lmn - ac – mno      654

edited Sep 11 '21 at 12:34

answered Sep 09 '21 at 20:06

SeaBean

22,547
3
13
25

This seems really close. Having issues with `(col_item=df_two['col_item']` section. Its throwing syntax error. – tibaH_lluN Sep 09 '21 at 20:47
@tibaH_lluN Do you use the exact codes or modify it ? Is that your column name for `col_item` actually contains spaces ? – SeaBean Sep 09 '21 at 20:49
@tibaH_lluN See my edit above. Added a version without using `.assign()`. Please retry. Previously used `.assign()` because don't want to change your original data in `df_two`. – SeaBean Sep 09 '21 at 20:57
@tibaH_lluN Ok, just as I guessed. You can do it with the edited version without problem. – SeaBean Sep 09 '21 at 21:00
@tibaH_lluN How's the edited version ? Work for you ? – SeaBean Sep 09 '21 at 21:20
The edited version works alot better! The issue now is the merge seemed to double the objects in the DF and it created a random column labeled `Unnamed: 0` not sure what its referencing. However, the codes did populate in `new_col` – tibaH_lluN Sep 10 '21 at 14:14
@tibaH_lluN What do you mean by "double the objects in the DF" ? The case seems to be owing to your sample data does not really resemble your true data. Would you please update your sample data to better represent the structure of your real data. You can post your minimal data by `df_one.head(5).to_dict()`. Similarly for `df_two`. Posting the data is necessary for us to help you, since now your hand crafted data does not really represent your true data and resulted in unexpected result. With your sample data, the code should run well. You can also run my codes with your sample data. – SeaBean Sep 10 '21 at 15:32
Have updated the the code to align more with actual data set. Also the wording of my question, hopefully that helps explain what I'm trying to do better. – tibaH_lluN Sep 10 '21 at 16:03
@tibaH_lluN Your edited sample data mainly only have the column label changes. It will not explain why there are so called `"double the objects in the DF"` and explain why there is extra column `Unnamed: 0` coming out. You can just amend my codes with the sample data to reflect the changes on the column labels and **run on your sample data** to see the result should be fine without the problem you described. The only changes would be just column label changes. We can't help you further if you don't show your real data by e.g. `df_one.head(5).to_dict()` – SeaBean Sep 10 '21 at 16:38
@tibaH_lluN Please run your sample data with my codes (with the column labels changed to reflect your recent update, of course). You will see that the results is exactly your expected result without problem. I actually run this. So, your described problem must be your real data is not the same structure in your sample data. This is why I need you to show your actual data by `df_one.head(5).to_dict()`. You can mask some sensitive data if you like. E.g. mask customer name, or id to some fake values. – SeaBean Sep 10 '21 at 16:49
@tibaH_lluN Could you please also give the similar output for `df_two` ? – SeaBean Sep 10 '21 at 19:10
Date ID Number col name Total / 0 8/24/2021 123456 mo 50.00 1 8/17/2021 789456 jo 55.00` 2 8/19/2021 1010 bo 25.02 Tax col item BBNumber 0 0.0 abc 100 1 0.0 xyz 100 2 0.0 zyx 100 results for df_one.head(5).to_dict() – tibaH_lluN Sep 10 '21 at 19:10
@tibaH_lluN So, what you dump the data by `to_dict` is quite different from the sample data. You don't have a comma separated list of string in column `col item` at all, right ? – SeaBean Sep 10 '21 at 19:15
@tibaH_lluN Is it because it is `df_one` so without the comma separated list ? Please also show `df_two`. Thanks. – SeaBean Sep 10 '21 at 19:18
`df_two.head(5).to_dict()` results col name col code/ 0 mo 123 1 jo 456 2 bo 789 col item 0 abc 1 xyz ,abc 2 zyx, cba , wxy – tibaH_lluN Sep 10 '21 at 19:22
@tibaH_lluN Actually, what you provided is not an output from `to_dict()`. The output should be a dict. We need a dict to be more precise to see the structure. What you provided is just like 3 lines of the sample data above. Not much more information. Please really take the `to_dict` output – SeaBean Sep 10 '21 at 19:22
Do to spacing / character limits I will take a screen shot of each `to_dict` and edit post. – tibaH_lluN Sep 10 '21 at 19:28
@tibaH_lluN Don't take screen shot. Copy the text into your question. Otherwise, I won't be able to copy and paste to test it. – SeaBean Sep 10 '21 at 19:28
@tibaH_lluN Actually, I took what you just sent to be as the dump of `df_one` and `df_two` with more columns. They are all run with the above codes without problem. So, the root cause can't be found just by your pure text sample. It has to be the dump in dict format that really tell the dataframe structure. Otherwise, we won't be able to help you further. – SeaBean Sep 10 '21 at 20:18
Have added the text outputs for the `.to_dict ` of the data frame. Had to edit out sensitive data but kept the formats the same. – tibaH_lluN Sep 10 '21 at 20:20
@tibaH_lluN You have been missing the closing brackets. Should be `df_one.head(5).to_dict()` (with a pair of brackets at the end) instead of just `df_one.head(5).to_dict` – SeaBean Sep 10 '21 at 20:21
@tibaH_lluN The output should be in Python dictionary format like this: `{'Date': {0: '8/24/2021', 1: '8/17/2021', 2: '8/19/2021'}, 'ID Number': {0: 123456, 1: 789456, 2: 1010}, 'col name': {0: 'mo', 1: 'jo', 2: 'bo'}, 'Total': {0: 50.0, 1: 55.0, 2: 25.02}, 'Tax': {0: 0.0, 1: 0.0, 2: 0.0}, 'col item': {0: 'abc', 1: 'xyz', 2: 'zyx'}, 'BBNumber': {0: 100, 1: 100, 2: 100}}` – SeaBean Sep 10 '21 at 20:22
@tibaH_lluN As seen from your unsuccessful attempt of dumping `df_two`, there is already something like: `Unnamed: 0` This is your source dataframe. So, your dataframe already has this. – SeaBean Sep 10 '21 at 20:25
@tibaH_lluN From your `df_two` trial dump above, there are far more 3-characters symbols than the sample data in the initial question. However, some are separated by hyphen `-` instead of comma, So, you only need to split by comma right ? I suppose you have no need to split by hyphen as you have not mentioned this in your initial sample data. – SeaBean Sep 10 '21 at 20:31
@tibaH_lluN Please continue to properly dump your 2 sources dataframes to `to_dict()` (don't the pair of brackets at the end). And also give the expected output based on these samples (you can type it in the question for this expected output). I will check for your further inputs tomorrow and try best to solve it provided that you can supply enough information. – SeaBean Sep 10 '21 at 20:35
I appreciate your help! I have corrected the `.to_dict()` to proper output. Yes, I forgot to mention the hyphen separated words. You are correct, just need them separated by the commas the `-` are still same word only commas separate the words. Let me know if there is anything unclear and I will do my best to update and explain. – tibaH_lluN Sep 10 '21 at 20:56
Thank you for the assistance! This seems to be working now. – tibaH_lluN Sep 13 '21 at 17:24
@tibaH_lluN Glad the hidden problem hindering the solution to work had finally been solved. Happy programming! :-) – SeaBean Sep 13 '21 at 18:02

score 0 · Answer 2 · answered Sep 09 '21 at 20:00

0

If you want to add a new column to an existing dataframe you need to use assign, which will add a pandas Series as a new column. Also, the output of df_two [' col_item '] == df_one[' col_item ']

should be

0    False
1    False
2    False
Name: col_item, dtype: bool

The shapes of the columns you are comparing are the same.

answered Sep 09 '21 at 20:00

Harrison H

30
6

Thanks for link, Ill use assign to create new column in DF One. Im mainly trying to grab the `col_code` values and fill them into new DF One column for matching names and items, each code is assigned to a name and item. Thats why Im checking if they match or not. – tibaH_lluN Sep 09 '21 at 20:14

score 0 · Answer 3 · answered Sep 09 '21 at 21:00

import pandas as pd

df1 = pd.DataFrame({'col_name': ["jo", "bo", "mo"], 'col_item': ['abc', 'efg', 'xyz']})
df2 = pd.DataFrame({'col_name': ["jo", "bo"], 'col_item': ["abc,xyz", "efg,xyz,zyx"], 'col_code': ["123", "456"]})

print(df1)
print(df2)

You should merge the two frames on the common column. See https://stackoverflow.com/a/53645883/496289.
```
  df3 = df1.merge(df2, on='col_name')
  print(df3)
```

Remove unwanted rows.

  def is_col1_in_col2(row):
      return row.col_item_x in row.col_item_y

  df4 = df3[df3.apply(is_col1_in_col2, axis=1)]
  print(df4)

Remove unwanted columns.

  df5 = df4.drop('col_item_y', 1).rename(columns={'col_item_x':'col_item'})
  print(df5)

output:

df1
   col_name col_item
0       jo      abc
1       bo      efg
2       mo      xyz

df2
   col_name     col_item col_code
0       jo      abc,xyz      123
1       bo  efg,xyz,zyx      456

df3
   col_name col_item_x   col_item_y col_code
0       jo        abc      abc,xyz      123
1       bo        efg  efg,xyz,zyx      456

df4
   col_name col_item_x   col_item_y col_code
0       jo        abc      abc,xyz      123
1       bo        efg  efg,xyz,zyx      456

df5
   col_name col_item col_code
0       jo      abc      123
1       bo      efg      456

The issue with this is the `merge ` doesn't set the `col_code` values based off the `col_name` and `col_item`. Each code is assigned to matching name and item. Im checking if the name and item match then returning code that aligns with matching name and item. — tibaH_lluN, Sep 10 '21 at 15:32
Given that merge is performed `on='col_name'`, it does set the value based off that column. Anyway the output is as expected in OP, but point out what's not as expected if so. — Kashyap, Sep 29 '21 at 14:26

Comparing & matching comma separated strings from different dataframes Python pandas

3 Answers3

Edit

Edit 2