1

Cleaning up sharepoint list for upload to mssql with proper table relationships.

Basically, two dataframes (data, config), both share some common columns (country, business). What I want to do is to insert a new column in datadf where for each row it contains index of matching row in configdf based on values in columns country and business.

dataframe data:

-----|---------|----------|-----
 ... | Country | Business | ...
-----|---------|----------|-----
     |    A    |     1    |
-----|---------|----------|-----
     |    A    |     1    |
-----|---------|----------|-----
     |    A    |     2    |
-----|---------|----------|-----
     |    A    |     2    |
-----|---------|----------|-----
     |    B    |     1    |
-----|---------|----------|-----
     |    B    |     1    |
-----|---------|----------|-----
     |    B    |     2    |
-----|---------|----------|-----
     |    C    |     1    |
-----|---------|----------|-----
     |    C    |     2    |
-----|---------|----------|-----

dataframe config (ID = index):

----|---------|----------|-----
 ID | Country | Business | ...
----|---------|----------|-----
  1 |    A    |     1    |
----|---------|----------|-----
  2 |    A    |     2    |
----|---------|----------|-----
  3 |    B    |     1    |
----|---------|----------|-----
  4 |    B    |     2    |
----|---------|----------|-----
  5 |    C    |     1    |
----|---------|----------|-----
  6 |    C    |     2    |
----|---------|----------|-----

what I want to add to dataframe data:

-----|---------|----------|-----------|-----
 ... | Country | Business | config_ID | ... 
-----|---------|----------|-----------|-----
     |    A    |     1    |     1     |
-----|---------|----------|-----------|-----
     |    A    |     1    |     1     |
-----|---------|----------|-----------|-----
     |    A    |     2    |     2     |
-----|---------|----------|-----------|-----
     |    A    |     2    |     2     |
-----|---------|----------|-----------|-----
     |    B    |     1    |     3     |
-----|---------|----------|-----------|-----
     |    B    |     1    |     3     |
-----|---------|----------|-----------|-----
     |    B    |     2    |     4     |
-----|---------|----------|-----------|-----
     |    C    |     1    |     5     |
-----|---------|----------|-----------|-----
     |    C    |     2    |     6     |
-----|---------|----------|-----------|-----

----Found something that works----

datadf['config_ID'] =  datadf.apply(lambda x: configdf[(configdf.country == x.country) & (configdf.business_unit == x.business_unit)].index[0], axis=1)

It gets the job done, although I am open for other suggestions, especially if it could work with df.insert()

MrPilipo
  • 139
  • 1
  • 1
  • 8
  • 1
    I think you are looking for a join operation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html – julian Dec 03 '19 at 10:22
  • That was my first guess, but after going through documentation I have no idea how to make it work in this case – MrPilipo Dec 03 '19 at 11:07

2 Answers2

0

You can use numpy.where function to match the data frames

For example:

datadf = pd.DataFrame([['USA','Business1'],['AUS','Business2'],['UK','Business3'],['IND','Business4']],
                          columns=['country','business'])
configdf = pd.DataFrame([['AUS','Business2'],['IND','Business4'],['USA','Business1'],['UK','Business3']],
                          columns=['country','business'])

datadf['new_col'] = datadf.apply(lambda x: (np.where(x == configdf)[0][0]),axis=1)
print(datadf)

Output:

  country   business  new_col
0     USA  Business1        2
1     AUS  Business2        0
2      UK  Business3        3
3     IND  Business4        1

EDIT1:

Well, in that case, you can use

datadf['new_col'] = datadf.apply(lambda x: (np.where((x['country'] == configdf['country']) & (x['business'] == configdf['business']))[0][0]),axis=1)

Output based on your sample data frames datadf and configdf:

  country business  new_col
0       A        1        0
1       A        1        0
2       A        2        1
3       A        2        1
4       B        1        2
5       B        1        2
6       B        2        3
7       C        1        4
8       C        2        5
davidbilla
  • 2,120
  • 1
  • 15
  • 26
  • Close... it returns list of rows where either country or business matches. While country & business sets in configdf are unique, country and business entries alone are not. Added data structure sample for reference. – MrPilipo Dec 03 '19 at 11:14
  • @MrPilipo the same should work, please see EDIT 1 above – davidbilla Dec 03 '19 at 11:35
  • Thanks, ended up using `datadf['config_ID'] = datadf.apply(lambda x: configdf[(configdf.country == x.country) & (configdf.business_unit == x.business_unit)].index[0], axis=1)` but your answer guided me to it. Will do some testing later to see if there is any significant advantage of one solution over another. – MrPilipo Dec 03 '19 at 12:38
  • I think the merge solution by @julian is better unless you dont care care about resetting the index – davidbilla Dec 03 '19 at 12:47
  • I care about getting indexes from the 2nd df, that's all the data I need from it. merge() and join() were the first things I've tried and those do not give what I expect. – MrPilipo Dec 03 '19 at 12:57
0

Here is a solution using pandas merge.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge

import pandas as pd

# make the two dataframes
data = pd.DataFrame({'Country':['A','A','A','A','B','B','B','C','C'],
                     'Business':[1,1,2,2,1,1,2,1,2]})

configdf = pd.DataFrame({'Country':['A','A','B','B','C','C'],
                         'Business':[1,2,1,2,1,2]})

# make a column with the index values
configdf.reset_index(inplace=True)

# merge the two dataframes based on the selected columns.
newdf = data.merge(configdf, on=['Country', 'Business'])
julian
  • 451
  • 2
  • 8
  • Thanks, but that was one of the first things I've tried, doesn't give expected results. – MrPilipo Dec 03 '19 at 12:59
  • If you execute the exact code above then the result is the same as you have stated the desired output to be in your question. What exactly have you tried and how is it different to what you want? – julian Dec 04 '19 at 13:22