0

I am trying to build a loop which will take separate data frames and merge left on a single "master" data frame.

Each of the individual dataframes are out of order and I would like to line them up with values in the CAS column of the master data frame.

Example Master Data Frame

Master Dataframe

Example Data Frame 1:

Dataframe1

Example Data Frame 2:

[Dataframe2]

ACTUAL DATA

Master DataFrameMaster Dataframe

DataFrame1DataFrame1

DataFrame2DataFrame2

  • 1
    Ideally I would like to export the results of the loop to single csv, where CAS and Value align with the master dataframe. Any help is appreciated! – Chris Kubicki Aug 02 '20 at 17:24
  • 1
    Try the [chain merge](https://stackoverflow.com/a/40568957/1422451) after combining all dfs including master into a list setting `CAS` as index on each: `pd.concat([df.set_index('CAS', drop=True) for df in [master] + dataframes], axis=1)` – Parfait Aug 02 '20 at 17:49
  • You should be working in a list of data frames instead of separate objects flooding global environment. Show how you define the 50 separate data frames. – Parfait Aug 02 '20 at 22:52
  • I don't know your complete DataFrame specifications. What columns do they contain? What columns do you want in the DataFrame to be return? It's alright to supply images for your DataFrame structure for clarification, but your data also needs to be supplied in Python code. See [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask). – Jakub Aug 03 '20 at 02:08

2 Answers2

1

Solution 1:
Use if 'value' column only in df1 and df2, but not df_master.

dfcon = pd.concat([df1, df2])
df = pd.merge(df_master, dfcon, how='left', on='CAS')

Solution 2:
Use if 'value' column is also in df_master.

df_master_drop = df_master.drop(columns=['value'])
df_drop = pd.merge(df_master_drop, dfcon, how='left', on='CAS')
df = df_master.combine_first(df_drop)

Notes: Use dfcon = pd.concat([df1, df2]).drop_duplicates('CAS') if there are duplicates. This will preserves earliest CAS value.

Jakub
  • 489
  • 3
  • 13
0

Edit: changed code to include master items without matches in any df

It seems a simple concatenate will do the job (unless there is more data in master that you want to keep):

result = (
    pd.concat(dataframes.append(master_df), axis=0,
        keys=[f'df{i}' for i in range(1, len(dataframes)+1)], names=['source'])
    .sort_values('CAS')
    )
    

The keys parameter is optional, and you definitely didn't ask for it. I included it because it is often useful. It will add a level to the index in the result df so you can track the source or make grouped analysis.

RichieV
  • 5,103
  • 2
  • 11
  • 24
  • Yes, if you want master to be concatenated as well yoy need a new list with all dfs. In this case it seems safe to have many dfs concatenated. But if dfs' columns differ by a lot and they have many rows then you will get a sheet of NaNs. See https://stackoverflow.com/a/49620539/6692898 – RichieV Aug 02 '20 at 18:18
  • This just concatenates the three dataframes and adds a 'source' column. I believe Chris wants to join the values in dataframes 1 and 2 to correspond to the CAS numbers in the master (i.e. add the value column to the master df). – Jakub Aug 03 '20 at 06:38
  • Just made a edit. However, this was already addressed in a comment. @Jakub The goal of this site is to have people help you figure out how you have made a mistake and point out possible solutions, not to write every line of code for you. Not a problem this time, just to keep in mind. – RichieV Aug 03 '20 at 16:56
  • Now I see an edit to the question which makes solution by concatenation by posing a different problem. Now you need to look at concatenating 'dataframes' and then merging with master (unless there is a possibility of having duplicate CAS codes across dfs in 'dataframes', which will pose yet a different problem) – RichieV Aug 03 '20 at 17:05
  • I understand, we all learn to use the site at some point and I am not a years' expert. Don't worry about it, I'm just pointing out suggestions. – RichieV Aug 03 '20 at 18:14
  • @RichieV, you must of misunderstood my comment if you believe my comment was already discussed. Only using concat does not provide the desired solution. Using a merge or join is required as shown in my solution. – Jakub Aug 03 '20 at 19:58