0

I am working with a few csv datasets in order to create a new synthesized output that tells the user what data types from certain surveys need to be archived. After some normalizing and merges, I'm left with two final data frames to merge:

df1
    Cruise ID   needs_ctd   needs_adcp
0   1505          FALSE         TRUE
1   1506          FALSE         TRUE

df2
    Cruise ID   needs_wc    WC Instrument
0   NF1505         TRUE         EM710
1   NF1505         TRUE         Reson7125
2   NF1506         TRUE         EK60

Currently, I'm merging using: df_out = df1.merge(df2, how="left", on="Cruise ID")

Which gives the following result:

df_out
    Cruise ID   needs_ctd   needs_adcp  needs_wc    WC Instrument 
0   1505           FALSE        TRUE      TRUE          EM710
1   1505           FALSE        TRUE      TRUE          Reson7125
2   1506           FALSE        TRUE      TRUE          EK60

The problem here is that it can create some confusion for the user who might get confused about why "needs_adcp" is being repeated on two lines. So I'd like to instead be able to move the second WC Instrument information to new columns for 1505.

What I'd like to see instead:

df_out
    Cruise ID   needs_ctd   needs_adcp  needs_wc    WC Instrument   needs_wc2   WC Instrument2
0   1505           FALSE        TRUE      TRUE          EM710        TRUE           Reson7125
1   1506           FALSE        TRUE      TRUE          EK60    

Thank you for your help!

2 Answers2

0

I don't think it's possible to have two columns with the same names "WC Instrument" in dataframe. Maybe combine EM710 and Reson7125 as a list in your df2 so that you have a unique Cruise IDs in df2.

how to combine EM710 and Reson7125 into a list. How to use groupby to concatenate strings in python pandas?

ratneshn
  • 1
  • 2
  • So I can have them be 'needs_wc2' and 'WC Instrument2' or something like that. The csv with the input data actually has the instruments in a list format. The problem is the csv I am comparing against has them separated out as separate archival entries. – oneremainsclear Aug 19 '20 at 17:45
  • can you transform the other one into this format? Cause with this you just have one column as "WC Instrument" with lists, then u just need to check if the list are same. You might not know which one is WC1, WC2 etc etc and that might lead to more issues. Ex; in one it this it WC1 might be EM710, while in the other it might be Reson7125. – ratneshn Aug 19 '20 at 17:51
0

If you are OK to rename the second (and possibly the following) WC Instrument and needs_wc columns you can do something like (I have to admit its a bit far-fetched and there might be a more elegant way to do it) :

df2_reindex = df2.set_index(['Cruise ID', df2.groupby('Cruise ID').cumcount()])
df1.merge(
    df2_reindex['WC Instrument']
         .unstack(fill_value='')
         .add_prefix('WC Instrument_')
         .reset_index()).merge(
        df2_reindex['needs_wc']
             .unstack(fill_value='')
             .add_prefix('needs_wc_')
             .reset_index())

Outputs as expected :

   Cruise ID  needs_ctd  needs_adcp  ... WC Instrument_1 needs_wc_0 needs_wc_1
0      1505      False        True  ...      Reson7125       True       True
1      1506      False        True  ...                      True           

Note that you can make it work without knowing in advance the name of the columns that may need to be created with a suffix with something like this:

res = df1.copy()
df2_reindex = df2.set_index(['Cruise ID', df2.groupby('Cruise ID').cumcount()])
for col in df2_reindex.columns:
    res = res.merge(
        df2_reindex[col]
            .unstack(fill_value='')
            .add_prefix(col + '_')
            .reset_index())
mgc
  • 5,223
  • 1
  • 24
  • 37