1

I have 3 data frames similar to these dataframes in this topic as follows:

A = pd.DataFrame({"Crop_type":["Wheat", "Wheat", "Wheat" ,"Barley", "Barley", "Barley"], "date": ["03/22/2014","04/15/2014","06/17/2014","03/22/2014","04/15/2014", "06/17/2014"], "impact": ["good", "good", "excellent", "fair", "fair", "fair"]})

B = pd.DataFrame({"Crop_type":["Barley", "Barley",], "date":["03/15/2014","06/26/2014"], "value": ["3","5"] })

C = pd.DataFrame({"Crop_type":["Wheat", "Wheat", "Wheat"], "date":["03/15/2014","04/16/2014","06/26/2014"], "value": ["4","8","1"] })

This is the only snippet. Actually, the first data frame has about 4600 rows and other tables about 250 rows. I need to create a new table, where values from data frames B and C come to A, but also with the dates (as second date). I would like to merge it on the nearest date and crop type. (Data frame samples should tell more). Manually in excel would be suicide.

The output should be something like this:

>>> Output_dataset
 Crop_type Date        Date_Nearest  value  impact
0  Wheat   03/22/2014   03/15/2014     4     good
1  Wheat   04/15/2014   04/16/2014     8     good
2  Wheat   06/17/2014   06/26/2014     1    excellent
3  Barley  03/22/2014   03/15/2014     3     fair
4  Barley  04/15/2014      NaN        Nan    fair
5  Barley  06/17/2014   06/26/2014     5     fair

I've already tried Reindex and Merge, however, It seems to me that Python/Pandas has a problem with different length of tables.

ValueError: cannot reindex a non-unique index with a method or limit
Lukáš Tůma
  • 350
  • 2
  • 15

1 Answers1

1

pd.merge_asof allows for a combination of exact merge keys and "near" keys. To only merge each record in B and C with a single record in A you will need to merge B+C with A and then use A to reindex.

#Make Everything Datetime
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
C['date'] = pd.to_datetime(C.date)

Left = pd.concat([B, C], sort=False)

result = pd.merge_asof(Left.rename(columns={'date': 'date_nearest'}).sort_values('date_nearest'),
                       A.sort_values('date'),
                       by='Crop_type', 
                       left_on='date_nearest',
                       right_on='date',
                       direction='nearest')

result = (result.set_index(['Crop_type', 'date'])
                .reindex(pd.MultiIndex.from_arrays([A['Crop_type'], A['date']]))
                .reset_index())

  Crop_type       date date_nearest value     impact
0     Wheat 2014-03-22   2014-03-15     4       good
1     Wheat 2014-04-15   2014-04-16     8       good
2     Wheat 2014-06-17   2014-06-26     1  excellent
3    Barley 2014-03-22   2014-03-15     3       fair
4    Barley 2014-04-15          NaT   NaN        NaN
5    Barley 2014-06-17   2014-06-26     5       fair

If A has rows duplicated on ['Crop_type', 'date'] you can add an additional column using A.groupby(['Crop_type', 'date']).cumcount() so that you can reindex.

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Thanks! It works. Going to see difference between merge and merge_asof – Lukáš Tůma Mar 13 '20 at 17:53
  • Hello! I would like to ask you - How to drop duplicate by groupby, cumcount() amd reindex. I have now table with 4633 rows and `A.groupby(['Crop_type', 'date']).cumcount()` is 4512. Thx! – Lukáš Tůma Mar 16 '20 at 21:09