
I have one data frame df:

   fruit      date    volume
0  apple    20141001    2000
1  apple    20141101    1800
2  apple    20141201    2200
3  orange   20141001    1900
4  orange   20141101    2000
5  orange   20141201    3000
….

and I have the following two data frames:

apple:

   date       price
0  20141001   2
1  20141101   2.5
2  20141201   3

orange:

   date       price
0  20141001   1.5
1  20141101   2
2  20141201   2

How can I merge all of these into the following data frame?

   fruit      date    price    volume
0  apple    20141001   2       2000
1  apple    20141101   2.5     1800
2  apple    20141201   3       2200
3  orange   20141001   1.5     1900
4  orange   20141101   2       2000
5  orange   20141201   2       3000
….

This is just an example; in my real work, I have hundreds of 'fruit' DataFrames with price data that need to be merged into the first data frame.

Should I use merge or join? What is the difference between them? Thank you.

EdChum
user6396
  • `join` uses `merge` underneath; as a rule of thumb, if you are going to join on indices then use `join`, otherwise use `merge` – EdChum Feb 03 '15 at 13:25
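To illustrate the rule of thumb in the comment above, here is a minimal sketch with two hypothetical frames, `left` and `right`: `join` aligns on the index, while `merge` aligns on columns by default (and can mix a column with an index via `left_on`/`right_index`).

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])

# join: aligns right's index against left's index,
# so 'key' must first become the index
joined = left.set_index('key').join(right)

# merge: aligns on columns; here left's 'key' column
# is matched against right's index explicitly
merged = left.merge(right, left_on='key', right_index=True)
```

Both produce the same `y` values; the difference is purely in how the alignment keys are specified.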

1 Answer


For your sample data you can achieve what you want by performing concat twice; this assumes that the last two dfs align row-for-row with the master df. The inner concat concatenates the two supplementary dfs into a single df row-wise, and the outer concat concatenates column-wise:

In [56]:
# this concats the 2 supplementary dfs row-wise into a single df
pd.concat([df1,df2], ignore_index=True)
Out[56]:
       date  price
0  20141001    2.0
1  20141101    2.5
2  20141201    3.0
3  20141001    1.5
4  20141101    2.0
5  20141201    2.0
In [54]:
# now concat column-wise with the main df
pd.concat([df,pd.concat([df1,df2], ignore_index=True)], axis=1)
Out[54]:
    fruit      date  volume      date  price
0   apple  20141001    2000  20141001    2.0
1   apple  20141101    1800  20141101    2.5
2   apple  20141201    2200  20141201    3.0
3  orange  20141001    1900  20141001    1.5
4  orange  20141101    2000  20141101    2.0
5  orange  20141201    3000  20141201    2.0

However, for your real data, what you will need to do is add the price column for each fruit:

In [55]:

df[df['fruit'] == 'apple'].merge(df1, on='date')
Out[55]:
   fruit      date  volume  price
0  apple  20141001    2000    2.0
1  apple  20141101    1800    2.5
2  apple  20141201    2200    3.0

and repeat for each fruit.
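With hundreds of fruits, the per-fruit merge above can be wrapped in a loop rather than repeated by hand. A sketch, assuming the price DataFrames are collected in a dict (`price_dfs` is a hypothetical name, keyed by fruit):

```python
import pandas as pd

# sample data matching the question
df = pd.DataFrame({
    'fruit': ['apple'] * 3 + ['orange'] * 3,
    'date': [20141001, 20141101, 20141201] * 2,
    'volume': [2000, 1800, 2200, 1900, 2000, 3000],
})
df1 = pd.DataFrame({'date': [20141001, 20141101, 20141201],
                    'price': [2.0, 2.5, 3.0]})
df2 = pd.DataFrame({'date': [20141001, 20141101, 20141201],
                    'price': [1.5, 2.0, 2.0]})

# hypothetical dict mapping each fruit name to its price DataFrame
price_dfs = {'apple': df1, 'orange': df2}

# merge each fruit's rows with its price df, then stack the pieces
pieces = [df[df['fruit'] == fruit].merge(prices, on='date')
          for fruit, prices in price_dfs.items()]
result = pd.concat(pieces, ignore_index=True)
```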

An approach to your real data problem would be to add a 'fruit' column to each supplemental df, concatenate all these and then merge back using 'fruit' and 'date' columns as the keys:

In [57]:

df1['fruit'] = 'apple'
df2['fruit'] = 'orange'
fruit_df = pd.concat([df1,df2], ignore_index=True)
fruit_df
Out[57]:
       date  price   fruit
0  20141001    2.0   apple
1  20141101    2.5   apple
2  20141201    3.0   apple
3  20141001    1.5  orange
4  20141101    2.0  orange
5  20141201    2.0  orange
In [58]:

df.merge(fruit_df, on=['fruit', 'date'])
Out[58]:
    fruit      date  volume  price
0   apple  20141001    2000    2.0
1   apple  20141101    1800    2.5
2   apple  20141201    2200    3.0
3  orange  20141001    1900    1.5
4  orange  20141101    2000    2.0
5  orange  20141201    3000    2.0
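Scaled up to hundreds of fruits, the same approach can be driven from a dict of supplementary DataFrames; a sketch assuming a hypothetical `price_dfs` mapping, using `assign` to tag each df with its fruit without mutating the originals:

```python
import pandas as pd

# sample data matching the question
df = pd.DataFrame({
    'fruit': ['apple'] * 3 + ['orange'] * 3,
    'date': [20141001, 20141101, 20141201] * 2,
    'volume': [2000, 1800, 2200, 1900, 2000, 3000],
})
df1 = pd.DataFrame({'date': [20141001, 20141101, 20141201],
                    'price': [2.0, 2.5, 3.0]})
df2 = pd.DataFrame({'date': [20141001, 20141101, 20141201],
                    'price': [1.5, 2.0, 2.0]})

# hypothetical dict mapping each fruit name to its price DataFrame
price_dfs = {'apple': df1, 'orange': df2}

# tag each supplementary df with its fruit, stack them, and
# merge back on the ('fruit', 'date') key pair
fruit_df = pd.concat(
    [prices.assign(fruit=fruit) for fruit, prices in price_dfs.items()],
    ignore_index=True)
result = df.merge(fruit_df, on=['fruit', 'date'])
```

A single merge on the composite key replaces hundreds of per-fruit merges.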
EdChum
  • Is it possible to use groupby? – user6396 Feb 03 '15 at 13:58
  • @user6396 You could group by fruit and date and then set the price. The problem, though, is that if you have lots of dfs with no identifier in them (i.e. no fruit column), then you can't merge them back without setting all the rows that match the dates to the same price. What you could do is add the fruit column to each supplementary df, concatenate all these supplementary dfs, and then merge them back using the fruit and date columns; that is how I would approach this – EdChum Feb 03 '15 at 14:01