Dynamically do a join on pandas dataframes

Question

I want to filter on a column and then dynamically join resulting dataframes. My naive approach is; given a dataframe, write a function that filters based on values in a column to get smaller then join. But I don't know how to join dynamically. Any better way of doing this?

data = {'name': ['Jason', 'Molly', 'Jason', 'Jason', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'sale': [41, 24, 31, 32, 31]}

df = pd.DataFrame(data)
print df

def joinDF(df):
    unique_yr = df.year.unique().tolist()  
    i = 1
    for yr in unique_yr:
        df1 = df.loc[df['year'] == yr]
        if len(df.index) != 0:
           #make columns unique then join on name
           df1[['year'+ str(i),'sale'+ str(i), 'name']] = df1[['year','sale','name']]
           i+=1
           print df1

joinDF(df)

   sale name  year
0   41  Jason  2012
1   24  Molly  2012
2   31  Jason  2013
3   32  Jason  2014
4   31  Molly  2014

   sale1 name   year1
0   41   Jason  2012
1   24   Molly  2012

   sale2 name   year2
2   31   Jason  2013

   sale3 name   year3
3   32   Jason  2014
4   31   Molly  2014

Doing ajoin, resulting output dataframe should look like this:

    sale1 name1  year1  sale2   year2  sale3  year3
0   41   Jason  2012   31     2013   32    2014
1   24   Molly  2012   NA     NA     31    2014

It's not clear what you're asking. What dataframe would you like as an output? And where did the `age` information come from? — IanS, May 27 '16 at 07:31
@IanS sorry my bad, I changed it to sale, no age. The output should be a join of all three, as shown — ArchieTiger, May 27 '16 at 07:38

score 1 · Accepted Answer · edited May 23 '17 at 11:58

1

You can use factorize with pivot_table, df is sorted by column year:

df['groups'] = (pd.factorize(df.year)[0] + 1).astype(str)

df1 = (df.pivot_table(index='name', columns='groups', values=['sale', 'year']))
df1.columns = [''.join(col) for col in df1.columns]
print (df1)
       sale1  sale2  sale3   year1   year2   year3
name                                              
Jason   41.0   31.0   32.0  2012.0  2013.0  2014.0
Molly   24.0    NaN   31.0  2012.0     NaN  2014.0

But pivot_table uses aggfunc, default is aggfunc=np.mean if duplicates. Better explanation with sample is here and in docs.

edited May 23 '17 at 11:58

Community

1
1

answered May 27 '16 at 07:41

jezrael

822,522
95
1,334
1,252

beautiful solution @jezrael. Thank you! – ArchieTiger May 27 '16 at 10:54
how to include only sale1 in the dataframe, rather than sale1, sale2, sale3? – ArchieTiger May 30 '16 at 04:26
Do you think only `sale1` and `year1` ? – jezrael May 30 '16 at 05:36
I am not sure, what you need exactly - filter by year is `print (df[df.year == 2012])`, remove others sale `print (df1.drop(['sale2','sale3'], axis=1))`, filter only sale1 `(df1[['sale1']])`. If you need something different, can you explain more? – jezrael May 30 '16 at 05:54

Julien Marrec · Answer 2 · 2016-05-27T07:52:10.633

0

If you absolutely need the output in this repeated saleX, nameX format, @jezrael nailed it I think.

But you might want to do a simplerpivot instead, it'll be a lot less awkward to work with.

In [1]: pivot = df.pivot(index='name',columns='year', values='sale')
        print(pivot)
Out[1]: 
year   2012  2013  2014
name                   
Jason  41.0  31.0  32.0
Molly  24.0   NaN  31.0

edited May 27 '16 at 07:52

answered May 27 '16 at 07:46

Julien Marrec

11,605
4
46
63

Yes, but I think this format is different as want `OP`. – jezrael May 27 '16 at 07:47
That's exactly why I start the answer by saying that if he absolutely needs the output like this he should look at your answer. doesn't hurt to propose a potentially healthier alternative – Julien Marrec May 27 '16 at 07:53

Dynamically do a join on pandas dataframes

2 Answers2