
I tried to answer this question with a group-level merge. Below is a slightly modified version of that question, but here I need the output to be produced by merging at the group level.

Here are the input dataframes:

import numpy as np
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 2, 2],
                   "cat": ['a', 'b', 'c', 'a', 'c'],
                   "value": range(5),
                   "value2": np.array(range(5)) * 2})

df

cat  group  value  value2
  a      1      0       0
  b      1      1       2
  c      1      2       4
  a      2      3       6
  c      2      4       8

categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)

    cat
0   a
1   b
2   c
3   d

Here's the expected output:

cat  group  value  value2
  a      1      0       0
  b      1      1       2
  c      1      2       4
  d     NA     NA      NA
  a      2      3       6
  c      2      4       8
  b     NA     NA      NA
  d     NA     NA      NA

Question:

I can achieve what I want with a for loop. Is there a more idiomatic pandas way to do it, though?

(I need to perform an outer join between `categories` and each group produced by `df.groupby('group')`.)

grouped = df.groupby('group')

merged_list = []
for g in grouped:
    # g is a (group_key, sub-DataFrame) tuple
    merged = pd.merge(categories, g[1], how='outer', on='cat')
    merged_list.append(merged)

out = pd.concat(merged_list)
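
A small optional polish, not part of the question as asked: the concatenated pieces each keep their own 0..3 index, and `pd.concat` can renumber it into a single running index:

out = pd.concat(merged_list, ignore_index=True)  # index becomes 0..n-1 instead of repeating per group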
akilat90
  • That's an inefficient way of doing things... why do it? – cs95 Nov 05 '17 at 06:10
  • I thought merging with sub-groups is a little more readable. And partly because I didn't get your solution with `reindex` :) Maybe I should read up more on `reindex` – akilat90 Nov 05 '17 at 06:15

2 Answers

4

I think groupby + merge is an overcomplicated way to do this here.

A faster approach is to reindex with a MultiIndex:

mux = pd.MultiIndex.from_product([df['group'].unique(), categories['cat']],
                                 names=('group', 'cat'))
df = df.set_index(['group', 'cat']).reindex(mux).swaplevel(0, 1).reset_index()
# blank out `group` for the rows that were added by the reindex
df['group'] = df['group'].mask(df['value'].isnull())
print(df)
  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    NaN    NaN     NaN
4   a    2.0    3.0     6.0
5   b    NaN    NaN     NaN
6   c    2.0    4.0     8.0
7   d    NaN    NaN     NaN
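
Because the reindex introduces NaN, the integer columns are upcast to float in the output above. As an aside beyond the original answer, on pandas 1.0 or newer the nullable Int64 dtype can bring them back to integers:

# assumes pandas >= 1.0 for the nullable Int64 dtype; missing values become <NA>
df[['group', 'value', 'value2']] = df[['group', 'value', 'value2']].astype('Int64')
print(df.dtypes)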

And a possible solution for the group-level merge itself, via groupby + apply:

df = (df.groupby('group', group_keys=False)
        .apply(lambda x: pd.merge(categories, x, how='outer', on='cat')))
  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    NaN    NaN     NaN
0   a    2.0    3.0     6.0
1   b    NaN    NaN     NaN
2   c    2.0    4.0     8.0
3   d    NaN    NaN     NaN
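
A related sketch that is not part of either answer, starting again from the original df: build the full group-by-category grid with `MultiIndex.to_frame` (available in pandas 0.24+) and left-join the data onto it, which avoids both the Python-level loop and apply:

# sketch: cross the unique groups with the categories, then left-join the data
grid = pd.MultiIndex.from_product([df['group'].unique(), categories['cat']],
                                  names=['group', 'cat']).to_frame(index=False)
out = grid.merge(df, how='left', on=['group', 'cat'])
out['group'] = out['group'].mask(out['value'].isnull())  # NA group for the added rows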

Timings:

np.random.seed(123)
N = 1000000
L = list('abcd')

df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
                   'group':np.random.randint(10000,size=N),
                   'value':np.random.randint(1000,size=N),
                   'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))

categories = ['a', 'b', 'c', 'd']

def jez1(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
    df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
    df['group'] = df['group'].mask(df['value'].isnull())
    return df

def jez2(df):
    grouped = df.groupby('group')
    categories =  pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return grouped.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))



def coldspeed(df):
    grouped = df.groupby('group')
    categories =  pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])

def akilat90(df):
    grouped = df.groupby('group')
    categories =  pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    merged_list = []

    for g in grouped:
        merged = pd.merge(categories, g[1], how = 'outer', on='cat')
        merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
        merged.fillna(0, inplace=True)
        merged_list.append(merged)

    return pd.concat(merged_list)

In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop

In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop

In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop

In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
jezrael
  • Thanks @jezrael; even though this doesn't answer the original question. Maybe I'm thinking in the wrong direction and group-level merges aren't actually a thing. – akilat90 Nov 06 '17 at 14:01
  • Yes, I agree. I only created a better solution for your problem. Nice day! – jezrael Nov 06 '17 at 14:03
  • I'll keep the question unaccepted until someone writes it that way. Good day to you too! – akilat90 Nov 06 '17 at 14:06
  • OK, then I'll add a solution for it ;) – jezrael Nov 06 '17 at 14:08
  • Looking forward to it! – akilat90 Nov 06 '17 at 14:09
  • I first thought about `reduce`, but it does not work here. It is possible to use `df = df.groupby('group', group_keys=False).apply(lambda x: pd.merge(categories, x, how='outer', on='cat'))`. If this is what you want, I can add it to the answer. – jezrael Nov 06 '17 at 14:18
  • Yes! So `apply` treats each group as a separate dataframe and does the merging? Another question: why did you say this is overcomplicated? Isn't it very intuitive? If possible, please add timings. – akilat90 Nov 06 '17 at 14:34
  • I wrote it badly, sorry. To correct it: I think `groupby` + `merge` is an overcomplicated way to do this here. – jezrael Nov 06 '17 at 14:38
  • Thanks a lot for this anyway! Good day! – akilat90 Nov 06 '17 at 14:44
0

To actually answer your question, no - you can only merge 2 dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you certainly can make your code a little neater.

pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])

  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    NaN    NaN     NaN
0   a    2.0    3.0     6.0
1   c    2.0    4.0     8.0
2   b    NaN    NaN     NaN
3   d    NaN    NaN     NaN
cs95