1

I have many dataframes read from pdf files. And they look like this:

e.g order 1 - this is a dataframe:
code    description   price   quantity       
000001  product A       1       10
000002  product B       2       20
000003  product C       3       30 
...

order 2 - this is a dataframe:
code    description   price   quantity
000001  product A       1       100
000002  product B       2       20
000004  product D       4       40

There will be orders 3, 4 etc I like to join them up and group by the code(which is unique) but display separately the quantities.

code    description   price   order1 quantity  order2 quantity
000001  product A       1       10              100
000002  product B       2       20              20
000003  product C       3       30               0 
000004  product D       4        0              40

Apart from using tedious loops, I wonder if there is a cleaner way to achieve this in pandas. I used to process this in excel vba previously using ugly loops and is my first time trying it in pandas.

Thanks alot for any help!

J T
  • 245
  • 2
  • 13
  • You need to concat and drop duplicates. Check [this](https://stackoverflow.com/a/41181829/8353711) – shaik moeed Jul 16 '19 at 06:58
  • Possible duplicate of [Pandas/Python: How to concatenate two dataframes without duplicates?](https://stackoverflow.com/questions/21317384/pandas-python-how-to-concatenate-two-dataframes-without-duplicates) – shaik moeed Jul 16 '19 at 06:59
  • @shaikmoeed - I think OP need soemthing else. – jezrael Jul 16 '19 at 07:08

1 Answers1

1

You can create MultiIndex Series by DataFrame.set_index or aggregate sum in list comprehension and then join together by concat:

dfs = [df1, df2, ..., dfn]

#if triples code, description, price are unique
L = [x.set_index(['code', 'description', 'price'])['quantity'] for x in dfs]

#if triples are not unique and necessary aggregate sum
L = [x.groupby(['code', 'description', 'price'])['quantity'].sum() for x in dfs]

df1 = (pd.concat(L, axis=1, keys=range(1, len(dfs) + 1))
         .add_prefix('quantity')
         .fillna(0)
         .astype(int)
         .reset_index())
print (df1)
     code description  price  quantity1  quantity2
0  000001   product A      1         10        100
1  000002   product B      2         20         20
2  000003   product C      3         30          0
3  000004   product D      4          0         40

EDIT:

For set default column names for each DataFrame in list comprehension is possible by DataFrame.set_axis:

L = [x.set_axis(np.arange(len(x.columns)), axis=1, inplace=False).set_index([0,1,2])[3] 
                                             for x in dfs]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for input, before i can try out these solutions, I need rework my dataframe. Do you know how to remove the annoying column indexing? I mean each row already has a index, so each row is labeled 0 1 2 etc. But I have the same 0,1,2 etc for the columns? I believe this is preventing me from calling .columns and giving my dataframe column names. Would you have any idea? – J T Jul 16 '19 at 21:23
  • 1
    I cannot express my gratitude, you answered me within like 10 minutes of posting my question and this works as per what I wanted. For me to do this in excel vba would have taken me a while and did not want to go down this route. Thanks alot jezrael senpai. – J T Jul 18 '19 at 18:23