
I am trying to learn beginner data science and I have 2 datasets, of which the 1st one is:

+----+-------+--------+------+------+-------+
| ID | bool  |  num1  |  A   |  B   | event |
+----+-------+--------+------+------+-------+
| a1 | TRUE  | 123456 | 1001 | 1003 |     0 |
| a2 | FALSE | 123456 | 1006 | 1009 |     1 |
| a3 | TRUE  | 144444 | 1020 | 1022 |     2 |
+----+-------+--------+------+------+-------+

and the 2nd one:

+----+--------+-------+------+----------+------+-------+------+
| ID |  num1  | event |  C   | category | num2 | num3  | num4 |
+----+--------+-------+------+----------+------+-------+------+
| a1 | 123456 |     0 | 1002 | aa       | 1.11 | -1.01 | 1.23 |
| a1 | 123456 |     0 | 1003 | bb       | 3.21 |  2.92 | 4.03 |
| a2 | 144444 |     1 | 1008 | aa       | 6.34 |  5.56 | 7.02 |
| a2 | 144444 |     1 | 1009 | aa       | 5.65 |  3.99 | 6.32 |
+----+--------+-------+------+----------+------+-------+------+

From them I want to make a 3rd one like this, where the data is aggregated per the event column:

+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
| event | ID | bool  |  num1  | C values  | count cat1 | count cat2 | count cat3 | min num2 | avg num2 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
|     0 | a1 | TRUE  | 123456 | 1002:1003 |          1 |          1 |          0 |     1.11 |     2.16 |
|     1 | a2 | FALSE | 123456 | 1008:1009 |          2 |          0 |          0 |     5.65 |    5.995 |
|     2 | a3 | TRUE  | 144444 | 1020      |          0 |          0 |          1 |     4.02 |     4.02 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+

This is a reduced example. I have read about stacking, groupby, counting based on another column, numpy.where, reshaping, etc., but I've failed to combine them into anything similar to what I want. What are some suggestions to solve this, starting from the simple ones? Different solutions are welcome so I can try and understand them all. Using Python and Pandas.
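
For reference, the two example frames can be reproduced like this (a minimal sketch; reading the bool column as Python booleans is my assumption):

import pandas as pd

df1 = pd.DataFrame({'ID': ['a1', 'a2', 'a3'],
                    'bool': [True, False, True],
                    'num1': [123456, 123456, 144444],
                    'A': [1001, 1006, 1020],
                    'B': [1003, 1009, 1022],
                    'event': [0, 1, 2]})

df2 = pd.DataFrame({'ID': ['a1', 'a1', 'a2', 'a2'],
                    'num1': [123456, 123456, 144444, 144444],
                    'event': [0, 0, 1, 1],
                    'C': [1002, 1003, 1008, 1009],
                    'category': ['aa', 'bb', 'aa', 'aa'],
                    'num2': [1.11, 3.21, 6.34, 5.65],
                    'num3': [-1.01, 2.92, 5.56, 3.99],
                    'num4': [1.23, 4.03, 7.02, 6.32]})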

ktii
  • Can you post the code that you have tried? – harvpan Jul 19 '18 at 16:54
  • I tried [this](https://stackoverflow.com/a/39608197/8037821), [this](https://stackoverflow.com/a/38925156/8037821), [this](https://stackoverflow.com/a/47359023/8037821), tried to figure out the 'opposite' of [this](https://stackoverflow.com/a/48789339/8037821), [this](https://stackoverflow.com/a/22391554/8037821), [this](https://stackoverflow.com/a/39722273/8037821), In2 from [this](https://pandas.pydata.org/pandas-docs/stable/reshaping.html), [this](https://stackoverflow.com/a/46942684/8037821) and some more but I don't have my tries left any more as they resulted in errors or nonsense. – ktii Jul 20 '18 at 08:12

3 Answers


For a simple case, you can just concatenate the dataframes along the column axis:

pd.concat([df1, df2], axis=1)
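
Note that concat with axis=1 aligns rows purely by their index, not by the event key, so on the example frames it just places df2's four rows next to df1's three; the event-based result still needs a merge plus groupby, as in the answers below.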
shmit
  • This is good for learning concatenate, thanks! But it doesn't take into account that I needed an event-column-based result. Rohith Pinnamaraju's answer helped me understand how to achieve it. – ktii Jul 20 '18 at 07:47

You can try the following:

#Use pd.get_dummies to turn category into indicator columns on the tables joined with merge
df_out = pd.get_dummies(df1.merge(df2,
                                  on=['ID'],
                                  how='left',
                                  suffixes=('', '_y')),
                        columns=['category'],
                        prefix='cat',
                        prefix_sep='_')

#compile a list of the newly created category indicator columns from pd.get_dummies
catcols = df_out.filter(like='cat_').columns.values.tolist()
#create a dictionary for the agg function: sum each indicator column to get counts
aggdict = dict(zip(catcols, ['sum']*len(catcols)))

#add custom aggregations for the other columns to the dictionary
aggdict['C'] = ['min','max']
aggdict['num2'] = 'min'

#add other columns to column list
catcols.append('C')
catcols.append('num2')

#Group by the key columns and flatten the MultiIndex column headers
df_out = df_out.groupby(['event','ID','bool','num1'])[catcols].agg(aggdict)
df_out.columns = df_out.columns.map('_'.join)
print(df_out.reset_index())

Output:

   event    ID     bool      num1  cat_aa_sum  cat_bb_sum   C_min   C_max  num2_min
0    0.0   a1    TRUE    123456.0           1           1  1002.0  1003.0      1.11
1    1.0   a2    FALSE   123456.0           2           0  1008.0  1009.0      5.65
2    2.0   a3    TRUE    144444.0           0           0     NaN     NaN       NaN
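
The target table additionally wants avg num2, the colon-joined C values, and a count for cc even though cc never occurs in the sample. A sketch of how the same dictionary-driven approach could be extended (the Categorical cast and the cc code are assumptions taken from the target column names; it replaces the min/max pair on C, and the cast must run before the get_dummies call above):

#declare the full category set so get_dummies also emits an all-zero cat_cc column
df2['category'] = pd.Categorical(df2['category'], categories=['aa', 'bb', 'cc'])

#extend the agg dictionary before the groupby step
aggdict['num2'] = ['min', 'mean']                                      #min and average of num2
aggdict['C'] = lambda s: ':'.join(s.dropna().astype(int).astype(str))  #colon-joined C values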
Scott Boston

Crude way, but I guess this should work.

import pandas as pd
from statistics import mean

df_in = pd.merge(df1, df2, how="outer")

def func(x):
    # join the group's non-NaN C values into a colon-separated string
    x["C_values"] = ":".join(str(int(i)) for i in list(x.C) if str(i) != "nan")
    # count the occurrences of each category within the group
    x["count_cat1"] = list(x.category).count("aa")
    x["count_cat2"] = list(x.category).count("bb")
    x["count_cat3"] = list(x.category).count("cc")
    # min and mean of num2, ignoring NaNs
    num2_list = [i for i in list(x.num2) if str(i) != "nan"]
    if num2_list:
        x["min_num2"] = min(num2_list)
        x["avg_num2"] = mean(num2_list)
    else:
        x["min_num2"] = None
        x["avg_num2"] = None

    return x[["event", "ID", "bool", "num1", "C_values", "count_cat1", "count_cat2",
              "count_cat3", "min_num2", "avg_num2"]].iloc[0]

df_out = df_in.groupby("event", as_index=False).apply(func)

Output:

    event   ID  bool    num1    C_values    count_cat1  count_cat2  count_cat3  min_num2  avg_num2
    0       a1  True    123456  1002:1003   1           1           0           1.11      2.16
    1       a2  False   123456  1008:1009   2           0           0           5.65      5.995
    2       a3  True    144444              0           0           0           nan       nan
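
For comparison, the same event-based aggregation can be written without apply, using named aggregation in a single groupby().agg(...) call (pandas >= 0.25); joining on both ID and event, and keeping df1's num1 as the target table does, are my assumptions:

import pandas as pd

#keep df1's num1 (the one shown in the target table) and join detail rows on ID + event
merged = df1.merge(df2.drop(columns=['num1']), on=['ID', 'event'], how='left')

df_out = merged.groupby(['event', 'ID', 'bool', 'num1'], as_index=False).agg(
    C_values=('C', lambda s: ':'.join(s.dropna().astype(int).astype(str))),
    count_cat1=('category', lambda s: (s == 'aa').sum()),
    count_cat2=('category', lambda s: (s == 'bb').sum()),
    count_cat3=('category', lambda s: (s == 'cc').sum()),
    min_num2=('num2', 'min'),
    avg_num2=('num2', 'mean'),
)

The lambda-based counts avoid get_dummies entirely, at the cost of one pass over the category column per count.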
Rohith