
I have several tables that look like this:

ID  YY  ZZ
 2  97  826
 2  78  489
 4  47  751
 4  110 322
 6  67  554
 6  88  714

code:

import pandas as pd

raw = {'ID': [2, 2, 4, 4, 6, 6],
       'YY': [97, 78, 47, 110, 67, 88],
       'ZZ': [826, 489, 751, 322, 554, 714]}
df = pd.DataFrame(raw)

For each of these dfs, I have to perform a number of operations.

  1. group by ID,
  2. extract the length and the average of the column ZZ,
  3. put the results in a new df.

The new df looks like this:

Cities  length  mean
Paris     0      0
Madrid    0      0
Berlin    0      0
Warsaw    0      0
London    0      0

code:

raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2)

I pulled out the average and the size of the individual groups:

df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
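
For the sample data above, both grouped Series are indexed by ID and only contain the IDs that actually appear in df; for example df_grouped is:

    ID
    2    2
    4    2
    6    2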

The problem occurs when trying to transfer the results to the new table: the grouped results do not cover all of the cities, and each value has to be matched to the appropriate key.

I tried to use a dictionary:

   dic_cities = {"Paris":df_grouped.loc[2],
             "Madrid":df_grouped.loc[4],
             "Warsaw":df_grouped.loc[6],
             "Berlin":df_grouped.loc[8],
             "London":df_grouped.loc[10]}

Unfortunately, I'm receiving KeyError: 8

I have 19 df's from which I have to extract this data and the final tables have to look like this:

    Cities  length  mean
    Paris   2       657.5
    Madrid  2       536.5
    Berlin  0       0.0
    Warsaw  2       634.0
    London  0       0.0

Does anyone know how to deal with this using groupby and a dictionary, or know a better way to do it?


2 Answers


See this:

import pandas as pd

# setup raw data
raw = {'ID': [2, 2, 4, 4,  6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)

# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)

# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)

# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})

# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']

Output:

    Cities  length   mean
2    Paris     2.0  657.5
4   Madrid     2.0  536.5
8   Berlin     NaN    NaN
6   Warsaw     2.0  634.0
10  London     NaN    NaN
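
If you want zeros instead of NaN for the missing cities, as in the desired output in the question, one option is to fill them afterwards:

# optional: replace NaN with 0 and turn the counts back into integers
df2[['length', 'mean']] = df2[['length', 'mean']].fillna(0)
df2['length'] = df2['length'].astype(int)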
– Zaraki Kenpachi

First, you should index df2 on 'Cities':

raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')

Then you should reverse your dictionary:

dic_cities = {2: "Paris",
             4: "Madrid",
             6: "Warsaw",
             8: "Berlin",
             10: "London"}

Once this is done, the processing is as simple as a groupby:

import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()

Which gives for df2:

        length   mean
Cities               
Paris      2.0  657.5
Madrid     2.0  536.5
Berlin     0.0    0.0
Warsaw     2.0  634.0
London     0.0    0.0
– Serge Ballesta
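
For reference, the same table can also be built without the explicit loop, as a single aggregation (a minimal sketch reusing dic_cities and the indexed df2 from above; df2_alt is just an illustrative name):

summary = df.groupby('ID').ZZ.agg(['size', 'mean'])                     # length and mean per ID
summary = summary.rename(columns={'size': 'length'}, index=dic_cities)  # IDs -> city names
df2_alt = summary.reindex(df2.index, fill_value=0)                      # cities with no rows get 0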
  • Thank you very much, it works perfectly. But I have one more question: how can I round the mean() values to 2 decimal places, and how can I convert them to integers? – Kepasere Jun 04 '19 at 07:46
  • `astype` is the way to change a column to integer: `df2['length'] = df2['length'].astype(int)`. But rounding a floating value to exactly 2 decimal places is hopeless; just read [Is floating point math broken?](https://stackoverflow.com/q/588004/3545273) to understand why. It is just a display question and depends on how you later process the dataframe. – Serge Ballesta Jun 04 '19 at 07:53
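
For completeness, the conversions discussed in these comments could look like this (a sketch based on the df2 produced above; note that round only changes the stored values, display formatting is a separate concern):

df2['length'] = df2['length'].astype(int)   # counts as integers
df2['mean'] = df2['mean'].round(2)          # round the stored means to 2 decimal places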