-1

I have a very large data file which can be modelled with the code below:

import pandas as pd
import numpy as np
import random

x = pd.DataFrame({'file_name': np.repeat(['csv_1', 'csv_2'], 32),
                  'side': np.tile(np.repeat(['n', 'f'], 16),2),
                  'ten_13':np.tile(np.repeat(['10', '13'], 8),4),
                  'rand_data': np.random.uniform(-1, 1, size=64)})

x = x.set_index(['file_name', 'side', 'ten_13'])
x.head()

I need your help to add a column called web_col using pandas' groupby( ) method to the above DataFrame x.

The web_col column depends on the combination of values in columns side, ten_13 and has one of four possible sequence values:

  • 1:8 if the value of 'side' = 'n' and 'ten_13' = '10'
  • 9:16 if the value of 'side' = 'n' and 'ten_13' = '13'
  • 17:24 if the value of 'side' = 'f' and 'ten_13' = '10'
  • 25:32 if the value of 'side' = 'f' and 'ten_13' = '13'

file_name contains various csv files; in the code below it has only two files. Each file has the same exact values of side and ten_13.

I created this web_col column using the numpy function np.where(), BUT I would like to use pandas' groupby( ) instead. The np.where() code is below:

ten_13 = x.index.get_level_values('ten_13')
side   = x.index.get_level_values('side')

n_10 = ((ten_13 == '10') & (side == 'n')).reshape(-1,8)
n_13 = ((ten_13 == '13') & (side == 'n')).reshape(-1,8)
f_10 = ((ten_13 == '10') & (side == 'f')).reshape(-1,8)
f_13 = ((ten_13 == '13') & (side == 'f')).reshape(-1,8)

web_col = np.where(n_10, np.arange(1, 9, 1),
                   np.where(f_10, np.arange( 17, 25, 1),
                            np.where(n_13, np.arange(9, 17, 1),
                                     np.where(f_13, np.arange(25,33, 1), np.arange(1, 9, 1)))))
x['web_col'] = web_col.reshape(64,1)
x.head()
Cleb
  • 25,102
  • 20
  • 116
  • 151
Ragy Isaac
  • 1,458
  • 1
  • 17
  • 22
  • Why would you like to use groupby() specifically? groupby() doesn't feel like the right tool for this problem. groupby() is great for creating a grouping over which you will perform some aggregation, but that's not your goal. Perhaps an apply() is more appropriate? – SPKoder Feb 15 '16 at 01:30
  • Because I need to know more about groupby(), What tool do you think is appropriate for this problem? – Ragy Isaac Feb 15 '16 at 02:02
  • Try [apply](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html). See [this question](http://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition) for a good example – Bill Harper Feb 15 '16 at 02:24

1 Answers1

-1

You could use pandas.apply() for this, but considering that you have only 4 conditions, the following is probably more efficient.

x.loc[(x['side'] == 'n') & (x['ten_13'] == '10'), 'web_col'] = '1:8'
... for each condition

If you set the appropriate indexes, this will be even faster.

x.set_index(['side', 'ten_13'], inplace=True)
x.sortlevel(inplace=True)  # speeds up lookups
x.loc[('n', '10'), 'web_col'] = '1:8'
... for each condition
x.reset_index(inplace=True) # to restore original dataframe
Chinmay Kanchi
  • 62,729
  • 22
  • 87
  • 114
  • thank you but the code you supplied does not work for the model data supplied. Additionally, I requested the solution to be with groupby ( ) method – Ragy Isaac Feb 15 '16 at 12:18
  • There was a typo in the code - also, groupby is the wrong tool for the job here. – Chinmay Kanchi Feb 15 '16 at 12:55
  • Your code has the typo or mine? I understand, but I requested groupby(). Please notice that I provided a way to create the column using np.where( ). The purpose is for me to learn groupby ( ) – Ragy Isaac Feb 15 '16 at 13:05