
This is my pandas DataFrame with original column names.

old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt
1               3            0                0
2               1            1                5
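
For reference, a minimal sketch that reproduces this frame (the name df is assumed):

import pandas as pd

df = pd.DataFrame({'old_dt_cm1_tt': [1, 2],
                   'old_dm_cm1': [3, 1],
                   'old_rr_cm2_epf': [0, 1],
                   'old_gt': [0, 5]},
                  columns=['old_dt_cm1_tt', 'old_dm_cm1', 'old_rr_cm2_epf', 'old_gt'])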
  1. First I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
  2. After this I want to create a new column for each unique cm. In this example there should be 2 new columns.
  3. Finally, in each new column I want to store the total count of non-zero values from the original columns, i.e.
old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt    cm1    cm2    
1               3            0                0         2      0        
2               1            1                5         2      1

I implemented the first step as follows:

cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]   # columns whose name contains 'cm'
df.loc[:, ind].columns
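
For the example frame, ind evaluates to the three column names that contain cm:

print(ind)
['old_dt_cm1_tt', 'old_dm_cm1', 'old_rr_cm2_epf']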

How do I proceed with steps 2 and 3 so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because the original data set might have many cm variations.)

JoeBlack

2 Answers


You can use:

print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5

First you can filter the columns that contain the string cm, so columns without cm are removed.

df1 = df.filter(regex='cm')
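
For the example frame, df1 now holds only the three columns that contain cm:

print(df1)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf
0              1           3               0
1              2           1               1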

Now you can rename the columns to their cm values, like cm1, cm2, cm3.

print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']

df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1

Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum it, since True is converted to 1 and False to 0. You need to count by unique column names, so group by the columns and sum the values.

df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True

print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1

You need the unique column names, which are added to the original df:

print(df1.columns.unique())
['cm1' 'cm2']

Last, you can add the new columns (here df[['cm1','cm2']]) from the groupby result:

df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
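
On newer pandas versions groupby(..., axis=1) is deprecated, so a rough equivalent (a sketch, assuming the same df) is to transpose, group on the index level, and transpose back:

df1 = df.filter(regex='cm')
df1.columns = [part for c in df1.columns for part in c.split('_') if part.startswith('cm')]
# booleans sum as 1/0, so this counts the non-zero cells per cm group
counts = df1.astype(bool).T.groupby(level=0).sum().T
df[counts.columns] = counts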
jezrael
  • This is clearly a great method, but you need a bit more explanation, as I don't know much about pandas and wouldn't be able to use this other than by copy-paste, which is not all that helpful to the OP in the long run. – Tadhg McDonald-Jensen Mar 11 '16 at 15:35
  • Ok, no problem. I'll try to explain more. – jezrael Mar 11 '16 at 15:36
  • I'd also like to point out that `[cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']` will raise an error if `cm` is present in a column name more than once (like `cmd_cm1`), whereas `['cm'+c[c.index('cm') + 2] for c in df.columns if 'cm' in c]` will only use the first occurrence of the substring. – Tadhg McDonald-Jensen Mar 11 '16 at 15:44
  • Although the two methods work on different principles, maybe just don't use a list comprehension and add appropriate `break` statements to prevent keeping multiple unique names from a single column. – Tadhg McDonald-Jensen Mar 11 '16 at 15:46
  • Thank you for checking my code. Now I have to go. I'll ask the OP whether strings like `cmd` can appear in column names. If yes, I'll update the answer. If not, no problem. – jezrael Mar 11 '16 at 15:59
  • Now that I understand what you are doing I see that `df1 = df.filter(regex="cm")` is creating a copy without the need for `df1 = df.copy()`. But my goodness this is a **wonderful** answer! – Tadhg McDonald-Jensen Mar 11 '16 at 16:09

Once you know which columns have cm in them, you can map them (with a dict) to the desired new column names with an adapted version of this answer:

col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
                                   #   ^ if you are hard coding this in you might as well use 2

so that instead of taking the whole string after cm, the new name is cm plus the character directly following it; in this case it would be:

{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}

Then add the new columns to the DataFrame by iterating over the dict:

for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]

Note that int(a != 0) will simply give 0 if the value is 0 and 1 otherwise. The only issue with this is that dicts are inherently unordered, so it may be preferable to add the new columns in order according to the values (like the answer here):

import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]

to ensure the new columns are inserted in order.
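
With the example frame, the sorted loop produces the same result as the other answer:

print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1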

Tadhg McDonald-Jensen