1

I have a pandas dataframe containing a column with strings (that are comma separated substrings). I want to remove some of the substrings and write the remaining ones to a new column in the same dataframe.

The code I have written to do this looks like this:

def remove_betas(df):
    for index,row in df.iterrows():
        list= row['Column'].split(',')
        if 'substring' in list:
            list.remove('beta-lactam')
            New= (',').join(list)
        elif not 'substring' in list:
            New= (',').join(Gene_list)
    return New
    df['NewColumn'].iloc[index]=New






 df.apply(remove_betas, axis=1)

When I run it, my new column contains only zeros. The thought behind this code is to get each string for each row in df, split it at comma into substrings and search the resulting list for the substring I want to remove. After removal, I join the list back together into a string and write that to a new column of df, at the same index position as the corresponding row.

What do I have to change to write the resulting substrings to a new column in the desired manner?

EDIT

By the way, I have tried to write a lambda expression as in how to compute a new column based on the values of other columns in pandas - python , but I cannot really figure out how to do everything in a vectorized function.

I also tried replacing the substring with nothing ( as in df.column.replace('x,?', ''), but that does not work since I have to count the lists later. Therefore the substring must be removed as in list.remove('substring')

Community
  • 1
  • 1
sequence_hard
  • 5,115
  • 10
  • 30
  • 50

1 Answers1

1

Why not employing a one liner regex solution:

import re

df = pd.DataFrame({'col1':[3,4,5],'col2':['a,ben,c','a,r,ben','cat,dog'],'col3':[1,2,3]})

#In [220]: df
#Out[220]:
#   col1     col2  col3
#0     3  a,ben,c     1
#1     4  a,r,ben     2
#2     5  cat,dog     3

df['new'] = df.col2.apply(lambda x: re.sub(',?ben|ben,?', '', x))

#In [222]: df
#Out[222]:
#   col1     col2  col3      new
#0     3  a,ben,c     1      a,c
#1     4  a,r,ben     2      a,r
#2     5  cat,dog     3  cat,dog

Or just use replace:

In [272]: df.col2.str.replace(',?ben|ben,?', '',case=False)
Out[272]:
0        a,c
1        a,r
2    cat,dog
Name: col2, dtype: object
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87
  • Hey, thanks for your answer. I already tried replace, but that is not an option as I have to count the list elements later. with df.column.replace('x,?', '') I stll add an element to the list, meaning it is still counted even when 'removed' from the list. – sequence_hard Dec 10 '15 at 15:30
  • Could you explain the difference to me? I tried your way, but as I said, when I do `df['columnNewNew']=df['new'].apply(lambda x: len(x.split(',')))` I still get the length of `col2`, not the "new" length – sequence_hard Dec 10 '15 at 15:37
  • which difference? sorry but your are asking a lot of questions in a single question! which is not SO working ... I already provided two different answer solving the original problem (with `apply` and a `vectorized` one) – Colonel Beauvel Dec 10 '15 at 15:38
  • I meant the difference between sub and replace. But I can also look that up. Your second answer works, thank you. – sequence_hard Dec 10 '15 at 15:48