
I would like to remove all rows in a pandas DataFrame that start with a comment character. For example:

>>> COMMENT_CHAR = '#'
>>> df
    first_name    last_name
0   #fill in here fill in here
1   tom           jones

>>> df.remove(df.columns[0], startswith=COMMENT_CHAR) # in pseudocode
>>> df
    first_name    last_name
0   tom           jones

How would this actually be done?

David542

1 Answer


Setup

>>> import pandas as pd
>>> data = [['#fill in here', 'fill in here'], ['tom', 'jones']]
>>> df = pd.DataFrame(data, columns=['first_name', 'last_name'])
>>> df
      first_name     last_name
0  #fill in here  fill in here
1            tom         jones

Solution assuming only the strings in the first_name column matter:

>>> commented = df['first_name'].str.startswith('#')
>>> df[~commented].reset_index(drop=True)
  first_name last_name
0        tom     jones
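
One caveat: if first_name can contain missing values, `str.startswith` returns a missing value for those rows instead of False, which makes the mask awkward to negate. A minimal sketch, assuming you want to keep rows with a missing first name, is to pass `na=False`. The extra NaN row is only there for illustration, and `COMMENT_CHAR` is borrowed from the question.

```python
import numpy as np
import pandas as pd

COMMENT_CHAR = "#"

# Same data as above, plus one row with a missing first name (illustrative only)
df = pd.DataFrame(
    {
        "first_name": ["#fill in here", "tom", np.nan],
        "last_name": ["fill in here", "jones", "smith"],
    }
)

# na=False treats missing values as "not a comment", so the mask stays boolean
commented = df["first_name"].str.startswith(COMMENT_CHAR, na=False)

# Keep the non-comment rows and re-label them from zero
result = df[~commented].reset_index(drop=True)
print(result)
```

Here the rows for tom and the missing first name are both kept; only the '#' row is dropped.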

Solution assuming you want to drop rows where the string in the first_name OR last_name column starts with '#':

>>> commented = df.apply(lambda col: col.str.startswith('#')).any(axis=1)
>>> df[~commented].reset_index(drop=True)
  first_name last_name
0        tom     jones
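
Note that the `apply` version assumes every column holds strings; the `.str` accessor raises an error on a numeric column. A rough sketch of one way around that is to cast the columns to text just for the check. `drop_commented_rows`, its parameters, and the `age` column are hypothetical names used for illustration, not part of pandas.

```python
import pandas as pd


def drop_commented_rows(df, comment_char="#", columns=None):
    """Drop rows where any checked column starts with comment_char.

    Hypothetical helper: `columns` limits the check to some columns;
    by default every column is checked.
    """
    cols = list(df.columns) if columns is None else list(columns)
    # Cast to str only for the check, so numeric columns do not break .str
    as_text = df[cols].astype(str)
    commented = as_text.apply(lambda col: col.str.startswith(comment_char)).any(axis=1)
    # Filter the original frame, so the original dtypes are preserved
    return df[~commented].reset_index(drop=True)


df = pd.DataFrame(
    {
        "first_name": ["#fill in here", "tom", "ann"],
        "last_name": ["fill in here", "jones", "#lee"],
        "age": [0, 41, 35],  # numeric column, made up for the example
    }
)

cleaned = drop_commented_rows(df)                              # checks every column
first_only = drop_commented_rows(df, columns=["first_name"])   # checks one column
print(cleaned)
print(first_only)
```

Casting is a blunt choice: it also turns missing values into text, which is harmless here because that text never starts with '#'.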

The purpose of reset_index is to re-label the rows starting from zero.

>>> df[~commented]
  first_name last_name
1        tom     jones
>>>
>>> df[~commented].reset_index()
   index first_name last_name
0      1        tom     jones
>>>
>>> df[~commented].reset_index(drop=True)
  first_name last_name
0        tom     jones
timgeb
  • could you please explain the purpose of doing `reset_index()` at the end of the call and why that's necessary? – David542 Dec 17 '18 at 21:42
  • @David542 sure - without `reset_index`, each row that is kept keeps its original row label. In this example, the remaining row would have the label `1`. With `reset_index`, you re-label the rows starting from `0`, and `drop=True` prevents the original index you are discarding from being added as a column. – timgeb Dec 17 '18 at 21:44
  • thanks for adding it into the answer. – David542 Dec 17 '18 at 21:46