I am dealing with very dirty data that contains different kinds of characters I need to remove. Below is just a snapshot. I want to remove these characters only from the start of each string, but my code removes every occurrence of them in col1. The data is in a data frame:

col1:

, Matt R, Carl A
_ Hello, World_
). My Name is ). 'Amy' 
. My name is 'Matt' 
., My name is 'Clark'
My name is 'Amy' #clean row

Code:

articles[col1].str.replace(",","")
articles[col1].str.replace("_","")
articles[col1].str.replace(").","")
articles[col1].str.replace(".","")
articles[col1].str.replace(".,","")
sharp
  • Instead of removing `,_).` separately, would it work if you just remove until the first alphanumeric character? – Moon Cheesez May 02 '18 at 14:59
  • Possible duplicate of [Strip / trim all strings of a dataframe](https://stackoverflow.com/questions/40950310/strip-trim-all-strings-of-a-dataframe) – Moon Cheesez May 02 '18 at 15:05

2 Answers

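If you just want to remove the bad characters from the beginning of your strings, you can use pandas.Series.str.replace with a regular expression anchored at the start of the string (`^`) that matches a run of non-letters: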

In [26]: df
Out[26]:
                     col1
0        , Matt R, Carl A
1         _ Hello, World_
2  ). My Name is ). 'Amy'
3     . My name is 'Matt'
4   ., My name is 'Clark'

In [27]: df['col1'] = df['col1'].str.replace(r'^[^a-zA-Z]+', '', regex=True)

In [28]: df
Out[28]:
                  col1
0       Matt R, Carl A
1        Hello, World_
2  My Name is ). 'Amy'
3    My name is 'Matt'
4   My name is 'Clark'
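
For a single string, the same idea can be sketched with the standard `re` module. This is a minimal illustration, not part of the original answer: the helper name `strip_leading_non_letters` and the sample list are assumptions made for the example. It also shows why a plain replace over-deletes: it removes every occurrence, while the `^` anchor limits the match to the start of the string.

```python
import re

# Matches one or more non-letter characters, but only at the start of the string.
LEADING_NON_LETTERS = re.compile(r"^[^a-zA-Z]+")


def strip_leading_non_letters(text):
    """Hypothetical helper: drop everything before the first ASCII letter."""
    return LEADING_NON_LETTERS.sub("", text)


# Sample rows copied from the question.
rows = [
    ", Matt R, Carl A",
    "_ Hello, World_",
    "). My Name is ). 'Amy'",
    ". My name is 'Matt'",
    "., My name is 'Clark'",
    "My name is 'Amy'",
]

cleaned = [strip_leading_non_letters(row) for row in rows]

# A plain, unanchored replace removes every comma, including the wanted one.
over_deleted = rows[0].replace(",", "")

for before, after in zip(rows, cleaned):
    print(repr(before), "->", repr(after))
```

Note that this pattern strips any leading non-letter, so leading digits would be removed as well. In recent pandas versions, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed.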
user3483203

Suppose the string is in a variable called `a`; then:

import re
re.sub(r'(\.,|_|\.|\)\.|,)(.*)', r'\2', a)

This returns:

 Matt R, Carl A
 Hello, World_
 My Name is ). 'Amy' 
 My name is 'Matt' 
 My name is 'Clark'
 My name is 'Amy' #clean row
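
The pattern above is not anchored, so on a row where one of the listed characters first appears mid-string (for example `Hello, World`) it would delete that character. A hedged variant, assuming only the prefixes listed in the question (`.,`, `).`, `.`, `_`, `,`) should be removed and only at the start, anchors the pattern with `^` and also drops the whitespace that follows. The name `strip_listed_prefix` is a hypothetical helper introduced for this sketch.

```python
import re

# Two-character prefixes come first so that ".," and ")." win over a lone ".".
# The trailing \s* also removes the whitespace left behind after the prefix.
LISTED_PREFIX = re.compile(r"^(?:\.,|\)\.|[._,])\s*")


def strip_listed_prefix(text):
    """Hypothetical helper: remove one listed prefix from the start of text."""
    return LISTED_PREFIX.sub("", text)


# Sample rows copied from the question, plus one clean row containing a comma.
rows = [
    ", Matt R, Carl A",
    "_ Hello, World_",
    "). My Name is ). 'Amy'",
    ". My name is 'Matt'",
    "., My name is 'Clark'",
    "Hello, World",
]

anchored = [strip_listed_prefix(row) for row in rows]

# The unanchored pattern from the answer, applied to the clean row:
# it finds the comma in the middle of the string and removes it.
unanchored_clean_row = re.sub(r"(\.,|_|\.|\)\.|,)(.*)", r"\2", "Hello, World")

for before, after in zip(rows, anchored):
    print(repr(before), "->", repr(after))
```

In pandas, the same anchored pattern could be passed to `Series.str.replace` with `regex=True`.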
Davy
  • I think @sharp only wants to remove those specific characters & combinations he/she mentioned – Davy May 02 '18 at 15:18