8

I am trying to filter a pandas dataframe using regular expressions. I want to delete those rows that do not contain any letters. For example:

Col A.
50000
$927848
dog
cat 583
rabbit 444

My desired results is:

Col A.
dog
cat 583
rabbit 444

I have been trying to solve this problem unsuccessful with regex and pandas filter options. See blow. I am specifically running into problems when I try to merge two conditions for the filter. How can I achieve this?

Option 1:

df['Col A.'] = ~df['Col A.'].filter(regex='\d+')

Option 2

df['Col A.'] = df['Col A.'].filter(regex=\w+)

Option 3

from string import digits, letters
df['Col A.'] = (df['Col A.'].filter(regex='|'.join(letters)))

OR

df['Col A.'] = ~(df['Col A.'].filter(regex='|'.join(digits)))

OR

df['Col A.'] = df[~(df['Col A.'].filter(regex='|'.join(digits))) & (df['Col A.'].filter(regex='|'.join(letters)))]
Susensio
  • 820
  • 10
  • 19
owwoow14
  • 1,694
  • 8
  • 28
  • 43

4 Answers4

25

I think you'd need str.contains to filter values which contain letters by the means of boolean indexing:

df =  df[df['Col A.'].str.contains('[A-Za-z]')]
print (df)
       Col A.
2         dog
3     cat 583
4  rabbit 444

If there are some NaNs values you can pass a parameter:

df = df[df['Col A.'].str.contains('[A-Za-z]', na=False)]    
print (df)
       Col A.
3         dog
4     cat 583
5  rabbit 444
VnC
  • 1,936
  • 16
  • 26
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

Have you tried:

df['Col A.'].filter(regex=r'\D')  # Keeps only if there's a non-digit character

or:

df['Col A.'].filter(regex=r'[A-Za-z]')  # Keeps only if there's a letter (alpha)

or:

df['Col A.'].filter(regex=r'[^\W\d_]')  # More info in the link below...

Explanation: https://stackoverflow.com/a/2039476/8933502

Samuel GIFFARD
  • 796
  • 6
  • 22
2

df['Col A.'].str.contains(r'^\d+$', na=True) # if string with only digits or if int/float then will result in NaN converted to True

eg: [50000, '$927848', 'dog', 'cat 583', 'rabbit 444', '3 e 3', 'e 3', '33', '3 e'] will give : [True,False,False,False,False,False,False, True,False]

Vaidic
  • 21
  • 3
1

You can use ^.*[a-zA-Z].*$

https://regex101.com/r/b84ji1/1

Details

^: Start of the line

.*: Match any character

[a-zA-Z]: Match letters

$: End of the line

Benoît Zu
  • 1,217
  • 12
  • 22
  • Since some of the rows have a combination of letters and digits, this approach actually filters all of my data. I am just interested in removing those rows that do not contain AT LEAST 1 letter – owwoow14 May 02 '18 at 12:41
  • This is working with your test case. It keep only string with at least one letter. I don't understand what is wrong for you ? – Benoît Zu May 02 '18 at 12:44