
I have a dataset with 2.6 million rows and one column called msgText, which contains written messages.

Now, I want to filter out all messages that don't contain any letters. To do so I found the following code:

dataset = dataset[dataset['msgText'].astype(str).str.contains('[A-Za-z]')]

However, after 16 hours the code is still running.

Furthermore, based on "Does Python have a string 'contains' substring method?", I thought about creating a list of length 26 containing all the letters of the alphabet and then checking whether each cell contains each letter. But that does not seem efficient either.

Therefore, I am wondering if there is a faster way to find whether a cell contains letters.


EDIT: The code above works pretty well. Apparently, what I had in my (slow) code assigned the filtered result back to the column instead of to the DataFrame: dataset['msgText'] = dataset[dataset['msgText'].astype(str).str.contains('[A-Za-z]')]
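For reference, a minimal sketch of the working filter on a made-up frame, rebinding the filtered DataFrame rather than assigning to the column (the sample data here is hypothetical):

import pandas as pd

# Hypothetical stand-in for the real 2.6M-row dataset
dataset = pd.DataFrame({'msgText': ['hello world', '123', '?!)', 'abc123']})

# Keep only rows whose msgText contains at least one ASCII letter,
# and rebind the whole filtered frame, not a single column
dataset = dataset[dataset['msgText'].astype(str).str.contains('[A-Za-z]')]

print(dataset['msgText'].tolist())  # ['hello world', 'abc123']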

Emil
    I actually tested your code and it takes about 5-6 seconds to complete with the following dataframe: `df = pd.DataFrame({'msgText': ['aasad dasd ', 'bdsd e ', 'ggre qew ', '1 1 1 ']*2600000})` – VnC May 07 '19 at 12:38
  • Do your strings have spaces or special characters in them? Or is it always something like `'12adwf1231'` or `'123'`? – ALollz May 07 '19 at 15:04

2 Answers

import pandas

dataset['columnName'].apply(lambda x: x.find('\\w') > 0)
Petronella
  • This does not give an error, but does not work either. After `dataset['msgText'].iloc[0] = '?!)'`, I ran your code and it outputs `False`, just like for cells that do contain letters – Emil May 07 '19 at 12:29
  • Ok, I cannot test it, but the trick is the lambda function. You can also extract it into a method that takes the value as an argument and then return whatever you want from it. Right now it returns a boolean value that is probably ignored, which is why it doesn't do anything. You can also use regex: https://www.w3schools.com/python/showpython.asp?filename=demo_regex_seq8 – Petronella May 07 '19 at 12:32
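For what it's worth, a working variant of this lambda idea would use re.search instead of str.find (str.find treats '\\w' as a literal two-character substring, not a regex). A sketch on made-up data:

import re
import pandas as pd

# Hypothetical sample frame; 'msgText' stands in for the real column
dataset = pd.DataFrame({'msgText': ['?!)', 'hello', '123', 'abc123']})

# re.search returns a match object (truthy) when the cell contains a letter;
# [A-Za-z] is used rather than \w, since \w also matches digits and underscores
mask = dataset['msgText'].astype(str).apply(lambda x: bool(re.search('[A-Za-z]', x)))
dataset = dataset[mask]

print(dataset['msgText'].tolist())  # ['hello', 'abc123']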

You can use the vectorized str.isalpha() method of pandas' string accessor.

import pandas as pd

df = pd.DataFrame({'msgText': ['a', 'b', 'g', '1']})
column = df['msgText']
# strip spaces first so multi-word messages still count as purely alphabetic
column[column.str.replace(' ', '').str.isalpha()]

Would return:

0    a
1    b
2    g
Name: msgText, dtype: object

Test case with 10.4 million rows (the 4-element sample repeated 2.6 million times):

import datetime
import pandas as pd

df = pd.DataFrame({'msgText': ['a', 'b', 'g', '1']*2600000})
column = df['msgText']
start = datetime.datetime.now()
new_col = column[column.str.replace(' ', '').str.isalpha()]
end = datetime.datetime.now()
print(f'Time taken: {end - start}; Shape: {new_col.shape}')

OUTPUT:
Time taken: 0:00:06.144576; Shape: (7800000,)
VnC
  • This code also drops cells that contain both letters and numbers/other characters. – Emil May 07 '19 at 12:41
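To illustrate that comment, a small sketch on made-up data comparing the two filters on a mixed cell:

import pandas as pd

column = pd.Series(['abc', '123', 'abc123'], name='msgText')

# str.isalpha keeps only cells made purely of letters, so 'abc123' is dropped
print(column[column.str.replace(' ', '').str.isalpha()].tolist())  # ['abc']

# str.contains('[A-Za-z]') keeps any cell with at least one letter, so 'abc123' survives
print(column[column.str.contains('[A-Za-z]')].tolist())  # ['abc', 'abc123']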