How to handle "expected string or bytes-like object" while removing non-alphabets from a pandas df column using regular expression?

Question

Context:

I have a df like this:

title	text
Donald Trump Sends Out $15B	Donald Trump just couldn't wish all Americans
Drunk Bragging Trump Staffer Started	House Intelligence Committee Chairman Devin
...	...

Both title and text have the object dtype.

I am trying to run the following code:

for i in range (0, len(msg)):
    review = re.sub('[^a-zA-Z]',' ', df['title'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

Error:

However, I get the following error on the re.sub line:

TypeError: expected string or bytes-like object

I referred to this question, but made no progress. I am still getting the same error.

Desired output:

>code: corpus[0:1]
>Result: [['donald trump send    b'], ['drink brag trump staffer start']]

What I tried:

I tried all the suggestions from the SO link above. I also tried changing the dtype of the column with df['title'] = df['title'].astype('string'), but I get the same error :(

Additional info:

If I use different code to replace the non-alphabetic characters, I get AttributeError: 'Series' object has no attribute 'lower' on the lower() line.
I have a different df in a different notebook, and there this code works perfectly (the columns are also of object dtype).

Any help would be appreciated!

Can you please post the `expected output` based on your sample input? — Mayank Porwal, Jan 05 '22 at 06:53
`object` makes no sense at all here; how did you create them? — tripleee, Jan 05 '22 at 06:55
@MayankPorwal, Have added desired output. (changed input for better interpretation) — Arun, Jan 05 '22 at 06:59
@tripleee, The dataset is available in kaggle and by default the dtype is `object`. I heard `object` refers to `string` — Arun, Jan 05 '22 at 07:00
Can you try typecasting `df['title'][i]` to `str(df['title'][i])` inside `re.sub` statement? — Sandeep Gusain, Jan 05 '22 at 07:07
@SandeepGusain, I tried this: `re.sub('[^a-zA-Z]',' ', str(df['title'][i]))`.. but getting `KeyError: 23481` — Arun, Jan 05 '22 at 07:12
@Arun, See if @SultanOrazbayev answer helps. Else try checking if `df['title'][i]` contains the value you want. — Sandeep Gusain, Jan 05 '22 at 07:28
@SandeepGusain, It did not help and if I try to ignore `re.sub` step, I am getting `AttributeError: 'Series' object has no attribute 'lower'` as mentioned in additional info — Arun, Jan 05 '22 at 07:57

score 0 · Answer 1 · answered Jan 05 '22 at 07:19

One common reason for this error is that there is a NaN in the relevant column. A quick way to fix this is to assign an empty string to such values:

df['title'] = df['title'].fillna("").astype('string')
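
As a minimal sketch of that fix, assuming a small made-up frame in which one title is missing: the snippet first shows that re.sub rejects a NaN, then fills the gaps and cleans every title. The sample data and the names raised and cleaned are illustrative, and the stemming and stopword steps from the question are left out to keep it self-contained.

```python
import re
import pandas as pd

# Hypothetical sample data; the NaN stands in for a missing title.
df = pd.DataFrame(
    {"title": ["Donald Trump Sends Out $15B",
               float("nan"),
               "Drunk Bragging Trump Staffer Started"]}
)

# A missing value is a float NaN, not a str, so re.sub raises TypeError.
raised = False
try:
    re.sub("[^a-zA-Z]", " ", df["title"][1])
except TypeError:
    raised = True

# Replace missing values with empty strings before applying the regex.
titles = df["title"].fillna("").astype("string")

# Keep letters only, lowercase, and collapse runs of whitespace.
cleaned = [
    " ".join(re.sub("[^a-zA-Z]", " ", title).lower().split())
    for title in titles
]
print(cleaned)
```

Iterating over the column's values, rather than indexing with df['title'][i], also avoids depending on the index labels.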

SultanOrazbayev

Thanks for your time. I ran this code and getting same error in `re.sub` line – Arun Jan 05 '22 at 07:29
Also, there are no `NaN` values as `df.isna().sum()` gave me `0` – Arun Jan 05 '22 at 07:46
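
Since there are no NaN values, another possibility, which is only an assumption and not confirmed anywhere in this thread, is that the index of df is not a clean 0..n-1 range. df['title'][i] looks up the label i, so a duplicated label returns a Series (which re.sub rejects, and which has no lower()) and a missing label raises KeyError, both of which match the errors reported in the comments. The sketch below reproduces this on a made-up frame and sidesteps it by iterating over the values; the data and the names key_error and corpus are illustrative.

```python
import re
import pandas as pd

# Hypothetical frame whose index has a duplicate label and a gap,
# as can happen after concatenating frames or dropping rows.
df = pd.DataFrame(
    {"title": ["Donald Trump Sends Out $15B",
               "Drunk Bragging Trump Staffer Started",
               "Another Title"]},
    index=[0, 0, 5],
)

# Label 0 matches two rows, so this is a Series, not a str.
first = df["title"][0]
print(type(first))

# Label 1 does not exist, so label-based lookup raises KeyError.
key_error = False
try:
    df["title"][1]
except KeyError:
    key_error = True

# Iterating over the values does not depend on the index labels at all.
corpus = [
    " ".join(re.sub("[^a-zA-Z]", " ", str(title)).lower().split())
    for title in df["title"]
]
print(corpus)
```

If the loop over range(len(...)) has to stay, calling df.reset_index(drop=True) first, or using df['title'].iloc[i], would give positional access instead. Checking df.index.is_unique and comparing len(msg) with len(df) should show whether this applies.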
