
Context:

I have a df like this:

title                                 text
Donald Trump Sends Out $15B           Donald Trump just couldn't wish all Americans
Drunk Bragging Trump Staffer Started  House Intelligence Committee Chairman Devin
...                                   ...

Both title and text have the object dtype.

I am trying to run the following code:

for i in range (0, len(msg)):
    review = re.sub('[^a-zA-Z]',' ', df['title'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

Error:

However, I am getting the following error on the re.sub line:

TypeError: expected string or bytes-like object

I referred to this question, but made no progress; I am still getting the same error.

Desired output:

>code: corpus[0:1]
>Result: [['donald trump send    b'], ['drink brag trump staffer start']]

What I tried?

I tried all the suggestions from the SO question linked above. I also tried changing the column's dtype with df['title'] = df['title'].astype('string'), but I am getting the same error.

Additional info:

  • If I use different code to replace the non-alphabetic characters, I get AttributeError: 'Series' object has no attribute 'lower' on the lower() line.
  • I have a different df in a different notebook, where this code works perfectly (the columns are also of object dtype).

Any help would be appreciated!

Arun
  • Can you please post the `expected output` based on your sample input? – Mayank Porwal Jan 05 '22 at 06:53
  • `object` makes no sense at all here; how did you create them? – tripleee Jan 05 '22 at 06:55
  • @MayankPorwal, Have added desired output. (changed input for better interpretation) – Arun Jan 05 '22 at 06:59
  • @tripleee, The dataset is available in kaggle and by default the dtype is `object`. I heard `object` refers to `string` – Arun Jan 05 '22 at 07:00
  • Can you try typecasting `df['title'][i]` to `str(df['title'][i])` inside `re.sub` statement? – Sandeep Gusain Jan 05 '22 at 07:07
  • @SandeepGusain, I tried this: `re.sub('[^a-zA-Z]',' ', str(df['title'][i]))`.. but getting `KeyError: 23481` – Arun Jan 05 '22 at 07:12
  • @Arun, See if @SultanOrazbayev answer helps. Else try checking if `df['title'][i]` contains the value you want. – Sandeep Gusain Jan 05 '22 at 07:28
  • @SandeepGusain, It did not help and if I try to ignore `re.sub` step, I am getting `AttributeError: 'Series' object has no attribute 'lower'` as mentioned in additional info – Arun Jan 05 '22 at 07:57

1 Answer


One common reason for this error is that there is a NaN in the relevant column. A quick way to fix this is to assign an empty string to such values:

df['title'] = df['title'].fillna("").astype('string')
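
To illustrate why a NaN leads to this error, here is a minimal sketch that uses only the standard library. It assumes that a missing value in an object column shows up as the float nan, which re.sub rejects because it is not a string. The helper name clean_title is hypothetical, and the stemming and stopword steps from the question are left out to keep the example self-contained.

```python
import re


def clean_title(value):
    """Return lowercase, letters-only text; non-strings (e.g. NaN) become ''."""
    if not isinstance(value, str):
        return ""
    letters_only = re.sub("[^a-zA-Z]", " ", value)
    # split() + join() collapses the runs of spaces left by the substitution
    return " ".join(letters_only.lower().split())


# Stand-in for the values of df['title']; float('nan') plays the missing title.
titles = ["Donald Trump Sends Out $15B", float("nan"), "Drunk Bragging Trump Staffer"]

# Passing the NaN straight to re.sub reproduces the TypeError from the question.
try:
    re.sub("[^a-zA-Z]", " ", titles[1])
    raised = False
except TypeError:
    raised = True

# Guarding against non-strings lets the whole column be processed.
corpus = [clean_title(t) for t in titles]
```

With pandas, the same helper could be applied as df['title'].map(clean_title), or the loop could iterate over the column's values after fillna(""). Iterating over values instead of looking up df['title'][i] would also avoid a KeyError like the one in the comments, which is what a label lookup raises when the index has gaps (for example after rows were dropped).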
SultanOrazbayev