I'm trying to replace a column in a DataFrame with preprocessed text data.

I have imported an Excel file as a pandas DataFrame:

df = pd.read_excel(*file path*)

This file consists of x rows of documents and 12 columns.

I extracted the column 'Text' for NLP.

text_article = df['Text']

I have preprocessed this column (removal of digits and stopwords, tokenization, lemmatization, etc.), which resulted in the following variable: text_article['final']

I now want to replace the column df['Text'] with text_article['final'], but I don't know how.

When I export the DataFrame, I still get the original 'Text' column:

df.to_excel('*name*.xlsx', index=False)

I've tried the following code to replace the column or add the new one, but neither attempt works:

df.insert(text_article['final'])

and

text_article['final'] = df['Text']

I'm relatively new to Python, so I hope I've clearly formulated my question. Thanks in advance.

Annick

2 Answers

If both columns have the same length, this should work:

df['Text'] = text_article['final']

You did it the other way around. The target, df['Text'], goes on the left-hand side of the assignment, and the new value, text_article['final'], goes on the right.
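
To illustrate the direction of the assignment, here is a minimal sketch. The sample data, the `preprocess` function, and the variable name `final` are hypothetical stand-ins, since the real Excel file and NLP pipeline are not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in data; the real DataFrame comes from the Excel file.
df = pd.DataFrame({
    "Title": ["first", "second"],
    "Text": ["The 2 cats sat", "3 Dogs ran"],
})

def preprocess(text):
    # Hypothetical placeholder for the real NLP pipeline:
    # lowercases the text and drops tokens that consist only of digits.
    return " ".join(tok.lower() for tok in text.split() if not tok.isdigit())

# Build the processed values as a new Series that shares df's index.
final = df["Text"].apply(preprocess)

# Target on the left, new values on the right.
df["Text"] = final

print(df["Text"].tolist())
```

Note that when the right-hand side is a Series, pandas matches rows by index label rather than by position, so the processed Series should keep the same index as df. If the original text still shows up afterwards, it may be worth checking whether the processed variable was overwritten by an earlier reversed assignment.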

Also, this post might answer your question.

junsuzuki
  • Thanks for your quick response. It does seem to work, but the text in the column remains the original text, instead of the preprocessed text. So the changes I've made in text_article['final'] are lost. – Annick Aug 12 '22 at 15:28

I was able to add the column with preprocessed text to the dataframe by using the following code:

df2 = df.assign(Title_New_Column = text_article['final'])
df2.to_excel('File_Name.xlsx', index=False)
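
A small sketch of how assign behaves, under the assumption that the processed text is a Series with the same index as df. The sample data and the names `final`, `Text_Clean`, and `df3` are hypothetical. assign returns a new DataFrame and leaves the original untouched, and passing an existing column name replaces that column in the returned copy.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({"Text": ["Raw One", "Raw Two"]})

# Hypothetical stand-in for the preprocessed text.
final = df["Text"].str.lower()

# Adds a new column next to the original one; df itself is unchanged.
df2 = df.assign(Text_Clean=final)

# Reusing the existing column name replaces 'Text' in the returned copy.
df3 = df.assign(Text=final)

print(list(df2.columns))
print(df3["Text"].tolist())
```

Either df2 or df3 can then be exported with to_excel in the same way as above.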
Annick