I'm currently cleaning data from a CSV file. I've successfully made everything lowercase and removed stopwords, punctuation, etc., but I still need to remove special characters. For example, the CSV file contains things such as 'César' and '‘disgrace’'. If there is a way to replace these characters, even better, but I am fine with removing them. Below is the code I have so far.
import pandas as pd
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
# read the file once, decoding it as UTF-8
df = pd.read_csv('soccer.csv', encoding='utf-8')
df.columns = ['post_id', 'post_title', 'subreddit']
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching
df['post_title'] = df['post_title'].str.lower().str.replace(r'[^\w\s]+', '', regex=True).str.split()
stop = stopwords.words('english')
df['post_title'] = df['post_title'].apply(lambda x: [item for item in x if item not in stop])
df['post_title'] = df['post_title'].apply(lambda x: [lemma.lemmatize(y) for y in x])
df.to_csv('clean_soccer.csv')
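To show the kind of replacement I have in mind, here is a minimal sketch using only the standard-library `unicodedata` module. The helper name `to_ascii` is hypothetical, not part of my script, and I'm assuming it is acceptable to drop any character that has no ASCII equivalent (such as curly quotes) rather than map it to something else.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: replace accented letters with their base
    letter and drop any remaining non-ASCII characters."""
    # NFKD decomposition splits 'é' into 'e' plus a combining accent mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' discards the accent marks and
    # any character without an ASCII decomposition, such as curly quotes
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = to_ascii('César ‘disgrace’')
print(sample)
```

Something like this could presumably be applied to the column with `df['post_title'].apply(to_ascii)` before the lowercasing and punctuation step, but I have not confirmed that it covers every character in my file.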