2

I am currently cleaning data from a CSV file. I have successfully made everything lowercase and removed stopwords, punctuation, etc., but I still need to remove special characters. For example, the CSV file contains things such as 'César' and '‘disgrace’'. If there is a way to replace these characters, even better, but I am fine with removing them. Below is the code I have so far.

import pandas as pd
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

# Read the file once, decoding it as UTF-8
df = pd.read_csv('soccer.csv', encoding='utf-8')

df.columns = ['post_id', 'post_title', 'subreddit']
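# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['post_title'] = df['post_title'].str.lower().str.replace(r'[^\w\s]+', '', regex=True).str.split()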


stop = stopwords.words('english')

df['post_title'] = df['post_title'].apply(lambda x: [item for item in x if item not in stop])

df['post_title']= df['post_title'].apply(lambda x : [lemma.lemmatize(y) for y in x])


df.to_csv('clean_soccer.csv')
plshelpme_

3 Answers

2

When saving the file try:

df.to_csv('clean_soccer.csv', encoding='utf-8-sig')

or simply

df.to_csv('clean_soccer.csv', encoding='utf-8')
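For context, the only difference between the two encodings is a byte order mark: 'utf-8-sig' writes three extra bytes at the start of the file, which some programs (Excel, for example) use to detect UTF-8. A minimal sketch with the standard library, using a made-up sample string rather than the real CSV:

```python
import codecs

text = "César ‘disgrace’"  # sample string, not taken from the real file

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# utf-8-sig is plain UTF-8 with the 3-byte BOM prepended
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom[len(codecs.BOM_UTF8):] == plain)

# Decoding with utf-8-sig strips the BOM again, so the text round-trips
print(with_bom.decode("utf-8-sig") == text)
```

If the special characters look garbled only when the cleaned file is opened in a spreadsheet, the BOM variant is probably the one to try first.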
VnC
0

I'm not sure if there's an easy way to replace the special characters, but I know how you can remove them. Try using:

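df['post_title'] = df['post_title'].str.replace(r'[^A-Za-z0-9]+', '', regex=True)

That should replace 'César' '‘disgrace’' with 'Csardisgrace'. Note that this pattern also strips whitespace; add \s inside the brackets if you want to keep the spaces between words. Hope this helps.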
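If replacing is preferred over removing, one possible approach is to decompose accented letters with the standard library's unicodedata module and then drop whatever is still non-ASCII. This is only a sketch: the to_ascii helper and the QUOTES table are hypothetical names introduced here, and the quote mapping covers just the common curly quotes.

```python
import unicodedata

# Hypothetical lookup table for typographic quotes; extend as needed
QUOTES = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})

def to_ascii(text):
    """Replace curly quotes, strip accents, drop any remaining non-ASCII."""
    text = text.translate(QUOTES)
    # NFKD splits 'é' into 'e' plus a combining accent character
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining accent is not ASCII, so 'ignore' silently drops it
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("César ‘disgrace’"))  # Cesar 'disgrace'
```

Assuming the column still holds plain strings, it could be applied with df['post_title'].apply(to_ascii) before the punctuation is stripped. Characters with no ASCII decomposition are removed rather than replaced.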

charlzee
0

As an alternative to other answers, you could use string.printable:

import string

printable = set(string.printable)

def remove_spec_chars(in_str):
    return ''.join([c for c in in_str if c in printable])

# apply returns a new Series, so assign the result back to the column
df['post_title'] = df['post_title'].apply(remove_spec_chars)

For reference, string.printable is a fixed string of ASCII characters, the same on every machine: a combination of digits, ascii_letters, punctuation, and whitespace.

For your example string 'César' '‘disgrace’' this function returns 'Csardisgrace'.
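A quick way to check both points, repeating the function so the snippet runs on its own. The sample string with a space in it is made up for illustration; it shows that ASCII whitespace survives the filter.

```python
import string

printable = set(string.printable)

def remove_spec_chars(in_str):
    # Keep only characters that appear in string.printable
    return ''.join([c for c in in_str if c in printable])

# string.printable is exactly these four constants joined together
parts = string.digits + string.ascii_letters + string.punctuation + string.whitespace
print(string.printable == parts)

# 'é' and the curly quotes are dropped, the space is kept
print(remove_spec_chars("César ‘disgrace’"))  # Csar disgrace
```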

https://docs.python.org/3/library/string.html
How can I remove non-ASCII characters but leave periods and spaces using Python?

xibalba1