How to remove special characters from csv using pandas

Question

I am currently cleaning data from a CSV file. I have successfully made everything lowercase and removed stopwords, punctuation, etc., but I still need to remove special characters. For example, the CSV file contains things such as 'CÃ©sar' and 'â€˜disgraceâ€™'. If there is a way to replace these characters, even better, but I am fine with removing them. Below is the code I have so far.

import pandas as pd
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

# Read the file once, with an explicit encoding, and keep the result
df = pd.read_csv('soccer.csv', encoding='utf-8')

df.columns = ['post_id', 'post_title', 'subreddit']
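# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
df['post_title'] = df['post_title'].str.lower().str.replace(r'[^\w\s]+', '', regex=True).str.split()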


stop = stopwords.words('english')

df['post_title'] = df['post_title'].apply(lambda x: [item for item in x if item not in stop])

df['post_title']= df['post_title'].apply(lambda x : [lemma.lemmatize(y) for y in x])


df.to_csv('clean_soccer.csv')

There are quite a few answers around; take a look at, e.g., [this one](https://stackoverflow.com/a/5843547/3519000). Cheers. - lrnzcig, May 14 '19 at 14:57
Try this: `df.to_csv('clean_soccer.csv', encoding='utf-8-sig')` or just `utf-8`. - VnC, May 14 '19 at 15:09

score 2 · Accepted Answer · answered May 14 '19 at 15:24

When saving the file, try specifying the encoding:

df.to_csv('clean_soccer.csv', encoding='utf-8-sig')

or simply

df.to_csv('clean_soccer.csv', encoding='utf-8')

Answered by VnC
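
To see what the two encoding names above actually change, here is a minimal sketch using only Python's built-in codecs (the same codec names that are passed to `to_csv`). It assumes the garbled text in the question comes from UTF-8 bytes being decoded as Windows-1252 (`cp1252`), which is a guess based on the examples given; the variable names are only for illustration.

```python
text = "César"

# 'utf-8' and 'utf-8-sig' produce the same bytes, except that 'utf-8-sig'
# prepends the three-byte byte order mark (BOM) EF BB BF.
plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")
bom_only_difference = with_bom == b"\xef\xbb\xbf" + plain

# Decoding the UTF-8 bytes with the wrong codec (assumed here: cp1252)
# reproduces the garbled text from the question.
garbled = plain.decode("cp1252")

print(bom_only_difference)
print(garbled)
```

If the file only looks wrong when opened in a spreadsheet program, the BOM written by `utf-8-sig` is likely what helps, since such programs commonly use it to recognise the file as UTF-8.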

score 0 · Answer 2 · answered May 14 '19 at 15:09

I'm not sure if there's an easy way to replace the special characters, but I know how you can remove them. Try using:

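# Note: this pattern also strips whitespace; add \s inside the brackets to keep spaces
df['post_title'] = df['post_title'].str.replace(r'[^A-Za-z0-9]+', '', regex=True)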
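
On the "replace" half of the question: strings like 'CÃ©sar' usually appear when UTF-8 text has been decoded as Windows-1252 somewhere along the way. If that is what happened here (an assumption, not something the question confirms), the damage can often be reversed instead of stripped. A rough sketch, with `fix_mojibake` as a hypothetical helper name:

```python
def fix_mojibake(text):
    """Undo a UTF-8 -> Windows-1252 mix-up; return text unchanged if that does not apply."""
    try:
        # Re-encode to the bytes the text originally had, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it alone.
        return text


print(fix_mojibake("CÃ©sar"))
```

With pandas this could be applied as `df['post_title'] = df['post_title'].apply(fix_mojibake)`, assuming every value in the column is a string. It would need to run before lowercasing and punctuation removal, because both of those steps alter the characters the repair relies on.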

score 0 · Answer 3 · answered May 14 '19 at 15:19

As an alternative to other answers, you could use string.printable:

import string

printable = set(string.printable)

def remove_spec_chars(in_str):
    return ''.join([c for c in in_str if c in printable])

# Assign the result back; apply() does not modify the column in place
df['post_title'] = df['post_title'].apply(remove_spec_chars)

For reference, string.printable is a fixed string of ASCII characters: the combination of digits, ascii_letters, punctuation, and whitespace. It does not vary by machine.
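
One thing to be aware of with this filter is that accented letters are dropped entirely rather than replaced. The sketch below compares it with a different technique, Unicode NFKD normalization from the standard `unicodedata` module, which keeps the base letter. `drop_non_printable` and `to_ascii` are hypothetical names used only for this comparison.

```python
import string
import unicodedata

PRINTABLE = set(string.printable)


def drop_non_printable(text):
    # Same filter as remove_spec_chars above: keep only string.printable characters.
    return "".join(c for c in text if c in PRINTABLE)


def to_ascii(text):
    # Decompose accented letters (NFKD), then drop whatever is still not ASCII,
    # such as the combining accent marks left over from the decomposition.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(drop_non_printable("César"))  # Csar
print(to_ascii("César"))            # Cesar
```

Characters with no ASCII base letter, such as curly quotes, are still removed by `to_ascii`.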

https://docs.python.org/3/library/string.html
How can I remove non-ASCII characters but leave periods and spaces using Python?
