Pandas read_csv() with HTML special characters

Question

I am cleaning up a CSV file in Python/Pandas, comma delimited.

Some of the cells contain &amp;amp; as part of the text. When I run read_csv(), it treats the semicolon in that entity as the end of the current cell and offsets the rest of the row.

I've tried encoding='utf8' and various other options...

EDIT: My code:

import pandas as pd

file = pd.read_csv('my-data-1.csv', encoding='utf8', index_col=False, low_memory=False)

file.drop(file.tail(1).index, inplace=True)  # remove the copyright line at the end

file_drop_dupes = file.drop_duplicates(['Project Id'])  # drop duplicates based on column Project Id

#drop all columns except these few
keep_col = ['Project Id','Project Name', 'Type']
new_file = file_drop_dupes[keep_col]
#write the result to a new csv file
new_file.to_csv('all-good-1.csv', index=False)

An example of a field with HTML:

Service Maintenance &amp; Supply

Can you post an example and some code? I don't see this issue in my little test using pd.read_csv() — Stev, Feb 15 '18 at 16:32
And when you say that the rest of the row is offset, are you saying that Pandas is interpreting the semi-colon as the end of the field? If I create a df with that example field in, I can read it fine. Sorry I can't seem to help. — Stev, Feb 15 '18 at 17:44
Yes, pandas is using the semicolon as the end of that field and starting a new field. So on rows that contain the HTML character (maybe 10%), there ends up being an extra column at the end. — mustacheMcGee, Feb 16 '18 at 14:58
I don't see the problem on my test but it looks like [this](https://stackoverflow.com/questions/40399640/reading-csv-files-with-python-pandas-when-there-is-html-escaped-string-in-ther) might help you: — Stev, Feb 16 '18 at 15:19

score 0 · Answer 1 · answered Feb 15 '18 at 16:31

In Python 3.4+, it's a simple html.unescape(). Before that, use HTMLParser.unescape() from html.parser. See this answer.
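
One way to integrate this, as a rough sketch rather than a confirmed fix for the asker's file: unescape the raw text before pandas parses it, so the entity's semicolon never reaches the parser. The helper name `read_csv_unescaped` and the sample rows are made up for illustration; only `html.unescape()`, `io.StringIO` and `pd.read_csv()` are real library calls.

```python
import html
import io

import pandas as pd

# Hypothetical sample standing in for the contents of 'my-data-1.csv'.
raw = (
    "Project Id,Project Name,Type\n"
    "1,Service Maintenance &amp; Supply,A\n"
    "2,Plain name,B\n"
)


def read_csv_unescaped(text, **kwargs):
    """Decode HTML entities in raw CSV text, then parse the result with pandas."""
    return pd.read_csv(io.StringIO(html.unescape(text)), **kwargs)


df = read_csv_unescaped(raw)
print(df["Project Name"].tolist())

# For a real file, read the text first (assumption: the file is UTF-8):
# with open('my-data-1.csv', encoding='utf8') as fh:
#     df = read_csv_unescaped(fh.read(), index_col=False)
```

A caveat with this approach: entities such as &amp;quot; or &amp;#44; decode to characters that carry meaning in CSV (a quote, a comma), so unescaping before parsing can itself shift columns if the file contains them.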

Personman

Can you explain how I'd integrate that? I tried something similar to this: https://stackoverflow.com/questions/40399640/reading-csv-files-with-python-pandas-when-there-is-html-escaped-string-in-ther However I get an error: UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' – mustacheMcGee Feb 15 '18 at 16:55
I am in Python 3.5 btw – mustacheMcGee Feb 15 '18 at 16:57
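
A possible explanation for the UnicodeEncodeError above, offered as an assumption since the failing code is not shown: a 'charmap' error usually means the text was written or printed through a legacy code page such as cp1252, which has no mapping for '\u200e' (left-to-right mark). The sketch below reproduces that failure and shows that passing encoding='utf-8' explicitly avoids it. The sample string and temp file are illustrative only.

```python
import os
import tempfile

# Hypothetical field value containing the character from the error message.
text = "Service Maintenance & Supply\u200e"

# cp1252 (a common Windows default) cannot represent U+200E.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding works regardless of the platform default.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(path, "w", encoding="utf-8", newline="") as fh:
    fh.write(text)

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

print(cp1252_ok, round_trip == text)
```

The same idea applies to pandas output: `DataFrame.to_csv()` accepts an `encoding` argument, so `new_file.to_csv('all-good-1.csv', index=False, encoding='utf-8')` would be the equivalent call for the code in the question.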

score 0 · Answer 2 · answered Feb 15 '18 at 16:34

If you are using Python 3.4+, html.unescape() is the solution.
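
If the file already parses into the right columns (for example because the affected fields are quoted), another option is to unescape after loading. This is a minimal sketch under that assumption; `unescape_text_columns` is a hypothetical helper and the sample data is invented.

```python
import html
import io

import pandas as pd

# Hypothetical sample; the entity sits inside a quoted field.
sample = (
    "Project Id,Project Name,Type\n"
    '1,"Service Maintenance &amp; Supply",A\n'
    "2,Plain,B\n"
)
frame = pd.read_csv(io.StringIO(sample))


def unescape_text_columns(df):
    """Return a copy of df with HTML entities decoded in every string cell."""
    out = df.copy()
    for col in out.columns:
        # Non-string values (numbers, NaN) pass through unchanged.
        out[col] = out[col].map(
            lambda v: html.unescape(v) if isinstance(v, str) else v
        )
    return out


cleaned = unescape_text_columns(frame)
print(cleaned["Project Name"].tolist())
```

Unescaping after parsing avoids the risk of decoded commas or quotes disturbing the column layout, at the cost of not helping when the rows are already misaligned.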

SChowdhury
