1

I am cleaning up a CSV file in Python/Pandas, comma delimited.

Some of the cells have &amp; as part of the text. When I run read_csv(), it treats the semicolon in that entity as the end of the current cell and offsets the rest of the row.

I've tried encoding='utf8' and various other options...

EDIT** My code:

import pandas as pd

file = pd.read_csv('my-data-1.csv', encoding='utf8', index_col=False, low_memory=False)

# remove the copyright line at the end
file.drop(file.tail(1).index, inplace=True)

# drop the duplicates based on column Project Id
file_drop_dupes = file.drop_duplicates(['Project Id'])

# drop all columns except these few
keep_col = ['Project Id', 'Project Name', 'Type']
new_file = file_drop_dupes[keep_col]

# write the result to a new csv file
new_file.to_csv('all-good-1.csv', index=False)

An example of a field with HTML:

Service Maintenance &amp; Supply
mustacheMcGee
  • Can you post an example and some code? I don't see this issue in my little test using pd.read_csv() – Stev Feb 15 '18 at 16:32
  • Just added more context thx – mustacheMcGee Feb 15 '18 at 16:47
  • And when you say that the rest of the row is offset, are you saying that Pandas is interpreting the semi-colon as the end of the field? If I create a df with that example field in, I can read it fine. Sorry I can't seem to help. – Stev Feb 15 '18 at 17:44
  • Yes, pandas is using the semicolon as the end of that field and starting a new field. So on rows that contain the HTML character (maybe 10%), there ends up being an extra column at the end. – mustacheMcGee Feb 16 '18 at 14:58
  • I don't see the problem on my test but it looks like [this](https://stackoverflow.com/questions/40399640/reading-csv-files-with-python-pandas-when-there-is-html-escaped-string-in-ther) might help you: – Stev Feb 16 '18 at 15:19
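For anyone trying to reproduce this, here is a minimal sketch of the test described in the comments above. The sample rows are made up (only the column names come from the question), and the `left`/`right` column names are placeholders. It suggests that with the default comma separator the `&amp;` entity stays inside one field, and that a split at the semicolon only appears when `;` is being used as the delimiter.

```python
import io

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
raw = (
    "Project Id,Project Name,Type\n"
    "1,Service Maintenance &amp; Supply,A\n"
    "2,Plain Name,B\n"
)

# Default comma separator: the entity stays inside a single field.
df_comma = pd.read_csv(io.StringIO(raw))
print(df_comma.shape)

# Semicolon separator: the same line is cut at the ';' in '&amp;'.
# The names 'left' and 'right' are placeholders for this demonstration.
df_semi = pd.read_csv(
    io.StringIO(raw), sep=";", header=None, names=["left", "right"]
)
print(df_semi)
```

If the real file behaves like the second case, it may be worth checking which `sep`/`delimiter` is actually in effect when the file is read.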

2 Answers

0

In Python 3.4+, it's a simple html.unescape(). Before that, use HTMLParser().unescape() from html.parser. See this answer.
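A minimal sketch of how this could be applied to a DataFrame column after reading the file, assuming Python 3.4+ and that the entities sit in the 'Project Name' column. The sample rows are invented for illustration.

```python
import html

import pandas as pd

# Hypothetical rows standing in for the data returned by pd.read_csv().
df = pd.DataFrame(
    {
        "Project Id": [1, 2],
        "Project Name": ["Service Maintenance &amp; Supply", "Plain Name"],
    }
)

# Decode HTML entities in each cell; na_action="ignore" skips missing values,
# which html.unescape() cannot handle because it expects a string.
df["Project Name"] = df["Project Name"].map(html.unescape, na_action="ignore")

print(df["Project Name"].tolist())
```

Unescaping after parsing, rather than on the raw file text, avoids turning an entity such as `&quot;` into a character that the CSV parser would then interpret.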

Personman
  • Can you explain how I'd integrate that? I tried something similar to this: https://stackoverflow.com/questions/40399640/reading-csv-files-with-python-pandas-when-there-is-html-escaped-string-in-ther However I get an error: UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' – mustacheMcGee Feb 15 '18 at 16:55
  • I am in Python 3.5 btw – mustacheMcGee Feb 15 '18 at 16:57
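On the UnicodeEncodeError: the failing code is not shown, so this is only a guess. A 'charmap' error usually means text is being written or printed with a legacy code page (for example cp1252 on Windows) that has no mapping for U+200E, the left-to-right mark, which html.unescape() produces from an entity such as `&lrm;`. A sketch under that assumption, with invented sample data and a temporary output path:

```python
import html
import os
import tempfile

import pandas as pd

# Hypothetical cell containing an entity that decodes to U+200E.
df = pd.DataFrame({"Project Name": ["Service Maintenance &amp; Supply&lrm;"]})

# Decode the entities, then drop the invisible left-to-right mark.
decoded = df["Project Name"].map(html.unescape, na_action="ignore")
df["Project Name"] = decoded.str.replace("\u200e", "", regex=False)

# Write with an explicit UTF-8 encoding instead of the platform default.
out_path = os.path.join(tempfile.mkdtemp(), "all-good-1.csv")
df.to_csv(out_path, index=False, encoding="utf-8")

round_trip = pd.read_csv(out_path, encoding="utf-8")
print(round_trip)
```

Writing as UTF-8 should be enough on its own to avoid the error; stripping the mark is optional and only makes sense if it is unwanted in the output.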
0

If you are using Python 3.4+, html.unescape() is the solution.

SChowdhury