
I'm trying to read a CSV file with pandas' read_csv. The data looks like this (example):

thing;weight;price;colour
apple;1;2;red
m &amp; m's;0;10;several
cherry;0,5;2;dark red

Because of the HTML-escaped ampersand (`&amp;` contains a semicolon), the second data row contains 5 fields according to pandas. How can I make sure that this row gets read correctly?

The example here is pretty much what my data looks like: the separator is ";", there are no string quotes, and the encoding is cp1251. The data I receive is pretty big, and reading it must happen in one step (meaning no preprocessing outside of Python).
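For reference, a minimal sketch of the naive read I'm doing (the file name data.csv is just a placeholder) chokes on the row with the escaped ampersand:

import pandas as pd

# Naive attempt: the ';' inside '&amp;' is treated as a field separator,
# so the second data row appears to have 5 fields instead of 4.
df = pd.read_csv('data.csv', sep=';', encoding='cp1251')
# This fails with a parser error along the lines of:
# CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5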

I didn't find any reference in the pandas docs (I'm using pandas 0.19 with Python 3.5.1). Any suggestions? Thanks in advance.


2 Answers


Unescape the HTML character references:

import html
# Unescape all HTML character references in one pass (e.g. &amp; -> &)
# and save the fixed content to a new file.
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)
print(content)
# thing;weight;price;colour
# apple;1;2;red
# m & m's;0;10;several
# cherry;0,5;2;dark red

Then load the CSV in the usual way:

import pandas as pd
df = pd.read_csv('data-fixed.csv', sep=';')
print(df)

yields

     thing weight  price    colour
0    apple      1      2       red
1  m & m's      0     10   several
2   cherry    0,5      2  dark red

Although the data file is "pretty big", you appear to have enough memory to read it into a DataFrame. Therefore you should also have enough memory to read the file into a single string: f.read(). Converting the HTML with one call to html.unescape is more performant than calling html.unescape on many smaller strings. This is why I suggest using

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)

instead of something like

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    for line in f:
        g.write(html.unescape(line))

If you need to read this data file more than once, then it pays to fix it (and save it to disk) so you don't need to call html.unescape every time you wish to parse the data. That's why I suggest writing the unescaped contents to data-fixed.csv.

If reading this data is a one-off task and you wish to avoid the performance or resource cost of writing to disk, then you could use a StringIO (in-memory file-like object):

from io import StringIO
import html
import pandas as pd

# Unescape in memory and hand the result to read_csv via a file-like object,
# avoiding a second file on disk.
with open('data.csv', 'r', encoding='cp1251') as f:
    content = html.unescape(f.read())
df = pd.read_csv(StringIO(content), sep=';')
print(df)
  • Would `df = pd.read_csv(content, sep=';')` work without writing the (possibly large) file? – IanS Nov 03 '16 at 11:15
  • @IanS: You would need to wrap `content` in a StringIO: `from io import StringIO`, `df = pd.read_csv(StringIO(content), sep=';')`. – unutbu Nov 03 '16 at 11:19

You can use a regex as the separator for pandas.read_csv. In your specific case you can try:

pd.read_csv("test.csv",sep = "(?<!&amp);")
#         thing weight  price    colour
#0        apple      1      2       red
#1  m &amp; m's      0     10   several
#2       cherry    0,5      2  dark red

to select all the `;` not preceded by `&amp`. This can be extended to other escaped characters.
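For example, one way to extend this (a sketch, assuming the other entities you care about are &lt, &gt and &quot) is to chain several fixed-width negative lookbehinds. Note that Python's re module, which read_csv uses for regex separators, rejects variable-width lookbehinds such as `&[a-z]*`:

import pandas as pd

# Split on ';' only when it is not the terminator of one of the listed
# entity names; each lookbehind is individually fixed-width, as re requires.
df = pd.read_csv(
    "test.csv",
    sep=r"(?<!&amp)(?<!&lt)(?<!&gt)(?<!&quot);",
    engine="python",  # regex separators need the Python parsing engine
)
print(df)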

  • If I try to modify this by using `sep = "(?<!&[a-z]*);"` then I get an error. But liked the idea... –  Nov 17 '16 at 09:21