Unescape the HTML character references:
import html

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)

print(content)
# thing;weight;price;colour
# apple;1;2;red
# m & m's;0;10;several
# cherry;0,5;2;dark red
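For reference, html.unescape handles both named and numeric character references in a single pass, which is what turns the escaped field back into plain text:

```python
import html

# Named (&amp;) and numeric (&#39;) references are both converted.
assert html.unescape("m &amp; m&#39;s") == "m & m's"
assert html.unescape("caf&eacute;") == "café"
```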
Then load the CSV in the usual way:
import pandas as pd
df = pd.read_csv('data-fixed.csv', sep=';')
print(df)
yields
     thing weight  price    colour
0    apple      1      2       red
1  m & m's      0     10   several
2   cherry    0,5      2  dark red
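One wrinkle worth noting: the weight column uses a decimal comma (0,5), so pandas leaves it as strings. If every numeric column in your file uses comma decimals, read_csv's decimal parameter will parse them as floats (sketched here on an in-memory copy of the sample data):

```python
from io import StringIO
import pandas as pd

content = (
    "thing;weight;price;colour\n"
    "apple;1;2;red\n"
    "m & m's;0;10;several\n"
    "cherry;0,5;2;dark red\n"
)

# decimal=',' tells the parser to treat commas as decimal points.
df = pd.read_csv(StringIO(content), sep=';', decimal=',')
print(df['weight'].tolist())  # [1.0, 0.0, 0.5]
```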
Although the data file is "pretty big", you appear to have enough memory to read it into a DataFrame, so you should also have enough memory to read the file into a single string with f.read(). Converting the HTML with one call to html.unescape is faster than calling html.unescape on many smaller strings. This is why I suggest using
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)
instead of something like
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    for line in f:
        g.write(html.unescape(line))
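A quick check with timeit illustrates the difference (the sample lines here are hypothetical, and absolute timings will vary by machine and data):

```python
import html
import timeit

# Hypothetical sample: many short lines containing character references.
lines = ["m &amp; m&#39;s;0;10;several\n"] * 10_000
text = "".join(lines)

def one_call():
    # One call on the whole file contents.
    return html.unescape(text)

def per_line():
    # One call per line.
    return "".join(html.unescape(line) for line in lines)

# Both approaches produce identical output...
assert one_call() == per_line()

# ...but the single call avoids per-call overhead.
print(f"one call: {timeit.timeit(one_call, number=10):.3f}s")
print(f"per line: {timeit.timeit(per_line, number=10):.3f}s")
```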
If you need to read this data file more than once, then it pays to fix it (and save it to disk) so you don't need to call html.unescape every time you wish to parse the data. That's why I suggest writing the unescaped contents to data-fixed.csv.
If reading this data is a one-off task and you wish to avoid the performance or resource cost of writing to disk, then you could use a StringIO (in-memory file-like object):
from io import StringIO
import html
import pandas as pd

with open('data.csv', 'r', encoding='cp1251') as f:
    content = html.unescape(f.read())

df = pd.read_csv(StringIO(content), sep=';')
print(df)