Getting encoded csv from url into Pandas

Question

I'm struggling to get the following csv from this site into Pandas.

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv

I've tried a few things, but so far I can't make a workable csv. The ultimate aim is to be able to make it into a Pandas dataframe.

Can anyone help point me in the right direction and explain why the below doesn't work?

Using Python 3.7, Windows 10

import requests  
import urllib
import csv

csv_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv'

response = urllib.request.urlopen(csv_url)
cr = csv.reader(response)
for row in cr:
    print(row)
# csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

response = urllib.request.urlopen(csv_url)
response = response.read().decode()
cr = csv.reader(response)
for row in cr:
    print(row)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28452: invalid start byte

response = requests.get(csv_url).text
cr = csv.reader(response)
for row in cr:
    print(row)
# malformed, prints individual characters

Possible duplicate of [UnicodeDecodeError: 'utf-8' codec can't decode byte](https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte) — Berenger, Apr 05 '19 at 14:05

Pedro Lobito · Accepted Answer · 2019-04-08T12:29:13.033

3

If you're using pandas >= 0.19.2, you can input the csv url directly.:

import pandas as pd
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
c=pd.read_csv(url, encoding ='latin1') # otherwise you get a UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte

Demo1

Otherwise use String.IO, i.e.:

import pandas as pd
import requests
from io import StringIO
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s=requests.get(url).content
c=pd.read_csv(StringIO(s.decode("latin1")))

Demo2

edited Apr 08 '19 at 12:29

answered Apr 05 '19 at 13:57

Pedro Lobito

94,083
31
258
268

1

the edit to encoding='latin1' solved my problem, thank you! – Ben Jones Apr 05 '19 at 14:06
@lightalchemist, Seems to be an encoding problem of the [original csv](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv) . It looks already like [this](https://i.imgur.com/3pjeknb.jpg): when I open it in open-office. – Pedro Lobito Apr 05 '19 at 15:12
@PedroLobito I opened it in Microsoft Excel (on a Mac) and it shows the '-'. If you use 'cp1252' as the encoding, the mark will show up properly. Open-office probably was not able to figure out the proper encoding. – lightalchemist Apr 08 '19 at 09:21

score 2 · Answer 2 · answered Apr 05 '19 at 14:02

2

This is an encoding issue as the file appears to use Windows-specific encoding.

df = pd.read_csv(url, encoding='cp1252')

should work.

answered Apr 05 '19 at 14:02

lightalchemist

10,031
4
47
55

score 1 · Answer 3 · answered Apr 05 '19 at 14:07

1

Change encoding to cp1252

import pandas as pd
import io
import requests
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode("cp1252")))

answered Apr 05 '19 at 14:07

pnv

2,985
5
29
36

Can someone explain why cp1252 is necessary over utf-8 or latin1? How would I identify this myself in future? – Ben Jones Apr 05 '19 at 14:14
1

@BenJones refer https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text. Hopefully it will help. – pnv Apr 05 '19 at 14:21

Getting encoded csv from url into Pandas

3 Answers3