3

I'm struggling to get the following csv from this site into Pandas.

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv

I've tried a few things, but so far I can't make a workable csv. The ultimate aim is to be able to make it into a Pandas dataframe.

Can anyone help point me in the right direction and explain why the below doesn't work?

Using Python 3.7, Windows 10

import requests  
import urllib
import csv

csv_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv'

response = urllib.request.urlopen(csv_url)
cr = csv.reader(response)
for row in cr:
    print(row)
# csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

response = urllib.request.urlopen(csv_url)
response = response.read().decode()
cr = csv.reader(response)
for row in cr:
    print(row)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28452: invalid start byte

response = requests.get(csv_url).text
cr = csv.reader(response)
for row in cr:
    print(row)
# malformed, prints individual characters
Pedro Lobito
  • 94,083
  • 31
  • 258
  • 268
Ben Jones
  • 555
  • 6
  • 22
  • Possible duplicate of [UnicodeDecodeError: 'utf-8' codec can't decode byte](https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte) – Berenger Apr 05 '19 at 14:05

3 Answers3

3

If you're using pandas >= 0.19.2, you can input the csv url directly.:

import pandas as pd
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
c=pd.read_csv(url, encoding ='latin1') # otherwise you get a UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte  

Demo1


Otherwise use String.IO, i.e.:

import pandas as pd
import requests
from io import StringIO
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s=requests.get(url).content
c=pd.read_csv(StringIO(s.decode("latin1")))

Demo2

Pedro Lobito
  • 94,083
  • 31
  • 258
  • 268
  • 1
    the edit to encoding='latin1' solved my problem, thank you! – Ben Jones Apr 05 '19 at 14:06
  • @lightalchemist, Seems to be an encoding problem of the [original csv](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv) . It looks already like [this](https://i.imgur.com/3pjeknb.jpg): when I open it in open-office. – Pedro Lobito Apr 05 '19 at 15:12
  • @PedroLobito I opened it in Microsoft Excel (on a Mac) and it shows the '-'. If you use 'cp1252' as the encoding, the mark will show up properly. Open-office probably was not able to figure out the proper encoding. – lightalchemist Apr 08 '19 at 09:21
2

This is an encoding issue as the file appears to use Windows-specific encoding.

df = pd.read_csv(url, encoding='cp1252')

should work.

lightalchemist
  • 10,031
  • 4
  • 47
  • 55
1

Change encoding to cp1252

import pandas as pd
import io
import requests
url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode("cp1252")))
pnv
  • 2,985
  • 5
  • 29
  • 36
  • Can someone explain why cp1252 is necessary over utf-8 or latin1? How would I identify this myself in future? – Ben Jones Apr 05 '19 at 14:14
  • 1
    @BenJones refer https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text. Hopefully it will help. – pnv Apr 05 '19 at 14:21