
The code is:

from bs4 import BeautifulSoup

# Shell command run from a notebook cell: download the page to a local file
!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", {"class": "wikitable sortable"})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]  # cell text, surrounding whitespace removed
    data.append(cols)
data

After some research I added encoding='utf-8-sig' to open(), but the output still contains \ufeff characters.

What puzzles me is that I even tried the hacky approach:

df = df.replace(u'\ufeff', '') 

after loading the data into a pandas DataFrame, and the characters are still there.

wjandrea
Mike K
  • Which version of python are you running? – Alexandre Juma Jun 30 '19 at 15:25
  • Python 3.5.5 :: Anaconda, Inc. – Mike K Jun 30 '19 at 15:29
  • Possible duplicate of [python utf-8-sig BOM in the middle of the file when appending to the end](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) – sashaboulouds Jun 30 '19 at 15:41
  • For utf-8-sig, I think you should be replacing '\xef\xbb\xbf'. Or you can try decoding with str.decode('utf-8-sig') – Alexandre Juma Jun 30 '19 at 15:41
  • @AlexandreJuma sorry, can you elaborate on this? Should I try str.decode on the column? – Mike K Jun 30 '19 at 16:51
  • Have you tried `data.append([col.replace(u'\ufeff', '') for col in cols])` ? – Alexandre Juma Jun 30 '19 at 17:16
  • I really feel I'm missing something obvious here... Added a line `df = df.append([col.replace(u'\ufeff', '') for col in columns])`. Then I do some data cleanup and as the last step I cast the longitude column to float and get this error: `ValueError: ('Unable to parse string " 0.1557\ufeff " at position 0', 'occurred at index Longitude')`, which I assume means that \ufeff is still there? – Mike K Jun 30 '19 at 17:37
  • Is there any reason for using pandas dataframes? I've just run your code with a simple string replace and it works fine (i.e. it removes \ufeff). I'll post an answer – Alexandre Juma Jun 30 '19 at 18:36
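
A note on the `df.replace(u'\ufeff', '')` attempt discussed above: with a plain string argument, `DataFrame.replace` only matches cells whose entire value equals that string, so a `\ufeff` embedded in a longer cell is left alone. Passing `regex=True` makes it a substring substitution. The sketch below is only an illustration; the frame and its two `Longitude` values are made up, modeled on the string in the error message.

```python
import pandas as pd

# Hypothetical sample data, shaped like the value in the ValueError above
df = pd.DataFrame({"Longitude": [" 0.1557\ufeff ", "-0.2817\ufeff"]})

# Plain replace: only cells that are exactly '\ufeff' would be changed
unchanged = df.replace("\ufeff", "")

# regex=True: '\ufeff' is removed wherever it occurs inside a cell
cleaned = df.replace("\ufeff", "", regex=True)

# Strip the remaining ordinary whitespace, then cast to float
cleaned["Longitude"] = cleaned["Longitude"].str.strip().astype(float)
```

Cleaning the strings before they go into the DataFrame, as in the comment above, avoids the issue altogether.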

3 Answers


Try the following:

with open('boroughs.html', encoding='utf-8-sig') as fp:
blupacetek

Try using utf8 instead:

from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

    data = []
    rows = doc.xpath("//table/tbody/tr")
    for row in rows:
        cols = row.xpath("./td/text()")
        cols = [col.strip() for col in cols if col.strip()]
        data.append(cols)

sashaboulouds

I've tried your code using Python 3.6.1 with a simple str.replace(u'\ufeff', '') and it seems to work.

Code tested:

import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")

data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)

Output before replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ... ]

Output after replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N0.1557°E / 51.5607; 0.1557 (Barking and Dagenham)', '25'], ... ]
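
As for why `encoding='utf-8-sig'` did not help: that codec only drops a byte order mark at the very start of the input, while the `\ufeff` characters in the output above sit in the middle of the coordinate cell. `str.strip()` does not remove them either, since `\ufeff` is not classed as whitespace. A small sketch with a made-up string (not the real page content):

```python
# Hypothetical sample: one BOM at the start, one '\ufeff' inside the text
raw = "\ufeffLongitude: 0.1557\ufeff ".encode("utf-8")

# utf-8-sig removes only the leading BOM while decoding
text = raw.decode("utf-8-sig")

# strip() removes the trailing space but leaves the embedded '\ufeff'
stripped = text.strip()

# An explicit replace removes it wherever it appears
cleaned = text.replace("\ufeff", "").strip()
```

This is why the explicit `col.replace(u'\ufeff', '')` in the loop is needed on top of the codec.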

Alexandre Juma
  • Thanks! `data.append([col.replace(u'\ufeff', '') for col in cols])` or using `os.system` did the trick!! It did mess up column headings, but I'll have to fix this :D – Mike K Jun 30 '19 at 19:11
  • Great. If no changes are required, you can accept the answer. – Alexandre Juma Jun 30 '19 at 19:15