How to get rid of \ufeff in parsed html page

Question

The code is

!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

from bs4 import BeautifulSoup

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")


data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols])  # copies the row as-is; empty values are kept
data

After some research I added encoding='utf-8-sig' to open(), but the output still contains \ufeff characters.

What puzzles me is that I even tried the hacky approach of running

df = df.replace(u'\ufeff', '')

after loading the data into a pandas DataFrame, and the characters are still there.

Possible duplicate of [python utf-8-sig BOM in the middle of the file when appending to the end](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) — sashaboulouds, Jun 30 '19 at 15:41
For utf-8-sig, I think you should be replacing '\xef\xbb\xbf'. Or you can try decoding with str.decode('utf-8-sig') — Alexandre Juma, Jun 30 '19 at 15:41
@AlexandreJuma sorry, can you elaborate on this? Should I try str.decode on the column? — Mike K, Jun 30 '19 at 16:51
Have you tried `data.append([col.replace(u'\ufeff', '') for col in cols])` ? — Alexandre Juma, Jun 30 '19 at 17:16
I really feel I'm missing something obvious here... I added the line `df = df.append([col.replace(u'\ufeff', '') for col in columns])`. Then I do some data cleanup, and as the last step I cast the Longitude column to float and get this error: `ValueError: ('Unable to parse string " 0.1557\ufeff " at position 0', 'occurred at index Longitude')`, which I assume means that \ufeff is still there? — Mike K, Jun 30 '19 at 17:37
Is there any reason for using pandas dataframes? I've just run your code with a simple string replace and it works fine (i.e., it removes \ufeff). I'll post an answer — Alexandre Juma, Jun 30 '19 at 18:36

score 3 · Answer 1 · answered Jun 14 '22 at 10:39

Try the following:

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")  # body taken from the question's code

blupacetek

score 0 · Answer 2 · answered Jun 30 '19 at 15:41

Try using utf8 instead:

from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

    data = []
    rows = doc.xpath("//table/tbody/tr")
    for row in rows:
        cols = row.xpath("./td/text()")
        cols = [col.strip() for col in cols if col.strip()]
        data.append(cols)

sashaboulouds

I'm afraid I'm getting the same results with `encoding='utf8'` – Mike K Jun 30 '19 at 15:55

score 0 · Accepted Answer · answered Jun 30 '19 at 18:41

I've tried your code using Python 3.6.1 with a simple str.replace(u'\ufeff', '') and it seems to work.

Code tested:

import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")

data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)

Output before replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ... ]

Output after replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N0.1557°E / 51.5607; 0.1557 (Barking and Dagenham)', '25'], ... ]
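
A likely reason `encoding='utf-8-sig'` makes no difference here: that codec only drops a byte order mark at the very start of the stream, whereas the output above shows `\ufeff` in the middle of the coordinate cells, where it is ordinary page content. The following is a minimal sketch of that behaviour using a made-up byte string (the sample text is an assumption, not the actual Wikipedia markup):

```python
import codecs

# Sample bytes: a BOM at the very start AND a U+FEFF character inside the text.
raw = codecs.BOM_UTF8 + "0.1557\ufeff (Barking)".encode("utf-8")

# 'utf-8-sig' removes only the leading BOM; the embedded U+FEFF is kept.
decoded = raw.decode("utf-8-sig")
print(repr(decoded))

# The embedded characters have to be removed explicitly from the string.
cleaned = decoded.replace("\ufeff", "")
print(repr(cleaned))
```

This would be consistent with the per-cell `replace` working while changing the `encoding` argument did not.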

Thanks! `data.append([col.replace(u'\ufeff', '') for col in cols])` or using `os.system` did the trick!! It did mess up column headings, but I'll have to fix this :D — Mike K, Jun 30 '19 at 19:11
Great. If no changes are required, you can accept the answer. — Alexandre Juma, Jun 30 '19 at 19:15
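
A note on the `ValueError` and the `df.replace` attempt mentioned in the question: `float()` does not treat U+FEFF as whitespace, so a value like `" 0.1557\ufeff "` cannot be parsed until the character is removed. Also, `DataFrame.replace` with a plain string matches whole cell values by default rather than substrings, which would explain why `df.replace(u'\ufeff', '')` appeared to do nothing; passing `regex=True` is the usual way to get substring replacement there. A minimal sketch that sidesteps both problems by cleaning each cell before it reaches pandas (`clean_cell` and `to_float` are hypothetical helper names, not part of any library):

```python
def clean_cell(text):
    """Remove U+FEFF and surrounding whitespace from one cell (hypothetical helper)."""
    return text.replace("\ufeff", "").strip()


def to_float(text):
    """Convert a cleaned cell to float (hypothetical helper)."""
    return float(clean_cell(text))


raw_value = " 0.1557\ufeff "  # the value from the error message in the comments

# Parsing the raw value directly fails because of the embedded U+FEFF.
try:
    float(raw_value)
    parsed_directly = True
except ValueError:
    parsed_directly = False

longitude = to_float(raw_value)
print(parsed_directly, longitude)
```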
