How to open html file that contains Unicode characters?

Question

I have an HTML file called test.html that contains a single word, בדיקה.

I open test.html and print its contents using this block of code:

file = open("test.html", "r")
print file.read()

but it prints ??????. Why does this happen, and how can I fix it?

BTW, when I open a plain text file it works fine.

Edit: I also tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

You need to open your file as UTF-8. http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python – Tanveer Alam, Dec 02 '14 at 06:26
If it is still not working, post the page you are trying to process. – wenzul, Dec 02 '14 at 07:45

vks · Accepted Answer · 2014-12-02T06:57:34.513

62

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try something like this.

edited Dec 02 '14 at 06:57

answered Dec 02 '14 at 06:34

vks


I also tried codecs.open("test.html",'r','utf-8'), but when I print f.read() I get a Unicode decode error! – david Dec 02 '14 at 06:42
I am using a terminal! – david Dec 02 '14 at 06:42
I got this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: invalid continuation byte – david Dec 02 '14 at 06:50
`import sys; print sys.stdout.encoding` prints UTF-8 – david Dec 02 '14 at 06:54
The file wasn't encoded as UTF-8; it was windows-1255! – david Dec 02 '14 at 10:06
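
The last comment points at the actual fix: decode the file with the encoding it was saved in. A minimal sketch, assuming the file really is windows-1255 (Python's codec name for it is "cp1255"); the helper name read_html is illustrative and not from the thread. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
import io


def read_html(path, encoding="cp1255"):
    """Return the file's contents as text, decoded with the given encoding.

    The default "cp1255" (windows-1255) is an assumption taken from the
    asker's last comment; pass the real encoding of your own file.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Usage, assuming test.html exists and is windows-1255 encoded:
# print(read_html("test.html"))
```

Whether the text then displays correctly still depends on the terminal being able to show Hebrew; the asker's sys.stdout.encoding was UTF-8, which can.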

score 26 · Answer 2 · answered Jun 30 '18 at 23:15

I ran into this problem today as well. I am on Windows with Chinese as the default system language, so open() does not default to UTF-8; anyone with a non-UTF-8 system locale may hit the same Unicode error. If the file is UTF-8, simply add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
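
When the encoding is not known in advance, one option is to try a short list of candidates in order and keep the first that decodes cleanly. A minimal sketch for Python 3, assuming the file is either UTF-8 or windows-1255; the helper name read_with_fallback and the candidate list are illustrative, not from any answer here.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1255")):
    """Return (text, encoding) for the first candidate that decodes the file.

    Strict encodings such as UTF-8 should come first: single-byte code pages
    accept almost any byte sequence, so they rarely raise an error.
    """
    with open(path, "rb") as f:
        raw = f.read()  # read bytes once, then try each decoder
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes; try the next
    raise ValueError("none of the candidate encodings could decode %r" % path)
```

A clean decode is not proof that the guess is right, so it is worth checking the returned text by eye when a single-byte code page is the one that matched.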

score 16 · Answer 3 · edited Nov 12 '21 at 09:57

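You can use the following code:

from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

# Decode the file as UTF-8 and close it when done.
with codecs.open("test.html", 'r', 'utf-8') as f:
    # Name the parser explicitly so bs4 does not have to guess one.
    document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)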

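If you want to drop the blank lines and collect only the alphabetic words into one string (skipping special characters, numbers, and stop words), also include:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: `stop` was meant to be NLTK's English stop-word list. The
# stop-word corpus and tokenizer data must be fetched once with nltk.download().
stop = set(stopwords.words('english'))
st = ""  # accumulates the kept words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    # Keep ASCII-alphabetic tokens longer than one character that are not stop words.
    if line and re.match("^[A-Za-z]*$", line) and line not in stop and len(line) > 1:
        st = st + " " + line
print(st)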

score 8 · Answer 4 · answered Dec 02 '14 at 06:33

You can read an HTML page using 'urllib'.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page

Benjamin

How can I do operations on `page`, like reading particular words from it? Can I use `page` like a string? – Sooraj Sep 04 '15 at 09:23
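
On the question in the comment above: in Python 2 the value returned by read() is a byte string, and in Python 3 it is a bytes object, so it is usually decoded first and then treated like any other string. A minimal Python 3 sketch; the sample markup is a stand-in for what urlopen(...).read() would return, and UTF-8 is an assumption about the page's encoding.

```python
# Stand-in for the bytes returned by urlopen(...).read(); no network needed.
page = "<p>בדיקה one two</p>".encode("utf-8")

# bytes -> str. UTF-8 is assumed here; use the page's real encoding.
text = page.decode("utf-8")

has_word = "בדיקה" in text   # membership test on the decoded text
position = text.find("one")  # index of the first match, -1 if absent
# Strip the tags by hand for this tiny sample, then split on whitespace.
words = text.replace("<p>", "").replace("</p>", "").split()
```

For real pages an HTML parser is a better way to get at the words than replacing tags by hand; the point here is only that the decoded text supports the usual string operations.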

score 6 · Answer 5 · answered Dec 02 '14 at 07:43

Use codecs.open with the encoding parameter.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')

wenzul


SHUBHAM SINGH · Answer 6 · 2019-02-01T10:55:14.140

1

CODE:

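import codecs

path = "D:\\Users\\html\\abc.html"
# Read the raw bytes, then decode them. On Python 3, str() on bytes gives the
# "b'...'" repr rather than the text. UTF-8 is an assumption; use the file's
# real encoding.
with codecs.open(path, "rb") as file:
    file1 = file.read().decode("utf-8")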

edited Feb 01 '19 at 10:55

answered Feb 01 '19 at 10:50


score 0 · Answer 7 · answered Jun 16 '21 at 17:57

You can simply use this:

import requests

# `url` is a placeholder for the page's http(s) address.
response = requests.get(url)
print(response.text)

Ayemun Hossain Ashik


score -2 · Answer 8 · edited Jun 24 '18 at 12:43

You can use 'urllib' in Python 3 in the same way as

https://stackoverflow.com/a/27243244/4815313 with a few changes.

# Python 3
import urllib.request  # a bare `import urllib` does not load the request submodule

page = urllib.request.urlopen("/path/").read()
print(page)
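
Note that urlopen expects a URL, so a local file has to be passed as a file:// URL rather than a bare filesystem path. A minimal Python 3 sketch of one way to do that; the helper name read_local_html and the UTF-8 default are assumptions for illustration.

```python
import pathlib
import urllib.request


def read_local_html(path, encoding="utf-8"):
    """Read a local HTML file through urllib and return it as text.

    `encoding` is an assumption; pass the encoding the file was saved in.
    """
    # Turn the filesystem path into an absolute file:// URL.
    url = pathlib.Path(path).resolve().as_uri()
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)
```

For a file that is already on disk, the built-in open() with an explicit encoding does the same job with less machinery; going through urllib mainly makes sense when the same code must also handle http(s) addresses.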

Striezel


answered Feb 09 '16 at 13:13

Suresh2692


`AttributeError: 'module' object has no attribute 'request'` – tommy.carstensen Jan 08 '17 at 12:46
@tommy.carstensen maybe you should take a look at this: [urllib python3](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) – Suresh2692 Jan 08 '17 at 16:26
Thanks. I'm quite familiar with that document. The indentation is wrong and it should be `import urllib.request`. – tommy.carstensen Jan 08 '17 at 16:51
