I have an HTML file called test.html that contains a single word: בדיקה.

I open test.html and print its contents using this block of code:

file = open("test.html", "r")
print file.read()

but it prints ??????. Why does this happen, and how can I fix it?

BTW, when I open a plain text file it works fine.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
smci
david

8 Answers

import codecs
# Pass the file's encoding (UTF-8 is assumed here) so read() returns decoded text
f = codecs.open("test.html", 'r', 'utf-8')
print f.read()

Try something like this.

vks

I encountered this problem today as well. I am using Windows, and the default system language is Chinese, so open() does not decode the file as UTF-8 unless told to. Anyone with a similar setup may hit the same Unicode error. Simply add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
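
A minimal sketch of why the explicit encoding matters, assuming the file is saved as UTF-8. The temporary file and the cp1252 and ASCII codecs are stand-ins chosen for illustration; they are not taken from the question.

```python
import os
import tempfile

word = "בדיקה"

# Write the word as UTF-8 to a temporary file (a stand-in for test.html).
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(word)

# Decoding with the matching codec returns the original text.
with open(path, "r", encoding="utf-8") as f:
    good = f.read()

# Decoding the same bytes with a different codec yields mojibake.
with open(path, "r", encoding="cp1252", errors="replace") as f:
    garbled = f.read()

# A codec that cannot represent these bytes yields only replacement characters.
with open(path, "r", encoding="ascii", errors="replace") as f:
    replaced = f.read()

os.remove(path)

print(good == word)
print(garbled == word)
print(len(replaced))
```

The same bytes are on disk in all three reads; only the codec used to interpret them changes, which is why naming the encoding explicitly is the fix.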
Chen Mier

You can use the following code:

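from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

# Decode the file as UTF-8 (assumed), then strip the markup.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
with codecs.open("test.html", 'r', 'utf-8') as f:
    document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)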

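If you also want to drop the blank lines and collect the words into a single string (skipping special characters, numbers, and stop words), add the following. Note that the pattern below keeps only ASCII letters, so non-Latin words are dropped as well:

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Needs the NLTK "punkt" and "stopwords" data, installed via nltk.download().
# Assumption: "stop" was meant to be the English stop-word list.
stop = set(stopwords.words("english"))
docwords = word_tokenize(document)
st = ""  # st must start out as an empty string
for line in docwords:
    line = line.strip()
    # Keep purely alphabetic tokens longer than one character that are not stop words
    if re.match("^[A-Za-z]+$", line) and line not in stop and len(line) > 1:
        st = st + " " + line
print(st)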
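
If BeautifulSoup is not installed, the get_text() step above can be roughly approximated with the standard library's html.parser. This is a swapped-in technique, not what the answer uses: the names TextExtractor and html_to_text are hypothetical, and unlike get_text() this sketch skips the contents of script and style elements and joins the remaining text nodes with single spaces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><p>בדיקה</p> <p>test</p></body></html>")
extracted = html_to_text(sample)
print(len(extracted.split()))
```

The input to html_to_text must already be a decoded string, so the file still has to be opened with the right encoding first.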

Rui Peres
Dibin Joseph

You can read an HTML page using urllib.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page
Benjamin
  • How can I do operations on `page`, like reading particular words from it? Can I use `page` like a string? – Sooraj Sep 04 '15 at 09:23

Use codecs.open with the encoding parameter.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')
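
When the file's encoding is not known in advance, one option is to try a few candidates in order. This is only a sketch: the helper name read_text_with_fallback and the candidate list (UTF-8, then windows-1255, a legacy Hebrew code page) are assumptions for illustration, and it uses the built-in open(), which accepts an encoding argument in Python 3 much like codecs.open does.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1255")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings matched: %r" % (encodings,))


# Demo: bytes saved as windows-1255 are not valid UTF-8, so the fallback is used.
fd, demo_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write("בדיקה".encode("cp1255"))

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used)
```

Order matters here: strict codecs such as UTF-8 should come first, because single-byte code pages accept many byte sequences and would otherwise mask a wrong guess.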
wenzul

CODE:

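import codecs

path = "D:\\Users\\html\\abc.html"
# Open with an explicit encoding (UTF-8 is assumed) so read() returns text;
# calling str() on bytes read in "rb" mode gives a b'...' repr in Python 3.
with codecs.open(path, "r", "utf-8") as file:
    file1 = file.read()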

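If the page is served over HTTP, you can simply use this:

import requests

# url must be the http(s) address of the page; requests does not open local file paths
response = requests.get(url)
print(response.text)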

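You can use urllib in Python 3 in the same way as in

https://stackoverflow.com/a/27243244/4815313, with a few changes.

# Python 3
import urllib.request  # "import urllib" alone does not load the request submodule

# urlopen needs a full URL including the scheme (http://, https:// or file://)
page = urllib.request.urlopen("/path/").read()
# read() returns bytes; decode them (UTF-8 is assumed) to get text
print(page.decode("utf-8"))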
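
As a hedged illustration of the Python 3 approach for a local file, the sketch below builds a file:// URL with pathlib, reads the bytes through urlopen, and decodes them, after which the result behaves like any other string. The temporary file and its contents are made up for the demo, and UTF-8 is an assumption about how the file was saved.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Create a small local HTML file to stand in for test.html.
fd, tmp_name = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("<html><body>בדיקה</body></html>")

# urlopen needs a URL with a scheme; Path.as_uri() builds a file:// URL.
url = Path(tmp_name).resolve().as_uri()
with urllib.request.urlopen(url) as response:
    raw = response.read()      # bytes, not str

page = raw.decode("utf-8")     # str, assuming the file is UTF-8
os.remove(tmp_name)

# Once decoded, page supports the usual string operations.
has_word = "בדיקה" in page
start = page.find("<body>") + len("<body>")
body_text = page[start:page.find("</body>")]
print(has_word, len(body_text))
```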
Striezel
Suresh2692