Extracting plain text from HTML using Python

Question

I'm trying to extract plain text from a website using Python. My code looks something like this (a slightly modified version of what I found here):

import requests
import urllib
from bs4 import BeautifulSoup
url = "http://www.thelatinlibrary.com/vergil/aen1.shtml"
r = requests.get(url)
k = r.content
file = open('C:\\Users\\Anirudh\\Desktop\\NEW2.txt','w')
soup = BeautifulSoup(k)
for script in soup(["Script","Style"]):
    script.exctract()
text = soup.get_text
file.write(repr(text))

This doesn't seem to work. I'm guessing that BeautifulSoup doesn't accept r.content. What can I do to fix this?

This is the error -

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))
Traceback (most recent call last):
  File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module>
    file.write(repr(text))
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>

Process finished with exit code 1

Try `soup = BeautifulSoup(k, 'html.parser')` and tell me if the error changes. — Harrison, Aug 14 '16 at 13:14
@Harrison, it is now: Traceback (most recent call last): File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module> file.write(repr(text)) File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>. Oh, by the way, what was that warning, and what happened to it when you included html.parser? — Vrisk, Aug 14 '16 at 13:18
@AnirudhGanesh If you look at the error message it's telling you that it can't encode this character http://www.codetable.net/hex/97 — Harrison, Aug 14 '16 at 13:21
@Harrison, I don't get it; there are no special characters in the source material - http://www.thelatinlibrary.com/vergil/aen1.shtml — Vrisk, Aug 14 '16 at 13:37

James K · Accepted Answer · score 2 · 2016-08-14T13:56:58.393

The "error" is a warning and is of no consequence. Silence it by naming the parser explicitly: soup = BeautifulSoup(k, 'html.parser')

There also seems to be a typo: script.exctract() should be script.extract().

The actual error seems to be an encoding problem: the content is a bytestring, but you are writing to a file opened in text mode, which (per the traceback) encodes with cp1252. The source contains an em dash, and that character is the one that fails to encode.

There are a few things to try. You can encode the output yourself with soup.encode("utf-8"), although this means hardcoding the encoding into your script (which is bad). You can open the file in binary mode with open(..., 'wb'). Or you can convert the content to a string before passing it to Beautiful Soup, using the correct encoding for that page, for example k = str(r.content, "utf-8").
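The encoding side of this can be sketched with the standard library alone. This is only an illustration under assumptions: the sample bytes and the file name are made up, and the guess that the page is Windows-1252 (where byte 0x97 is an em dash) is not confirmed anywhere in the question. It shows one plausible way to end up with the '\x97' from the traceback, and how decoding explicitly and naming the output encoding avoids it.

```python
import os
import tempfile

# Hypothetical sample bytes; 0x97 is the em dash in Windows-1252 (cp1252).
raw = b"Arma virumque cano \x97 Troiae qui primus ab oris"

# Decoding as Latin-1 turns 0x97 into the control character U+0097,
# which cp1252 cannot encode: the same kind of failure as in the traceback.
misdecoded = raw.decode("latin-1")
try:
    misdecoded.encode("cp1252")
    reproduced = False
except UnicodeEncodeError:
    reproduced = True

# Decoding with the (assumed) correct encoding yields a real em dash, U+2014.
text = raw.decode("cp1252")

# Name the output encoding explicitly instead of relying on the platform default.
path = os.path.join(tempfile.gettempdir(), "aeneid_sample.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Read the file back to confirm the text survived the round trip.
with open(path, encoding="utf-8") as src:
    round_trip = src.read()
```

Passing encoding="utf-8" to open() keeps the file in text mode, so the rest of the script can keep writing str values rather than bytes.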

edited Aug 14 '16 at 13:56

answered Aug 14 '16 at 13:21

Still same error with typo corrected and use of repr(k) – Vrisk Aug 14 '16 at 13:34
maybe convert to a string before passing to beautiful soup. – James K Aug 14 '16 at 13:41
I'm sorry, I think I misunderstood, but str(k) doesn't help either. I did what you said in the answer and still got the same result. – Vrisk Aug 14 '16 at 13:44
can you `print` instead of `file.write` ? – James K Aug 14 '16 at 13:47
May I ask what's going on ? – Vrisk Aug 14 '16 at 13:53
The problem is the encoding of the em dash. See the docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous – James K Aug 14 '16 at 13:58
Hey, doesn't your line of code convert everything to UTF-8 before writing? I edited my code as you recommended, but I still got the same error. – Vrisk Aug 14 '16 at 14:14
I seem to have fixed it; the code ought to have been soup.get_text() – Vrisk Aug 14 '16 at 15:39
Please write an answer. You are encouraged to solve your own problems. – James K Aug 14 '16 at 17:58

score 0 · Answer 2 · answered Aug 16 '16 at 19:06

There was a — in the source, which resulted in an error because the '—' was not UTF-8. Changing the encoding before passing the text on to BeautifulSoup fixed the issue.

Another error was due to soup.get_text. Leaving out the () meant I was referencing the method, not its output.
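The difference between the two forms can be seen in a small sketch. The markup below is a made-up stand-in for the downloaded page, and the lowercase tag names are an assumption about what was intended: as far as I can tell, html.parser lowercases tag names, so "Script" and "Style" would match nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Arma virumque cano</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements so their contents do not end up in the text.
for tag in soup(["script", "style"]):
    tag.extract()

method = soup.get_text   # a bound method object; repr() of it is not the page text
text = soup.get_text()   # calling the method returns the text as a str
```

Writing repr(method) to a file records a description of the method object, whereas text holds the extracted string itself.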
