Get URL's plaintext data in Python

Question

I would like to get the plain text (i.e. no HTML tags or entities) from a given URL. What library should I use to do that as quickly as possible?

I've tried (maybe there is something faster or better than this):

import re
import mechanize
br = mechanize.Browser()
br.open("myurl.com")
vh = br.viewing_html
# <bound method Browser.viewing_html of <mechanize._mechanize.Browser instance at 0x01E015A8>>

Thanks

possible dupe of http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python — TerryA, Jul 21 '13 at 07:12
@ChristianCareaga, the demo didn't work. (Internal Server Error) — funerr, Jul 21 '13 at 07:19

score 1 · Accepted Answer · answered Jul 21 '13 at 07:29

You can use HTML2Text. If the site isn't working for you, you can go to the HTML2Text GitHub repo and get it for Python.

Or maybe try this:

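import urllib  # Python 2 module layout
from bs4 import BeautifulSoup

# Placeholder URL; urlopen needs the scheme (http:// or https://)
html = urllib.urlopen('http://myurl.com').read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
text = soup.get_text()
print text  # Python 2 print statement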

I don't know if it gets rid of all the JS and such, but it gets rid of the HTML.
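
For Python 3, here is a minimal sketch of the same idea. It assumes `bs4` is installed, and `html_to_text` and `url_to_text` are hypothetical helper names, not part of any library. Depending on the BeautifulSoup version, `get_text()` may include the contents of `<script>` and `<style>` tags, so this removes them explicitly first:

```python
from urllib import request

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents never reach the output.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # One text fragment per line, with surrounding whitespace stripped.
    return soup.get_text(separator="\n", strip=True)


def url_to_text(url):
    """Fetch a URL and return its plain text (hypothetical helper)."""
    with request.urlopen(url) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return html_to_text(html)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

You would call `url_to_text("http://myurl.com")` with a real URL in place of the placeholder.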

Do some Google searches; there are multiple other questions similar to this one.

Also, maybe take a look at Read2Text.

answered Jul 21 '13 at 07:29 by Serial

Works for Python 2, not Python 3 – Brian Spiering Sep 15 '18 at 21:32
@BrianSpiering what isn't working? I assume the print statement is failing, it should be `print(text)` in Python 3. Run `pip install bs4` to get BeautifulSoup. Other than that I have not tested this with Python 3, this question was answered in 2013.. – Serial Sep 17 '18 at 20:58

score 0 · Answer 2 · answered Sep 18 '18 at 18:13

In Python 3, you can fetch the HTML as bytes and then convert to a string representation:

from urllib import request

# urlopen needs the scheme (http:// or https://); the URL is a placeholder
text = request.urlopen('http://myurl.com').read().decode('utf8')
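
Note that decoding only gives you the raw HTML as a string; the tags are still in it. If you would rather not install anything, here is a rough standard-library sketch of stripping them with `html.parser`. `TextExtractor` and `strip_tags` are hypothetical names chosen for illustration, and this is not a full HTML-to-text converter:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_tags("<p>Hi <b>there</b> &lt;you&gt;</p><script>x()</script>"))
```

Passing the decoded `text` from above to `strip_tags` would give the page's text without the markup.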

answered Sep 18 '18 at 18:13 by Brian Spiering

OP wanted just the *text* from the web page, a.k.a with HTML tags stripped... Will this remove HTML? – Serial Sep 18 '18 at 18:58
