How do you convert Python's urllib2.urlopen() to text?

Question

I'm writing a program in Python that does the following:

Gets information from a web page.
Writes it to a .txt file.

I've used urllib2.urlopen() to get the HTML code, but what I want is the content of the page. In other words:

urllib2.urlopen() returns HTML. I want that HTML converted to plain text; I don't want the HTML code!

My program at the moment:

import urllib2
import time
url = urllib2.urlopen('http://www.dev-explorer.com/articles/using-python-httplib')
html = url.readlines()
for line in html:
    print line

time.sleep(5)

Well, it imports urllib2, and then it gets the HTML. That works, but I need text, not HTML! – AlexINF, Dec 08 '15 at 13:53
Even if it's 2 lines of code, it's still worth putting it in your question. – DainDwarf, Dec 08 '15 at 13:56

score 1 · Accepted Answer · edited May 23 '17 at 11:44

You have to use some method to read what you are opening:

import urllib2

url = urllib2.urlopen('someURL')
html = url.readlines()
for line in html:
    # At this point 'line' is already a str
    print line  # do something with it

Other methods are also available: read() and readline().
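To make the difference between the three read methods concrete, here is a minimal sketch in Python 3 syntax. It uses an in-memory io.BytesIO object as a stand-in for the object returned by urlopen(), which is an assumption made so that the example runs without a network connection. Note that in Python 3 these methods return bytes rather than str, so a decode step is needed; UTF-8 is assumed here.

```python
import io

# Stand-in for the response object returned by urlopen(); it is file-like,
# so it exposes the same read(), readline() and readlines() methods.
fake_response = io.BytesIO(b"<html>\n<body>Hello</body>\n</html>\n")

first_line = fake_response.readline()   # one line, as bytes
remaining = fake_response.readlines()   # list of the lines not yet consumed

fake_response.seek(0)                   # rewind (a real HTTP response cannot be rewound)
everything = fake_response.read()       # whole body as a single bytes object

text = everything.decode("utf-8")       # bytes -> str; UTF-8 is an assumption
```

Whichever method is used, the result is still HTML markup; none of them strips the tags.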

Edit:

As I said in one of my comments in this thread, maybe you need to use BeautifulSoup to scrape what you want. So, I think this was already solved here.

You have to install BeautifulSoup:

pip install beautifulsoup4

Then you have to do what is in the example:

from bs4 import BeautifulSoup
import urllib2
import re

html = urllib2.urlopen('someURL').read()
soup = BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

If you run into problems with non-ASCII characters, change str(element) to unicode(element) in the visible function.
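If installing a third-party package is not an option, the same idea (keep text nodes, skip the ones a browser does not display) can be sketched with the standard library alone. This is not the BeautifulSoup approach above but a swapped-in alternative using html.parser from Python 3. The names VisibleTextParser, HIDDEN_TAGS and html_to_text are made up for this illustration, and the sample HTML is invented.

```python
from html.parser import HTMLParser

# Tags whose text content a browser does not display (illustrative list).
HIDDEN_TAGS = {"script", "style", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments go to handle_comment instead, so they never show up here.
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample page, used instead of a real download.
sample = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><!-- hidden --><h1>Heading</h1><p>First paragraph.</p>
<script>var x = 1;</script></body></html>"""

plain = html_to_text(sample)
```

Here plain contains only the heading and the paragraph text. Real pages are messier than this sample (unclosed tags, text hidden via CSS), so treat this as a starting point rather than a full HTML-to-text converter.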

answered Dec 08 '15 at 14:02

pazitos10


Doesn't work for me, it just prints the HTML. I want to print the text only. – AlexINF Dec 08 '15 at 14:06
Do you want the plain text in the HTML? Maybe you can scrape it out with BeautifulSoup or some similar library – pazitos10 Dec 08 '15 at 14:08
So, that would convert the HTML to plain text? – AlexINF Dec 08 '15 at 14:13
Sorry, you are not being specific and I misunderstood what you want. Using the code above you get STRINGS like this: '....' not HTML, so you can treat it like a regular string value. Is that what you want to get? Or is it something else? – pazitos10 Dec 08 '15 at 14:17
I want to convert that HTML to plain text. – AlexINF Dec 08 '15 at 14:18
So can't you write those lines into a .txt file with write()? Can you edit this post and add a quick example? – pazitos10 Dec 08 '15 at 14:21
... .-. The problem is that when I scan the page, it gives me an HTML string. I want one that is in normal text! – AlexINF Dec 08 '15 at 17:12

score 0 · Answer 2 · answered Dec 08 '15 at 14:06

You could use the requests package, which is my preference over urllib. This returns all the HTML from the web page.

import io
import requests

response = requests.get('http://stackoverflow.com/questions/34157599/how-do-you-convert-pythons-urllib2-urlopen-to-text')

# io.open handles the encoding; the with block closes the file automatically
with io.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)
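For comparison, the same download-and-save step can be sketched with only the Python 3 standard library. The helper names fetch and save_text, the output file name, and the UTF-8 encoding are assumptions for illustration. The download function is defined but not called here, so the example runs offline; the save step is demonstrated on a literal string.

```python
import os
import tempfile
from urllib.request import urlopen


def fetch(url):
    """Download a page and decode it to str; UTF-8 is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def save_text(text, path):
    """Write text to path as UTF-8; the with block closes the file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Demonstrated on a literal string so that nothing is downloaded here.
out_path = os.path.join(tempfile.gettempdir(), "page_demo.txt")
save_text("caf\u00e9 au lait\n", out_path)
```

In real use this would be save_text(fetch(some_url), 'test.txt'). Like the requests version, it saves the HTML as-is; stripping the tags is a separate step.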

answered Dec 08 '15 at 14:06

johnfk3


But I don't want the HTML! I want the text! Just as if you were on the page! – AlexINF Dec 08 '15 at 14:08
So you want to scrape the text from a web page. As the other answer said, you should use BeautifulSoup or something like that – johnfk3 Dec 08 '15 at 14:10
So, that would convert the HTML to text? – AlexINF Dec 08 '15 at 14:12
Yes that will 'scrape' the text for you from whatever field in the html you want – johnfk3 Dec 08 '15 at 14:15
[link](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) Here is a detailed tutorial for you – johnfk3 Dec 08 '15 at 14:16
