3

I'm writing a Python program that does the following:

  • Gets information from a web page.
  • Writes it to a .txt file.

I've used urllib2.urlopen() to fetch the page, but it gives me the HTML code, and I want the information on the page instead. In other words:

urllib2.urlopen() returns HTML. I want that HTML converted to plain text; I don't want the HTML code!

My program at the moment:

import urllib2
import time
url = urllib2.urlopen('http://www.dev-explorer.com/articles/using-python-httplib')
html = url.readlines()
for line in html:
    print line

time.sleep(5)
AlexINF

2 Answers

1

You have to use some method to read what you are opening:

url = urllib2.urlopen('someURL')
html = url.readlines()
for line in html:
    #At this level you already have a str in 'line'
    #do something

Other methods are also available: read() and readline().
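
To illustrate how the three methods differ, here is a minimal sketch. It is written for Python 3 and uses io.BytesIO as a stand-in for the object that urlopen() returns (an assumption made so the example runs without network access); the name fake_response and the sample markup are made up for illustration.

```python
import io

# Stand-in for the file-like object returned by urlopen();
# the sample markup is invented for this sketch.
fake_response = io.BytesIO(b"<html>\n<body>Hello</body>\n</html>\n")

first_line = fake_response.readline()   # one line, including its newline
remaining = fake_response.readlines()   # list of the lines not yet consumed

# BytesIO can be rewound; a real network response generally cannot.
fake_response.seek(0)
whole_body = fake_response.read()       # the entire content as one bytes object
```

In every case the result is still HTML markup; these methods only control how much of the response is read at once.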

Edit:

As I said in one of my comments in this thread, you may need to use BeautifulSoup to scrape what you want. I think this was already solved here.

You have to install BeautifulSoup 4 (the package that provides the bs4 module):

pip install beautifulsoup4

Then you have to do what is in the example:

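from bs4 import BeautifulSoup
import urllib2
import re

# Fetch the page and parse it (naming the parser explicitly avoids a bs4 warning)
html = urllib2.urlopen('someURL').read()
soup = BeautifulSoup(html, 'html.parser')
# Collect every text node in the document
texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside tags that are never rendered on the page
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)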

If you run into problems with non-ASCII characters, change str(element) to unicode(element) in the visible function.
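
For comparison, the same idea (keep only the text a reader would see) can be sketched with nothing but the standard library. This is a rough Python 3 sketch that swaps in html.parser for BeautifulSoup, so it is not the method from the example above; VisibleTextParser, html_to_text, HIDDEN_TAGS and the sample markup are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are never shown to the reader
HIDDEN_TAGS = {"style", "script", "head", "title"}

class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that lies outside HIDDEN_TAGS."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments are routed to handle_comment, which is not
        # overridden here, so they never reach this method.
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)

def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up sample page, so the sketch runs without network access
sample = ("<html><head><title>T</title><style>p {color: red}</style></head>"
          "<body><!-- note --><h1>Hello</h1><p>World &amp; more</p>"
          "<script>var x = 1;</script></body></html>")
print(html_to_text(sample))
```

On a real page you would pass the decoded response body to html_to_text instead of sample. Real-world markup is often messier than this sample, which is one reason a dedicated library such as BeautifulSoup may still be the better choice.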

pazitos10
  • Doesn't work for me, it just prints the HTML. I want to print the text only. – AlexINF Dec 08 '15 at 14:06
  • Do you want the plain text in the HTML? Maybe you can scrape it out with BeautifulSoup or a similar library – pazitos10 Dec 08 '15 at 14:08
  • So, that would convert the HTML to plain text? – AlexINF Dec 08 '15 at 14:13
  • Sorry, you are not being specific and I misunderstood what you want. Using the code above you get STRINGS like this: '....' not HTML, so you can treat it like a regular string value. Is that what you want to get? Or is it something else? – pazitos10 Dec 08 '15 at 14:17
  • I want to convert that HTML to plain text. – AlexINF Dec 08 '15 at 14:18
  • So can't you write those lines into a .txt file with write()? Can you edit this post and add a quick example? – pazitos10 Dec 08 '15 at 14:21
  • ... .-. The problem is that when I scan the page, it gives me an HTML string. I want one that is in normal text! – AlexINF Dec 08 '15 at 17:12
0

You could use the requests package, which I prefer over urllib. The code below fetches all the HTML from the web page and writes it to a file.

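import io
import requests

response = requests.get('http://stackoverflow.com/questions/34157599/how-do-you-convert-pythons-urllib2-urlopen-to-text')

# io.open accepts an encoding on both Python 2 and 3;
# the with block closes the file, so no explicit close() is needed
with io.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)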
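
The file-writing step on its own can be sketched as follows, assuming UTF-8 is an acceptable encoding for the output. The helper save_text, the placeholder page_text (standing in for response.text) and the temporary output path are assumptions made so the example runs without network access.

```python
import io
import os
import tempfile

def save_text(path, text):
    """Hypothetical helper: write a text string to path as UTF-8."""
    # The with block closes the file on exit
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Placeholder for response.text, including one non-ASCII character
page_text = u"Caf\u00e9 menu\nSecond line\n"

# A temporary directory keeps the demo from touching the working directory
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_text(out_path, page_text)

# Read the file back to confirm what was written
with io.open(out_path, encoding="utf-8") as f:
    saved = f.read()
```

Naming the encoding explicitly avoids depending on the platform default when the page contains non-ASCII characters.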
johnfk3