I would like to save a web page (all of its content) as a text file, as if I had right-clicked on the page and chosen "Save Page As" -> "Save as text file" rather than saving it as an HTML file.

I have tried using the following code:

import urllib2
url=''
page = urllib2.urlopen(url)
page_content = page.read()
file = open('file_text.txt', 'w')
f.write(page_content)
f.close()

My goal is to save the whole text without any HTML code. (For example, I would like to read "è" instead of "&egrave;".)

Billal Begueradj
Skipper
  • Possible duplicate of [Rendered HTML to plain text using Python](http://stackoverflow.com/questions/13337528/rendered-html-to-plain-text-using-python) – pnovotnak Feb 03 '16 at 00:16
  • One thing - you open 'file', but write and close 'f'. Name needs to be consistent. – recurvata Aug 26 '16 at 14:10

1 Answer

Have a look at html2text, as mentioned elsewhere:

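import urllib2
import html2text  # third-party package, not part of the standard library

url = ''
page = urllib2.urlopen(url)
html_content = page.read()
# Convert the HTML markup to plain text
rendered_content = html2text.html2text(html_content)
# The with statement closes the file even if the write fails
with open('file_text.txt', 'w') as f:
    f.write(rendered_content)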
Max Bethke
pnovotnak
  • Hi pnovotnak, thank you! I saw the html2text library, but when I import it I get an error: `import html2text ~ ImportError: No module named html2text`. I use Python 2.7 on Windows and cannot figure out how to add the html2text library so that it works. (I also tried with Python 3.5 but had the same problem.) – Skipper Feb 03 '16 at 09:52
  • No worries :) You need to install it, since it's not part of the standard Python library. See here: http://python-packaging-user-guide.readthedocs.org/en/latest/installing/ – pnovotnak Feb 03 '16 at 20:22
  • I'm getting the error **TypeError: a bytes-like object is required, not 'str'** on the line *rendered_content = html2text...* – jeppoo1 Mar 11 '20 at 15:13
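
Both errors in the comments trace back to the environment rather than to the idea itself: html2text has to be installed separately, and under Python 3 `urllib2` no longer exists and `read()` returns bytes, which must be decoded to `str` before any text processing. As a rough illustration, here is a Python 3 sketch that avoids html2text altogether and relies only on the standard library's `html.parser`. This is a swapped-in technique, not what the answer uses, and its output will be plainer than html2text's. The names `TextExtractor`, `html_to_text`, `save_page_as_text` and the `fallback_encoding` parameter are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &eacute; into the corresponding character before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html_source):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return "\n".join(parser.chunks)


def save_page_as_text(url, path, fallback_encoding="utf-8"):
    """Download url, strip the markup, and write the text to path."""
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # bytes in Python 3
        # Use the charset from the HTTP headers when the server sends one;
        # the UTF-8 fallback is an assumption of this sketch.
        encoding = page.headers.get_content_charset() or fallback_encoding
    text = html_to_text(raw.decode(encoding, errors="replace"))
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
```

Calling `save_page_as_text(url, 'file_text.txt')` with a real URL would then write the stripped text to disk. If you do install html2text, the same decode step applies: pass it the decoded string rather than the raw bytes.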