-2

I have text files that contain HTML tags, which I want to remove using html2text in Python:

import html2text
html = open("textFileWithHtml.txt").read()
print html2text.html2text(html)

How can I write the output to a .txt file? I want to create a new text file without the HTML elements; the file does not exist yet.

adrCoder

2 Answers

3

You need to open another file for writing.

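import html2text

# html_file is the file object; html is the string read from it
html_file = open("textFileWithHtml.txt")
html = html_file.read()
html_file.close()

# Python 2: encode the converted text to UTF-8 bytes before writing
w = open("out.txt", "w")
w.write(html2text.html2text(html).encode('utf-8'))
w.close()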
Avinash Raj
  • You should probably use with – Padraic Cunningham Feb 19 '15 at 09:39
  • I am getting the following error: Traceback (most recent call last): File "removeHtml.py", line 4, in <module> w.write(html2text.html2text(html)) UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 5040: ordinal not in range(128) – adrCoder Feb 19 '15 at 09:41
  • Try `w.write(html2text.html2text(html).encode('utf-8'))`; see http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 – Avinash Raj Feb 19 '15 at 09:44
  • Thanks. Seems to work. I get an error: Traceback (most recent call last): File "removeHtml.py", line 5, in <module> html.close() AttributeError: 'str' object has no attribute 'close', but when I comment out html.close() it works. – adrCoder Feb 19 '15 at 09:48
3

You should open a file and write to it.

import html2text

# Open your file
with open("textFileWithHtml.txt", 'r') as f_html:
    html = f_html.read()

# Open a file and write to it
with open('your_file.txt', 'w') as f:
    f.write(html2text.html2text(html).encode('utf-8'))

It is good practice to use the with keyword when dealing with file objects.

It is more Pythonic, too.
For more on reading and writing files, see: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
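
As a minimal sketch of what with does for file objects: the file is closed when the block exits, even if the block raises. The path and variable names here are made up for illustration and are not part of the original code.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("hello\n")
    closed_inside = f.closed   # still open inside the block

closed_after = f.closed        # closed as soon as the block exits

# The file is also closed when the block exits through an exception
try:
    with open(path) as g:
        raise ValueError("simulated failure")
except ValueError:
    pass
closed_after_error = g.closed
```

This is why no explicit close() call is needed in the snippet above.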


Edit

If you have encoding issues, try using .encode('utf-8'); I've added it to my code snippet. See the Python Unicode HOWTO for more on this (https://docs.python.org/2/howto/unicode.html).
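
An alternative to calling .encode('utf-8') by hand is to let the file object do the encoding. The sketch below is not the code from this answer. It assumes io.open, which accepts an encoding argument on both Python 2 and Python 3, and it uses a hypothetical helper write_text and a hypothetical sample string instead of real html2text output, so that it runs without the library installed.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write unicode text to path, encoded as UTF-8 (hypothetical helper)."""
    # io.open returns a text-mode file that encodes on write, so the
    # caller passes unicode text and never calls .encode() itself.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


# u'\xae' is the character named in the UnicodeEncodeError in the comments
sample = u"Acme\xae widgets"
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_text(out_path, sample)

# Reading back with the same encoding returns the original text
with io.open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

With html2text, the text argument would be the value returned by html2text.html2text(html), assuming it returns unicode text as the traceback in the comments suggests.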

d6bels
  • I am getting this error : Traceback (most recent call last): File "removeHtmlPrintToFile.py", line 4, in w.write(html2text.html2text(html)) UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 5040: ordinal not in range(128) – adrCoder Feb 19 '15 at 09:43
  • Thanks d6bels you and Avinash both got it right with the encoding. – adrCoder Feb 19 '15 at 10:15