Let's say I have this HTML:

ul_tags = [u'<p>If you\u2019re in the pet food industry:</p><ul><li>What challenges do you face on a regular basis</li><li>What is your biggest struggle </li></ul>']

I want to write it to a text file so that it looks similar to how it would render on a web page:

[Image: the paragraph followed by a two-item bulleted list, as rendered in a browser]

I do:

import nltk
import codecs
with codecs.open('test.txt', 'a', encoding="utf8") as file:
    for tag in ul_tags:
        file.write(nltk.clean_html(tag) + '\n')

When that gets written to file it looks like this:

If you’re in the pet food industry: What challenges do you face on a regular basis What is your biggest struggle

It's just a line of text. What's the best way to make it look like its original structure on a web page?

Mika Schiller
  • http://stackoverflow.com/questions/9184107/how-can-i-force-pythons-file-write-to-use-the-same-newline-format-in-windows – ThatTechGuy Feb 21 '16 at 22:57
  • NLTK support for clean_html is disappearing. They recommend you use [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) instead. I'd suggest that you make a list of tags that deserve things like newlines, or newline plus indentation, and use BS to convert your text. – aghast Feb 21 '16 at 23:22
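
The suggestion in the last comment (pick the tags that deserve a newline or a newline plus indentation, then convert) could be sketched roughly as below. This is only one possible reading of it: it swaps in the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs, it targets Python 3, and the names `BLOCK_TAGS`, `BULLET`, `TextLayout` and `html_to_text` are made up for illustration, as is the choice of which tags start a new line.

```python
from html.parser import HTMLParser

# Assumption: these are the tags that should start a new line.
BLOCK_TAGS = {"p", "ul", "ol", "div", "br"}
# Assumption: list items get an indented bullet, like the web rendering.
BULLET = "  \u2022 "


class TextLayout(HTMLParser):
    """Collect text, breaking lines at block tags and bulleting <li> items."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []   # finished output lines
        self.parts = []   # text fragments of the line being built
        self.prefix = ""  # bullet to put in front of the next line, if any

    def _flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self.parts).split())
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS or tag == "li":
            self._flush()
        if tag == "li":
            self.prefix = BULLET

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS or tag == "li":
            self._flush()
        if tag == "li":
            self.prefix = ""  # do not leak a bullet out of an empty item

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with one line per paragraph or list item."""
    parser = TextLayout()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that no closing tag flushed
    return "\n".join(parser.lines)
```

With this in place, the loop in the question would call `file.write(html_to_text(tag) + '\n')` instead of `nltk.clean_html(tag)`, which should put the paragraph and each list item on its own line. Nested lists are not indented any further by this sketch.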

0 Answers