
I see that NLTK recommends using BeautifulSoup's get_text() to preprocess HTML into text for subsequent NLP analysis, but it does not seem to work very well. In the following example, xyz and abc are concatenated, but they should not be. Is there a better preprocessing utility for converting HTML to text for NLP applications?

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1:

html_doc = "<h2>xyz</h2><p>abc</p>"

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.get_text())
$ ./main.py 
xyzabc
user1424739
  • Did you see my answer? Can you mark it as correct if it works / answer the question or at least up-vote it? – amirouche Jul 08 '19 at 19:33

1 Answer


I recommend the html2text tool. Here is a test run from the command line:

$ html2text --ignore-links https://content.cultureandempire.com/chapter1.html 

  * Culture & Empire
  *   * __Introduction
  * __**1.** Preface 
  * __**2.** Chapter 1 - Magic Machines 
  * __**3.** Chapter 2 - Spheres of Light 
  * __**4.** Chapter 3 - Faceless Societies 
  * __**5.** Chapter 4 - Freedom in Chains 
  * __**6.** Chapter 5 - Eyes of the Spider 
  * __**7.** Chapter 6 - Wealth of Nations 
  * __**8.** Chapter 7 - March of the Kaiju 
  * __**9.** Chapter 8 - The Reveal 
  * __**10.** Postface 
  * __**11.** Appendix 1 
  *   * Published with GitBook 

#  __Culture & Empire

# Chapter 1. Magic Machines

> Far away, in a different place, a civilization called Culture had taken
seed, and was growing. It owned little except a magic spell called Knowledge.

In this chapter, I'll examine how the Internet is changing our society. It's
happening quickly. The most significant changes have occurred during just the
last 10 years or so. More and more of our knowledge about the world and other
people is transmitted and stored digitally. What we know and who we know are
moving out of our minds and into databases. These changes scare many people,
whereas in fact they contain the potential to free us, empowering us to
improve society in ways that were never before possible.

## From Bricks to Bits

Otherwise, you can use lxml.html.HtmlElement.text_content() or Python's textract package.
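
If you would rather avoid an extra dependency, the idea behind these tools (emit a line break at block-level elements so that adjacent headings and paragraphs stay separate) can be sketched with the standard library alone. This is only a rough illustration, not how html2text or lxml work internally: the names BLOCK_TAGS, BlockAwareTextExtractor and html_to_text are made up for this example, and the set of block-level tags is an assumption that you would likely need to extend. Note also that BeautifulSoup's get_text() accepts separator and strip arguments, which may already be enough for simple cases like the one in the question.

```python
from html.parser import HTMLParser

# Tags treated as block-level in this sketch. This set is an illustrative
# assumption, not a complete list of HTML block elements.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "tr", "table", "blockquote",
    "pre", "section", "article", "h1", "h2", "h3", "h4", "h5", "h6",
}


class BlockAwareTextExtractor(HTMLParser):
    """Collects text and inserts a newline around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Limitation: the contents of <script> and <style> are not filtered out.
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with one line per block-level element."""
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h2>xyz</h2><p>abc</p>"))
```

With the input from the question, xyz and abc end up on separate lines instead of being glued together, while inline tags such as <b> do not introduce a break.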

amirouche