I am trying to extract the text from an HTML page using the usual BeautifulSoup approach, following the code from another SO answer.

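import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly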

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

This extracts the text correctly for most pages. For some pages, however, like the one above, newlines appear between words inside a paragraph.

Result:

\nAt Orizon, we use our extensive consulting, management, technology and\nengineering capabilities to design, develop,\ntest, deploy, and sustain business and mission-critical solutions to government\nclients worldwide.\nBy using proven management and technology deployment\npractices, we enable our clients to respond faster to opportunities,\nachieve more from their operations, and ultimately exceed\ntheir mission requirements.\nWhere\nconverge\nTechnology & Innovation\n© Copyright 2019 Orizon Inc., All Rights Reserved.\n>'

In the result, a newline appears inside "technology and\nengineering", "develop,\ntest", etc.

All of this text is inside the same paragraph.

Viewed in the HTML source, it looks correct:

<p>
            At Orizon, we use our extensive consulting, management, technology and 
            engineering capabilities to design, develop, 
        test, deploy, and sustain business and mission-critical solutions to government 
            clients worldwide. 
    </p>
    <p>
            By using proven management and technology deployment 
            practices, we enable our clients to respond faster to opportunities, 
            achieve more from their operations, and ultimately exceed 
            their mission requirements.
    </p>

What is the reason for this, and how can I extract the text accurately?

eyllanesc
DGS

2 Answers

The newlines come from the HTML source itself: get_text() returns the text nodes verbatim, including the line breaks inside each <p>, so splitting the result on lines cuts paragraphs apart. Instead of splitting the text per line, split it per HTML tag, and strip the line breaks from the text inside each paragraph and title.

You can do that by iterating over all elements of interest (I included p, h2 and h1, but you can extend the list). For each element, collapse any newlines into spaces, then append a single newline so that there is a line break before the next element.
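The core step, collapsing every run of whitespace into a single space, can be seen in isolation on a plain string. This is only a minimal sketch: the helper name collapse_whitespace is made up for illustration, and raw is a shortened stand-in for what get_text() returns for the first paragraph.

```python
# Shortened stand-in for the text of the first <p>, line breaks included
# (an assumption for illustration, not fetched from the live page).
raw = (
    "\n            At Orizon, we use our extensive consulting, management, technology and \n"
    "            engineering capabilities to design, develop, \n"
    "        test, deploy, and sustain business\n    "
)


def collapse_whitespace(s):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining yields a single line.
    return " ".join(s.split())


flat = collapse_whitespace(raw)
print(flat)
```

Because split() is called without arguments, it also removes the indentation that precedes each source line, which is why no double spaces remain.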

Here's a working implementation:

import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# put text inside paragraphs and titles on a single line
for p in soup(['h1','h2','p']):
    p.string = " ".join(p.text.split()) + '\n'

text = soup.text
# remove duplicate newlines in the text
text = '\n\n'.join(x for x in text.splitlines() if x.strip())

print(text)

Output sample:

login

About Us

At Orizon, we use our extensive consulting, management, technology and engineering capabilities to design, develop, test, deploy, and sustain business and mission-critical solutions to government clients worldwide.

By using proven management and technology deployment practices, we enable our clients to respond faster to opportunities, achieve more from their operations, and ultimately exceed their mission requirements.

If you don't want a gap between paragraphs/titles, use:

text = '\n'.join(x for x in text.splitlines() if x.strip())
glhr

If you only want the content of paragraph tags, try this:

paragraph = soup.find('p').getText()
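Note that soup.find('p') returns only the first <p>, and getText() still keeps the line breaks from the source. A possible variant, sketched here on an inline HTML string that is a shortened stand-in for the page (an assumption, not the real markup), uses find_all and normalizes the whitespace of each paragraph:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page markup (assumed for illustration).
html = """
<p>
    At Orizon, we use technology and
    engineering capabilities.
</p>
<p>
    By using proven management
    practices, we enable our clients.
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every <p>; split()/join collapses the source
# line breaks and indentation inside each one into single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

# One paragraph per output line.
text = "\n".join(paragraphs)
print(text)
```

This keeps one line per paragraph and ignores everything outside <p> tags, so titles and other elements would need to be added to the find_all call if they are wanted.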
al76