Using lxml or ??? to extract information from webpages

Question

Currently I have the following code:

# Import the Python modules
import urllib
import lxml
import mechanize
import sys

# Open a connection to the URL
try:
    URL = urllib.urlopen("http://...")

except:
    print "Connection to the URL failed"
    sys.exit(0)

# Read the source code of the URL
URL_quellcode = URL.readlines()

# Close the connection to the URL
URL.close()

So far so good, I can open and read the source of a URL. Now I want to look through various possibilities to extract something.

Possibility 1: <p class="author-name">Some Name</p>
Possibility 2: rel="author">Some Name</a>

I want to extract the author name. My logic would be the following:

Check all classes for "author-name"; if found, give me the text inside the tag. If not found, check for rel="author"; if found, give me the text inside the tag. If neither is found, print "No Author Found".

How would I do that? I can use regex, lxml, or whatever. What would be the most elegant way?

score 3 · Accepted Answer · answered Oct 06 '14 at 13:25


Use BeautifulSoup.

from bs4 import BeautifulSoup

document_a = """
<html>
    <body>
        <p class="author-name">Some Name</p>
    </body>
</html>
"""

document_b = """
<html>
    <body>
        <p rel="author-name">Some Name</p>
    </body>
</html>
"""

def get_author(document):
    # Parse the document passed in, not the module-level document_a
    soup = BeautifulSoup(document)
    p = soup.find(class_="author-name")
    if not p:
        p = soup.find(rel="author-name")
        if not p:
            return "No Author Found"
    return p.text

print "author in first document:", get_author(document_a)
print "author in second document:", get_author(document_b)

Result:

author in first document: Some Name
author in second document: Some Name
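As a side note, the same fallback order can be sketched with only the standard library, in case installing BeautifulSoup is not an option. This is a rough sketch written for Python 3 (unlike the Python 2 snippets above); `AuthorFinder` and `find_author` are hypothetical names, and it matches `rel="author"` as in the question rather than `rel="author-name"`. It simply takes the first non-empty text chunk after a matching start tag, so nested markup inside the author tag is not handled.

```python
from html.parser import HTMLParser


class AuthorFinder(HTMLParser):
    """Remembers the first text that follows a matching start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.by_class = None  # text after the first class="author-name" tag
        self.by_rel = None    # text after the first rel="author" tag
        self._slot = None     # attribute that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Attribute values may be None (valueless attributes), hence "or ''"
        if "author-name" in (attrs.get("class") or "").split():
            self._slot = "by_class"
        elif "author" in (attrs.get("rel") or "").split():
            self._slot = "by_rel"

    def handle_data(self, data):
        text = data.strip()
        if self._slot and text:
            # Keep only the first match for each kind of tag
            if getattr(self, self._slot) is None:
                setattr(self, self._slot, text)
            self._slot = None


def find_author(html):
    finder = AuthorFinder()
    finder.feed(html)
    finder.close()
    # The class match wins over the rel match, as described in the question
    return finder.by_class or finder.by_rel or "No Author Found"
```

Calling `find_author(html)` on a page's source returns the author text or the "No Author Found" placeholder.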


Kevin


Awesome, works like a charm. I've started with BS now, really fun! Anyway, I was wondering how this would work with an unknown number of URLs. I'll be loading them from a .txt file, so I can't name them document_a, document_b, document_c and so on. Basically, the output would then be a list of (URL, author name) pairs produced with one print operation. – eLudium Oct 06 '14 at 15:10
In that case, you'd do something like `print [(url, get_author(get_document(url))) for url in my_file]`. You'll have to write a `get_document` function that retrieves the HTML data from a given url. – Kevin Oct 06 '14 at 15:30
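One possible shape for the batch version discussed in the comments is sketched below. It is written for Python 3, so it uses `urllib.request` instead of the Python 2 `urllib.urlopen` from the question. `collect_authors` and its `fetch` parameter are hypothetical names, `get_author` is expected to be the function from the answer, and decoding the page as UTF-8 is an assumption that will not hold for every site.

```python
from urllib.request import urlopen


def get_document(url):
    """Download a page and return its HTML as text (UTF-8 is assumed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_authors(path, get_author, fetch=get_document):
    """Return a list of (url, author) pairs, one per non-empty line of the file."""
    pairs = []
    with open(path) as url_file:
        for line in url_file:
            url = line.strip()  # drop the trailing newline
            if url:             # skip blank lines
                pairs.append((url, get_author(fetch(url))))
    return pairs
```

With a file such as `urls.txt` holding one URL per line, `print(collect_authors("urls.txt", get_author))` would print the whole list in one operation. Passing the download function as `fetch` keeps the loop easy to try out without network access.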
