currently I have the following code:
# Import der Pythonmodule
import urllib
import lxml
import mechanize
import sys
# Verbindung zum URL aufbauen
try:
URL = urllib.urlopen("http://...")
except:
print "Verbindung zum URL fehlgeschlagen"
sys.exit(0)
# Quellcode des URL lesen
URL_quellcode = URL.readlines()
# Verbindung zum URL beenden
URL.close()
So far so good, I can open and read the source of an URL. Now I want to look through various possibilities to extract something.
Possibility 1:
<p class="author-name">Some Name</p>
Possibility 2:
rel="author">Some Name</a>
I want to extract the author name. My logic would be the following:
Check all classes for "author-name" - if found give me the text inside the tag. If not found check for "rel="author" - if found give me the text inside the tag. If not print "No Author Found"
How would I do that? I can use regex, lxml, or whatever. What would be the most elegant way?