Find Hyperlinks in Text using Python (Follow-up to another post)

Question

I have a follow-up question to the post "Extracting a URL in Python". Note: I'm new to SO and Python, so feel free to correct me on etiquette.

I pulled the regex from the above post and this works fine for me:

import re

myString = """ <iframe width="640" height="390" src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0" allowfullscreen></iframe> """
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")

However, what I really need to do is loop through a data set that I previously retrieved from a database. So I did the following, which gives me a strange error (shown after the code).

# Note: "data" here is actually a list of strings, not a data set
for pseudo_url in data:
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")

Error:

Traceback (most recent call last):
  File "find_and_email_bad_press_urls.py", line 136, in <module>
    main()
  File "find_and_email_bad_press_urls.py", line 14, in main
    scrubbed_urls = extract_urls_from_raw_data(raw_url_data)
  File "find_and_email_bad_press_urls.py", line 47, in extract_urls_from_raw_data
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
AttributeError: 'NoneType' object has no attribute 'group'

When I Google this I find tons of irrelevant posts, so I was hoping SO could shed some light. My hunch is that the regex is blowing up on some null data, special character, etc., but I don't know enough about Python to figure it out. Casting to a string didn't help either.

Any ideas or workarounds to power through this would be much appreciated!

I suggest you try the BeautifulSoup module for scraping data from HTML pages. Your error says that the regex returned no matches, and therefore a `None` object, which has no `group` attribute. — Blender, Jun 06 '12 at 02:50

alan · Answer 1 · 2012-06-06T02:59:11.363

2

Your regex is not finding a URL in every string in data. When there is no match, re.search returns None, so check that you have a match before calling group:

for pseudo_url in data:
    # A raw string keeps \s intact for the regex engine.
    m = re.search(r"(?P<url>https?://[^\s]+)", pseudo_url)
    if m:  # m is None when the string contains no URL
        print(m.group("url"))

You don't need the call to str() either if pseudo_url is already a string.
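
For reference, here is a self-contained version of that loop, as a minimal sketch for Python 3. The sample rows in `data` are made up (the real ones came from a database), and the helper name `extract_urls` is hypothetical. It also shows why casting with `str()` did not help: `str(None)` is just the string "None", which contains no URL, so `re.search` still returns `None`. One deliberate change from the pattern above: the character class also excludes a double quote, so a URL inside `src="..."` stops before the closing quote.

```python
import re

# Same idea as the pattern above, except that it also stops at a double
# quote, so a URL inside src="..." does not pick up the closing quote.
URL_RE = re.compile(r'(?P<url>https?://[^\s"]+)')


def extract_urls(rows):
    """Return the first URL found in each row, skipping rows without one."""
    urls = []
    for row in rows:
        match = URL_RE.search(str(row))  # str(None) is "None": no URL there
        if match is None:
            continue  # no match, so there is nothing to call .group() on
        urls.append(match.group("url"))
    return urls


# Made-up sample rows standing in for the database results.
data = [
    '<iframe src="http://www.youtube.com/embed/24WIANESD7k?rel=0"></iframe>',
    None,
    "no link in this row",
    "see https://example.com/page for details",
]

print(extract_urls(data))
```

Rows without a URL are skipped silently here; if you need to know which rows failed, you could collect them in a second list instead of using `continue`.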

And as @Blender suggested in his comment, if data is really lines read from an HTML file, you may want to consider using Beautiful Soup instead of regex for this.
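
To give an idea of what the parser-based route looks like, here is a rough sketch for Python 3. Note that it uses the standard library's `html.parser` rather than Beautiful Soup, only so that it runs without installing anything; the names `LinkCollector` and `links_in` are made up for this example. It reads the `href` and `src` attributes directly instead of pattern-matching the raw text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from any tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes without a value, such as allowfullscreen.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def links_in(html):
    """Return every href/src value found in an HTML fragment."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


snippet = (
    '<iframe width="640" height="390" '
    'src="http://www.youtube.com/embed/24WIANESD7k?rel=0" '
    'frameborder="0" allowfullscreen></iframe>'
)
print(links_in(snippet))
```

A string with no tags simply yields an empty list, so there is no `None` to guard against.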

edited Jun 06 '12 at 02:59

answered Jun 06 '12 at 02:54


Thanks Blender & alan - that worked like a charm. As I'm new to Python, I have no idea what the group function does. Obviously if you're trying to perform an operation on null data you're going to have issues! Now I just need to learn a little about group to feel comfortable. Thanks again! – Adam Jun 06 '12 at 03:01
@Adam: Learn to love the docs: http://docs.python.org/library/re.html#re.MatchObject.group - and for the record, regex is not the best way to go about this. BeautifulSoup or another parser is a better solution. – Daenyth Jun 06 '12 at 03:53
@Adam glad it helped. If this answered your question, you might consider accepting the answer. – alan Jun 06 '12 at 13:47
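
For anyone else wondering what group does: a small sketch for Python 3, using a made-up input string. `re.search` returns a match object on success, and `group` pulls pieces out of that match; on failure `re.search` returns `None`, which is why calling `group` on the result raised the AttributeError in the question.

```python
import re

# Made-up input string, for illustration only.
match = re.search(r"(?P<url>https?://[^\s]+)", "visit http://example.com/a now")

print(match.group(0))      # the whole text matched by the pattern
print(match.group("url"))  # the group named "url"
print(match.group(1))      # the same group, addressed by position
print(match.groupdict())   # all named groups as a dict

# With no match there is no match object at all, only None.
print(re.search(r"https?://", "no link here"))
```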
