With regard to the post "Extracting a URL in Python", I have a follow-up question. Note: I'm new to SO and Python, so feel free to correct me on etiquette.

I pulled the regex from the above post and this works fine for me:

import re

myString = """ <iframe width="640" height="390" src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0" allowfullscreen></iframe> """
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")

However, what I really need to do is loop through data that I previously retrieved from a database. So I tried the following, which raises the error shown after it.

# Note: "data" here is actually a list of strings, not a data set     
for pseudo_url in data:
        print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")

Error:

Traceback (most recent call last):
  File "find_and_email_bad_press_urls.py", line 136, in <module>
    main()
  File "find_and_email_bad_press_urls.py", line 14, in main
    scrubbed_urls = extract_urls_from_raw_data(raw_url_data)
  File "find_and_email_bad_press_urls.py", line 47, in extract_urls_from_raw_data
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
AttributeError: 'NoneType' object has no attribute 'group'

When I Google this I find tons of irrelevant posts, so I was hoping SO could shed some light. My hunch is that the regex is blowing up on some null data, special character, etc., but I don't know enough about Python to figure it out. Casting to a string didn't help either.

Any ideas or workarounds to power through this would be much appreciated!

Adam
  • I suggest you try the BeautifulSoup module for scraping data from HTML pages. Your error says that the regex returned no matches, and therefore a `None` object, which has no `group` attribute. – Blender Jun 06 '12 at 02:50

1 Answer

Your regex is not finding a URL in every string in data. When there is no match, re.search returns None, and None has no group method, which is what the AttributeError is telling you. Check that you have a match before calling group:

for pseudo_url in data:
    # Raw string keeps \s intact; re.search returns None when nothing matches
    m = re.search(r"(?P<url>https?://[^\s]+)", pseudo_url)
    if m:
        print(m.group("url"))

You don't need the call to str() either if pseudo_url is already a string.
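If it helps, here is a minimal sketch of the same check wrapped in a function. It assumes data is a list that may hold strings without a URL, or non-string values such as None coming from NULL database columns. The name extract_urls and the sample list are hypothetical, and the syntax is Python 3.

```python
import re

# Same pattern as above, compiled once; the raw string keeps \s intact.
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")


def extract_urls(data):
    """Return (urls, skipped): the first URL of each matching item,
    and the items in which no URL was found."""
    urls, skipped = [], []
    for item in data:
        # Non-string values (e.g. None) cannot be searched, so treat them as no match.
        match = URL_PATTERN.search(item) if isinstance(item, str) else None
        if match:
            urls.append(match.group("url"))
        else:
            skipped.append(item)
    return urls, skipped


# Hypothetical sample standing in for the rows read from the database.
sample = [
    "see http://example.com/a for details",
    "no link here",
    None,
]
urls, skipped = extract_urls(sample)
print(urls)
print(skipped)
```

Returning the skipped items as well makes it easy to inspect which rows tripped up the regex instead of silently dropping them.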

And as @Blender suggested in his comment, if data is really lines read from an HTML file, you may want to consider using Beautiful Soup instead of regex for this.
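One reason a parser is safer: with the iframe string from the question, [^\s]+ keeps matching until the next whitespace, so the captured URL ends with the closing double quote of the src attribute. The sketch below is not Beautiful Soup; it uses the standard library's html.parser as a stand-in so it runs without installing anything. The class name SrcCollector is hypothetical, and it assumes the URLs you want live in iframe src attributes.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect the src attribute of every iframe tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "iframe":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


html = (
    '<iframe width="640" height="390" '
    'src="http://www.youtube.com/embed/24WIANESD7k?rel=0" '
    'frameborder="0" allowfullscreen></iframe>'
)
parser = SrcCollector()
parser.feed(html)
print(parser.sources)
```

Because the parser hands back the attribute value itself, there is no trailing quote to strip afterwards.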

alan
  • Thanks Blender & alan - that worked like a charm. As I'm new to Python, I have no idea what the group function does. Obviously if you're trying to perform an operation on null data you're going to have issues! Now I just need to learn a little about group to feel comfortable. Thanks again! – Adam Jun 06 '12 at 03:01
  • @Adam: Learn to love the docs: http://docs.python.org/library/re.html#re.MatchObject.group - and for the record, regex is not the best way to go about this. BeautifulSoup or another parser is a better solution. – Daenyth Jun 06 '12 at 03:53
  • @Adam glad it helped. If this answered your question, you might consider accepting the answer. – alan Jun 06 '12 at 13:47
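On the question of what group does, a small sketch may help. The pattern and the sample sentence here are made up for illustration (they are not the regex from the question), and the syntax is Python 3. A match object exposes the text captured by each parenthesized group, by number or by name, while a failed search gives back None instead of a match object.

```python
import re

# Hypothetical pattern with two named groups, for illustration only.
pattern = r"(?P<scheme>https?)://(?P<host>[^/\s]+)"

m = re.search(pattern, "go to http://example.com/page now")
whole = m.group(0)          # group 0 is always the entire matched text
scheme = m.group("scheme")  # look up a group by its name
host = m.group(2)           # or by its position, counting from 1
everything = m.groups()     # tuple of all captured groups

print(whole, scheme, host, everything)

# No match: re.search returns None, so there is nothing to call group on.
missing = re.search(pattern, "no link in this text")
print(missing)
```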