
Suppose, for example, there is a video website with a search option:

http://example.com/search=query

and it returns all the search results in this form:

<a href="LinkToVideo"</a><img src="ImageSource" alt="AltDescription"><b>VideoName</b>

I want to use this data, so I send a request to the website and then use re to return a list with LinkToVideo, ImageSource, AltDescription and VideoName:

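import re
import urllib2

def search(query):
    response = urllib2.urlopen("http://example.com/search=" + query)
    resp = response.read()
    # Each match is a tuple: (link, image source, alt text, video name)
    search_list = re.compile('<a href="(.+?)"</a><img src="(.+?)" alt="(.+?)"><b>(.+?)</b>').findall(resp)
    return search_list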

and it returns a list like this:

[('example.com/video1.mp4', 'example.com/image1.jpg', 'blah blah ', 'Cats'),('example.com/video2.mp4', 'example.com/image2.jpg', 'blah', 'Dogs'),('example.com/video3.mp4', 'example.com/image3.jpg', 'blah blah blah', 'Zebra')]

The problem is that I don't need the alt description, but its value changes from result to result, so I can't hard-code it in the pattern.

I want the list to look like this:

[('example.com/video1.mp4', 'example.com/image1.jpg', 'Cats'), ('example.com/video2.mp4', 'example.com/image2.jpg', 'Dogs'), ('example.com/video3.mp4', 'example.com/image3.jpg','Zebra')]

I know I could just ignore the extra field, but on the real site (this is just an example) the list is much bigger and there is more data I need to ignore.

I searched Google and didn't find a solution. I'm sorry if the title doesn't describe the problem exactly.

Thanks

user3611091

1 Answer


Use a non-capturing group ((?:…)) like this:

'<a href="(.+?)"</a><img src="(.+?)" alt="(?:.+?)"><b>(.+?)</b>'

Or just get rid of the group entirely:

'<a href="(.+?)"</a><img src="(.+?)" alt=".+?"><b>(.+?)</b>'
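As a rough illustration, here is how both patterns behave on a small hard-coded sample. The `html` string is made up to mirror the format in the question (it is not real output from any site), and the snippet is written for Python 3. Since `findall` only returns the capturing groups, the alt text is still matched but never appears in the result:

```python
import re

# Made-up sample in the same (unusual) format as the question's example.
html = (
    '<a href="example.com/video1.mp4"</a>'
    '<img src="example.com/image1.jpg" alt="blah blah "><b>Cats</b>'
    '<a href="example.com/video2.mp4"</a>'
    '<img src="example.com/image2.jpg" alt="blah"><b>Dogs</b>'
)

# Variant 1: the alt text sits in a non-capturing group.
non_capturing = re.compile(
    '<a href="(.+?)"</a><img src="(.+?)" alt="(?:.+?)"><b>(.+?)</b>'
)

# Variant 2: no group at all around the alt text.
no_group = re.compile(
    '<a href="(.+?)"</a><img src="(.+?)" alt=".+?"><b>(.+?)</b>'
)

# findall returns one tuple per match, containing only the captured groups.
results = non_capturing.findall(html)
results_no_group = no_group.findall(html)

print(results)
```

Both variants should give the same list of 3-tuples; the second is simply less noisy when the group serves no other purpose.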

I should also point out that using regular expressions to parse arbitrary HTML is a pretty bad idea and has been known to cause madness. I'd strongly recommend using a proper HTML parser instead.
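For a sense of what the parser route might look like, here is a minimal sketch using `html.parser` from the Python 3 standard library. It assumes well-formed markup in which each result is an `<a>` wrapping an `<img>` and a `<b>`, which differs from the snippet in the question; the class name `VideoParser` and the sample string are hypothetical, so treat this as a starting point rather than a drop-in solution:

```python
from html.parser import HTMLParser


class VideoParser(HTMLParser):
    """Collects (link, image source, video name) tuples; alt text is ignored."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._link = None
        self._image = None
        self._in_b = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._link = attrs.get("href")
        elif tag == "img":
            # Only the src attribute is kept; alt is simply never read.
            self._image = attrs.get("src")
        elif tag == "b":
            self._in_b = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False

    def handle_data(self, data):
        # Text inside <b> is the video name; emit one tuple per result.
        if self._in_b and self._link and self._image:
            self.results.append((self._link, self._image, data))
            self._link = self._image = None


# Assumed, well-formed sample markup (not taken from a real site).
sample = (
    '<a href="example.com/video1.mp4">'
    '<img src="example.com/image1.jpg" alt="blah blah "><b>Cats</b></a>'
    '<a href="example.com/video2.mp4">'
    '<img src="example.com/image2.jpg" alt="blah"><b>Dogs</b></a>'
)

parser = VideoParser()
parser.feed(sample)
print(parser.results)
```

With a parser, ignoring a field means not reading that attribute, so extra attributes or a different attribute order on the real site would not require changing a pattern.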

p.s.w.g