Suppose, for example, that a video website has a search option:
http://example.com/search=query
and it returns all the search results in this form:
<a href="LinkToVideo"</a><img src="ImageSource" alt="AltDescription"><b>VideoName</b>
I want to use this data, so I send a request to the website and then use re to return a list with LinkToVideo, ImageSource, AltDescription and VideoName:
import re
import urllib2
def search(query):  # wrapper name is illustrative; it makes the return valid
    response = urllib2.urlopen("http://example.com/search=" + query)
    resp = response.read()
    search_list = re.compile('<a href="(.+?)"</a><img src="(.+?)" alt="(.+?)"><b>(.+?)</b>').findall(resp)
    return search_list
and it returns a list like this:
[('example.com/video1.mp4', 'example.com/image1.jpg', 'blah blah ', 'Cats'),('example.com/video2.mp4', 'example.com/image2.jpg', 'blah', 'Dogs'),('example.com/video3.mp4', 'example.com/image3.jpg', 'blah blah blah', 'Zebra')]
The problem is that I don't need the alt description, but its value changes from result to result, so I can't hard-code it in the pattern.
I want the list to look like this:
[('example.com/video1.mp4', 'example.com/image1.jpg', 'Cats'),
('example.com/video2.mp4', 'example.com/image2.jpg', 'Dogs'),
('example.com/video3.mp4', 'example.com/image3.jpg', 'Zebra')]
I know I can just ignore the extra field, but on the real site (this is just an example) the list is much bigger and there is more data I need to ignore.
I searched Google and didn't find a solution. I'm sorry if the title doesn't describe the problem exactly.
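To make the goal concrete, here is a minimal sketch of the behaviour being asked for, assuming the markup always has the shape shown above. It relies on the fact that findall only returns the parts of a pattern wrapped in parentheses, so the alt value is matched without a group. The sample string is made up for illustration and stands in for the real response; the `[^"]*` for the alt value is an assumption that the attribute never contains a double quote.

```python
import re

# Made-up sample in the same shape as the search results above;
# in the real code this would be the text read from the response.
sample = (
    '<a href="example.com/video1.mp4"</a>'
    '<img src="example.com/image1.jpg" alt="blah blah "><b>Cats</b>'
    '<a href="example.com/video2.mp4"</a>'
    '<img src="example.com/image2.jpg" alt="blah"><b>Dogs</b>'
)

# Only parenthesised groups are returned by findall. The alt value is
# matched by [^"]* with no parentheses, so it is consumed but not captured.
pattern = re.compile(
    r'<a href="(.+?)"</a><img src="(.+?)" alt="[^"]*"><b>(.+?)</b>'
)

search_list = pattern.findall(sample)
print(search_list)
```

If a part of the pattern needs parentheses for grouping but should still be left out of the result, the non-capturing form `(?:...)` can be used in the same way.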
Thanks