0

I have to parse this HTML:

<a href="rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"><img src="http://i.ytimg.com/vi/IQY6jukWn-o/default.jpg?w=80&amp;h=60&amp;sigh=izeIwhz4POtPOOr-jRGrtC4qiFA" alt="video" width="80" height="60" style="border:0;margin:0px;" /></a>

I am looking for all the links ending with .3gp.

I am using BeautifulSoup and it is driving me mad. Many things don't work; for example, searching for a specific text always returns an empty list.

Have tried:

comment = soup.find(text=re.compile(".3gp")) 
jack

3 Answers

2

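When you search for text, you are looking for all of the NavigableString objects that match your regular expression. The URL you want is in the href attribute of the <a> tag, not in its text, so a text search finds nothing. Note also that the pattern ".3gp" matches any character followed by a 3, a g and a p; use \.3gp if you want to match .3gp literally with a regex, and \.3gp$ to match it only at the end of the string.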
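To make the difference between the unescaped and the escaped, anchored pattern concrete, here is a small sketch using only the standard re module. The sample URLs are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample strings, for illustration only.
candidates = [
    "rtsp://example.com/0/0/0/video.3gp",     # ends with a literal ".3gp"
    "rtsp://example.com/clip_3gp/video.mp4",  # "_3gp" appears mid-string
    "rtsp://example.com/video.3gp?start=10",  # ".3gp" present, but not at the end
]

loose = re.compile(".3gp")      # "." matches any single character
strict = re.compile(r"\.3gp$")  # literal dot, anchored at the end of the string

loose_hits = [s for s in candidates if loose.search(s)]
strict_hits = [s for s in candidates if strict.search(s)]

print(len(loose_hits))  # the loose pattern accepts all three strings
print(strict_hits)      # the strict pattern keeps only the first one
```

The loose pattern accepts every string that contains "3gp" preceded by any character, which is probably broader than "links ending with .3gp".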

Use soup.findAll and search for any <a> tags with an href that match what you want in this way:

soup.findAll('a', attrs={'href': re.compile(r"\.3gp$")})
# or
soup.findAll('a', href=re.compile(r"\.3gp$"))

See the section "The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)" in the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
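If BeautifulSoup is not at hand, the same idea (look at the href attribute of each <a> tag rather than at the text) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup API; the class name ThreeGpLinkCollector and the sample HTML are assumptions made up for this example.

```python
from html.parser import HTMLParser


class ThreeGpLinkCollector(HTMLParser):
    """Collect href values of <a> tags that end with '.3gp' (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".3gp"):
                self.links.append(value)


# Hypothetical markup shaped like the snippet in the question.
html = (
    '<a href="rtsp://example.com/0/0/0/video.3gp">'
    '<img src="http://example.com/default.jpg" alt="video" /></a>'
    '<a href="http://example.com/page.html">other</a>'
)

collector = ThreeGpLinkCollector()
collector.feed(html)
links = collector.links
print(links)
```

Using str.endswith here avoids the regex escaping question entirely, at the cost of not handling URLs with a query string after the extension.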

Sean Vieira
video_tags = page.findAll('a')
video_list = []
for video_tag in video_tags:
    url = video_tag.get('href')
    video_list.append(url)
– jack Mar 04 '11 at 03:13
0

For this particular problem, regular expressions are probably good enough. I know about "RegEx match open tags except XHTML self-contained tags" (the first answer is awesome), but this problem seems like a quick hack needed to do something totally different.

In [1]: import re

In [2]: a = """...THE TEXT YOU PASTED.."""

In [3]: re.findall(r'"[^"]*\.3gp"', a)
Out[3]: ['"rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"']
Aditya Mukherji
0

Pyparsing's makeHTMLTags expression will give you results similar to regex, but with automatic results names (like named groups), and tolerance of many HTML idiosyncrasies:

>>> from pyparsing import makeHTMLTags
>>>
>>> h = """<a href="rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"><img src="http://i.ytimg.com/vi/IQY6jukWn-o/default.jpg?w=80&amp;h=60&amp;sigh=izeIwhz4POtPOOr-jRGrtC4qiFA" alt="video" width="80" height="60" style="border:0;margin:0px;" /></a>"""
>>>
>>> aTag = makeHTMLTags("A")[0]
>>> result = aTag.parseString(h)
>>> print(result.dump())
['A', ['href', 'rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp'], False]
- empty: False
- href: rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp
- startA: ['A', ['href', 'rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp'], False]
  - empty: False
  - href: rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp
>>> print(result.href)
rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp

If you have many anchor tags, and just want those ending in ".3gp" then do:

>>> _3gp_links = [a.href for a in aTag.searchString(h) if a.href.endswith(".3gp")]
PaulMcG