Trying to parse HTML using BeautifulSoup, but it is not working

Question

I have to parse this HTML:

<a href="rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"><img src="http://i.ytimg.com/vi/IQY6jukWn-o/default.jpg?w=80&h=60&sigh=izeIwhz4POtPOOr-jRGrtC4qiFA" alt="video" width="80" height="60" style="border:0;margin:0px;" /></a>

I am looking for all the links ending with .3gp.

I am using BeautifulSoup and it is driving me mad: many things don't work. For example, if you search for a specific text, it always returns an empty list.

Have tried:

comment = soup.find(text=re.compile(".3gp"))

comment = soup.find(text=re.compile(".3gp")) — jack, Mar 03 '11 at 23:02

score 2 · Accepted Answer · answered Mar 03 '11 at 23:25

When you search for text, you are looking for all of the NavigableString objects that match your regular expression. The URL you want is the value of an href attribute, not a text node, so that search finds nothing. Note also that the pattern .3gp matches any character followed by a 3, a g and a p; use \.3gp if you want to match .3gp literally with a regex.
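To make the point about the unescaped dot concrete, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration and are not from the question.

```python
import re

# "." is a wildcard, so this matches ANY character followed by "3gp"
loose = re.compile(".3gp")
# Escaped dot plus "$": a literal ".3gp" at the end of the string
strict = re.compile(r"\.3gp$")

# Hypothetical sample values, chosen to show the difference
samples = ["video.3gp", "x3gp", "video.3gp.bak"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all three samples, while the strict one accepts only the string that really ends in .3gp.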

Use soup.findAll and search for any <a> tags with an href that matches what you want, like this:

soup.findAll('a', attrs={'href': re.compile(r"\.3gp$")})
# or
soup.findAll('a', href=re.compile(r"\.3gp$"))

SEE: http://www.crummy.com/software/BeautifulSoup/documentation.html#The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)
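For readers on a current install, the same idea might look like the sketch below. It assumes BeautifulSoup 4 (the bs4 package, where findAll is spelled find_all and the text search argument is string), which is newer than the version discussed in this thread. The example.invalid URLs are placeholders, not the real links from the question.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Placeholder markup shaped like the snippet in the question
html = (
    '<a href="rtsp://example.invalid/0/0/0/video.3gp">'
    '<img src="http://example.invalid/default.jpg" alt="video" /></a>'
    '<a href="http://example.invalid/page.html">other</a>'
)

soup = BeautifulSoup(html, "html.parser")

# Text search only looks at text nodes; the URL is an attribute value,
# so nothing matches here.
text_hit = soup.find(string=re.compile(r"\.3gp"))

# Filtering <a> tags on their href attribute finds the link.
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"\.3gp$"))]

print(text_hit)
print(links)
```

The text search comes back empty because the only text node in this markup is "other", while the attribute filter returns the single .3gp link.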

'video_tags = page.findAll('a') video_list=[] for video_tag in video_tags: url = video_tag.get('href') video_list.append(url)' — jack, Mar 04 '11 at 03:13

score 0 · Answer 2 · edited May 23 '17 at 12:33

For this particular problem, regular expressions are probably good enough. I know about "RegEx match open tags except XHTML self-contained tags" (the first answer is awesome), but this problem seems to need only a quick hack on the way to doing something totally different.

In [1]: import re

In [2]: a = """...THE TEXT YOU PASTED.."""

In [3]: re.findall('".*?3gp"', a)
Out[3]: ['"rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"']
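A slightly tighter variant, offered as a sketch and not as a general HTML parser: a capture group returns the URL without the surrounding quotes, and [^"]* keeps the match inside a single attribute value. It assumes double-quoted href attributes, and the example.invalid URLs are placeholders.

```python
import re

# Placeholder markup shaped like the snippet in the question
html = (
    '<a href="rtsp://example.invalid/0/0/0/video.3gp">'
    '<img src="http://example.invalid/default.jpg" alt="video" /></a>'
)

# Capture only the attribute value; the dot is escaped so it is literal
urls = re.findall(r'href="([^"]*\.3gp)"', html)
print(urls)
```

If the page is fetched from a URL, the same call works on the downloaded text once it has been decoded to a string.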


answered Mar 03 '11 at 23:07

Aditya Mukherji


It is not printing anything. I am getting the data from a URL, not from a predefined string. – jack, Mar 03 '11 at 23:13

score 0 · Answer 3 · answered Mar 03 '11 at 23:19

Pyparsing's makeHTMLTags expression will give you results similar to regex, but with automatic results names (like named groups), and tolerance of many HTML idiosyncrasies:

>>> from pyparsing import *
>>>
>>> h = """<a href="rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp"><img src="http://i.ytimg.com/vi/IQY6jukWn-o/default.jpg?w=80&amp;h=60&amp;sigh=izeIwhz4POtPOOr-jRGrtC4qiFA" alt="video" width="80" height="60" style="border:0;margin:0px;" /></a>"""
>>>
>>> aTag = makeHTMLTags("A")[0]
>>> result = aTag.parseString(h)
>>> print(result.dump())
['A', ['href', 'rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp'], False]
- empty: False
- href: rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp
- startA: ['A', ['href', 'rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp'], False]
  - empty: False
  - href: rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp
>>> print(result.href)
rtsp://v8.cache2.c.youtube.com/CjgLENy73wIaLwnqnxbpjjoGIRMYESARFEIJbXYtZ29vZ2xlSARSB3Jlc3VsdHNgpq6joefRgbhNDA==/0/0/0/video.3gp

If you have many anchor tags, and just want those ending in ".3gp" then do:

>>> _3gp_links = [a.href for a in aTag.searchString(h) if a.href.endswith(".3gp")]
