python extract id value from href source

Question

I've managed to extract the href URIs from the page source using BeautifulSoup, but I now want to extract the uid value from each of several links like the ones below:

e.g.

<a href="test.html?uid=5444974">
<a href="test.html?uid=5444972">
<a href="test.html?uid=54444972">

Help would be greatly appreciated!

If you can extract the `href` attribute, then [urlparse](http://docs.python.org/2/library/urlparse.html) will help you — Dan Lecocq, Jul 16 '13 at 15:53
http://stackoverflow.com/a/11281019/594589 as @DanLecocq suggested — dm03514, Jul 16 '13 at 15:54

score 1 · Accepted Answer · answered Jul 16 '13 at 15:58

>>> import re
>>> from bs4 import BeautifulSoup
>>> html
'<a href="test.html?uid=5444974">\n<a href="test.html?uid=5444972">\n<a href="test.html?uid=54444972">'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> anchors = soup.find_all('a')
>>> r = re.compile(r'uid=(\d+)')  # raw string, so \d reaches the regex engine unchanged
>>> uids = []
>>> for a in anchors:
...     uids.append(r.search(a['href']).group(1))
... 
>>> uids
['5444974', '5444972', '54444972']
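
One caveat with the loop above: r.search() returns None for any href that has no uid, so .group(1) would raise AttributeError on such a link. Below is a minimal sketch of a guarded version, assuming you already have the href strings in a list. The helper name extract_uid and the third sample href are made up for illustration, and the pattern is anchored on ? or & (my own tweak, not part of the answer above) so that a parameter such as guid= is not matched by accident.

```python
import re

# Match "uid=<digits>" only when it starts a query parameter.
UID_RE = re.compile(r'[?&]uid=(\d+)')

def extract_uid(href):
    """Return the uid digits from href as a string, or None if absent."""
    match = UID_RE.search(href)
    return match.group(1) if match else None

# Sample hrefs; "index.html" is a hypothetical link with no uid.
hrefs = ["test.html?uid=5444974", "test.html?uid=5444972", "index.html"]

# Keep only the links that actually carry a uid.
uids = [uid for uid in map(extract_uid, hrefs) if uid is not None]
print(uids)  # ['5444974', '5444972']
```

Returning None instead of raising leaves it to the caller to decide whether a link without a uid is an error or something to skip.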

score 1 · Answer 2 · answered Jul 16 '13 at 15:59

Use urlparse and parse_qs:

html = """<a href="test.html?uid=5444974">
<a href="test.html?uid=5444972">
<a href="test.html?uid=54444972">
"""

from bs4 import BeautifulSoup as BS
from urlparse import urlparse, parse_qs
soup = BS(html)
for a in soup('a', href=True):
    print parse_qs(urlparse(a['href']).query)['uid'][0]

Output:

5444974
5444972
54444972
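
The snippet above is Python 2 (the urlparse module and the print statement). On Python 3 the same functions live in urllib.parse. Here is a rough equivalent, assuming the href strings have already been pulled out of the page as described in the question. The helper name uid_from_href is hypothetical, and it returns None rather than raising KeyError when a link has no uid.

```python
from urllib.parse import urlparse, parse_qs

def uid_from_href(href):
    """Return the first uid query value in href, or None if there is none."""
    query = parse_qs(urlparse(href).query)  # e.g. {'uid': ['5444974']}
    values = query.get("uid")
    return values[0] if values else None

hrefs = [
    "test.html?uid=5444974",
    "test.html?uid=5444972",
    "test.html?uid=54444972",
]

for href in hrefs:
    print(uid_from_href(href))
```

Unlike a regex, parse_qs also handles the case where uid is not the first parameter or the value is percent-encoded.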
