
I have the list below, obtained with regex and BeautifulSoup. I need to extract the UID values, e.g. 5968723334.

[u'/home.html', u'browse_settings.html', u'browse.html?', u'test.html?uid=5415292833', u'test.html?uid=5968723334', u'test.html?uid=5968723334', u'test.html?uid=5453943714', u'test.html?uid=5453943714', u'test.html?uid=6740871094', u'test.html?uid=6740871094', u'test.html?uid=5991868792', u'test.html?uid=5991868792', u'test.html?uid=25072413', u'test.html?uid=25072413', u'test.html?uid=6739965683', u'test.html?uid=6739965683', u'test.html?uid=7272910004', u'test.html?uid=7272910004', u'test.html?uid=13179298', u'test.html?uid=13179298', u'test.html?uid=5392816266', u'test.html?uid=5392816266', u'test.html?uid=5992588819', u'test.html?uid=5992588819', u'test.html?uid=6727114420', u'test.html?uid=6727114420', u'test.html?uid=7263648884', u'test.html?uid=7263648884', u'test.html?uid=5447240210', u'test.html?uid=5447240210', u'test.html?uid=5460515002', u'test.html?uid=5460515002', u'test.html?uid=5400731231', u'test.html?uid=5400731231', u'browse.html?params=_F_18_24_GB_0___grid_1', u'/home.html?t=1374068507', u'/account_info.html', u'http://www.example.com/browse.html?params=_F_18_24_GB_0___grid_0', u'http://www.example.com/contact.html', u'/logout.html', u'#top', u'/terms_of_service.html', u'http://safety.example.com']

I've managed to extract one UID like so, but I'd like to extract all of them:

>>> m = re.search(r"uid=(\d*)", soup.contents[0])
>>> print m
<_sre.SRE_Match object at 0x211b210>
>>> print m.group(1)
5442562712

Please help!

Martijn Pieters
yodawg
  • Questions asking for code **must demonstrate a minimal understanding of the problem being solved**. Include attempted solutions, why they didn't work, and the expected results. [See also: Stack Overflow question checklist](http://meta.stackexchange.com/questions/156810/stack-overflow-question-checklist) – HamZa Jul 17 '13 at 13:51
  • updated to include attempted solution... – yodawg Jul 17 '13 at 13:56
  • Did you try anything from http://stackoverflow.com/questions/17681269/python-extract-id-value-from-href-source/ ? – Jon Clements Jul 17 '13 at 13:59

2 Answers


You can loop through your list and apply the regular expression to each:

uid = re.compile(r"uid=(\d*)")
uids = [match.group(1) for match in filter(None, map(uid.search, list_of_urls))]

The above is a compact version of:

uid = re.compile(r"uid=(\d*)")
uids = []
for url in list_of_urls:
    match = uid.search(url)
    if match is not None:
        uids.append(match.group(1))

The code takes into account that some of your URLs do not contain a UID: uid.search() returns None for those, and they are skipped.

Demo:

>>> import re
>>> list_of_urls = [u'/home.html', u'browse_settings.html', u'browse.html?', u'test.html?uid=5415292833', u'test.html?uid=5968723334', u'test.html?uid=5968723334', u'test.html?uid=5453943714', u'test.html?uid=5453943714', u'test.html?uid=6740871094', u'test.html?uid=6740871094', u'test.html?uid=5991868792', u'test.html?uid=5991868792', u'test.html?uid=25072413', u'test.html?uid=25072413', u'test.html?uid=6739965683', u'test.html?uid=6739965683', u'test.html?uid=7272910004', u'test.html?uid=7272910004', u'test.html?uid=13179298', u'test.html?uid=13179298', u'test.html?uid=5392816266', u'test.html?uid=5392816266', u'test.html?uid=5992588819', u'test.html?uid=5992588819', u'test.html?uid=6727114420', u'test.html?uid=6727114420', u'test.html?uid=7263648884', u'test.html?uid=7263648884', u'test.html?uid=5447240210', u'test.html?uid=5447240210', u'test.html?uid=5460515002', u'test.html?uid=5460515002', u'test.html?uid=5400731231', u'test.html?uid=5400731231', u'browse.html?params=_F_18_24_GB_0___grid_1', u'/home.html?t=1374068507', u'/account_info.html', u'http://www.example.com/browse.html?params=_F_18_24_GB_0___grid_0', u'http://www.example.com/contact.html', u'/logout.html', u'#top', u'/terms_of_service.html', u'http://safety.example.com']
>>> uid = re.compile(r"uid=(\d*)")
>>> [match.group(1) for match in filter(None, map(uid.search, list_of_urls))]
[u'5415292833', u'5968723334', u'5968723334', u'5453943714', u'5453943714', u'6740871094', u'6740871094', u'5991868792', u'5991868792', u'25072413', u'25072413', u'6739965683', u'6739965683', u'7272910004', u'7272910004', u'13179298', u'13179298', u'5392816266', u'5392816266', u'5992588819', u'5992588819', u'6727114420', u'6727114420', u'7263648884', u'7263648884', u'5447240210', u'5447240210', u'5460515002', u'5460515002', u'5400731231', u'5400731231']
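
The result above contains most UIDs twice because the input list repeats them. If only distinct values are needed, one possible sketch follows. Note that `unique_uids` is a hypothetical helper name, and the `\d+` pattern (at least one digit, so a bare `uid=` is skipped) is an assumption rather than part of the code above.

```python
import re

# Assumption: require at least one digit so that an empty "uid=" is ignored.
UID_RE = re.compile(r"uid=(\d+)")


def unique_uids(urls):
    """Return each UID found in urls once, in first-seen order (hypothetical helper)."""
    seen = set()
    result = []
    for url in urls:
        match = UID_RE.search(url)
        if match is None:
            continue  # this URL carries no uid parameter
        uid = match.group(1)
        if uid not in seen:
            seen.add(uid)
            result.append(uid)
    return result


sample = [
    '/home.html',
    'test.html?uid=5415292833',
    'test.html?uid=5968723334',
    'test.html?uid=5968723334',
    'browse.html?',
]
print(unique_uids(sample))  # ['5415292833', '5968723334']
```

A plain `set()` would also remove the duplicates, but it would not keep the order in which the UIDs appear on the page.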
Martijn Pieters
  • Thank you. Very useful, almost there now. [u'5993031380', u'5993031380', u'7272953398', u'7272953398'] – yodawg Jul 17 '13 at 14:09

You want re.findall, which returns every match in a string. Join the list into one string first:

>>> contents = [u'/home.html', u'browse_settings.html', u'browse.html?', u'test.html?uid=5415292833', u'test.html?uid=5968723334', u'test.html?uid=5968723334', u'test.html?uid=5453943714', u'test.html?uid=5453943714', u'test.html?uid=6740871094', u'test.html?uid=6740871094', u'test.html?uid=5991868792', u'test.html?uid=5991868792', u'test.html?uid=25072413', u'test.html?uid=25072413', u'test.html?uid=6739965683', u'test.html?uid=6739965683', u'test.html?uid=7272910004', u'test.html?uid=7272910004', u'test.html?uid=13179298', u'test.html?uid=13179298', u'test.html?uid=5392816266', u'test.html?uid=5392816266', u'test.html?uid=5992588819', u'test.html?uid=5992588819', u'test.html?uid=6727114420', u'test.html?uid=6727114420', u'test.html?uid=7263648884', u'test.html?uid=7263648884', u'test.html?uid=5447240210', u'test.html?uid=5447240210', u'test.html?uid=5460515002', u'test.html?uid=5460515002', u'test.html?uid=5400731231', u'test.html?uid=5400731231', u'browse.html?params=_F_18_24_GB_0___grid_1', u'/home.html?t=1374068507', u'/account_info.html', u'http://www.example.com/browse.html?params=_F_18_24_GB_0___grid_0', u'http://www.example.com/contact.html', u'/logout.html', u'#top', u'/terms_of_service.html', u'http://safety.example.com']
>>> import re
>>> m = re.findall(r"uid=(\d*)", " ".join(contents))
>>> m
[u'5415292833', u'5968723334', u'5968723334', u'5453943714', u'5453943714', u'6740871094', u'6740871094', u'5991868792', u'5991868792', u'25072413', u'25072413', u'6739965683', u'6739965683', u'7272910004', u'7272910004', u'13179298', u'13179298', u'5392816266', u'5392816266', u'5992588819', u'5992588819', u'6727114420', u'6727114420', u'7263648884', u'7263648884', u'5447240210', u'5447240210', u'5460515002', u'5460515002', u'5400731231', u'5400731231']
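
As an alternative to a regular expression, the query string can be parsed with the standard library. This is only a sketch of a different technique, not what the code above does: `uids_from_query` is a hypothetical name, and the import shown is the Python 3 location (in Python 2 the same functions live in the `urlparse` module). Unlike the `uid=(\d*)` pattern, it does not pick up a parameter whose name merely ends in `uid`, such as `guid=`, but it also does not check that the value is numeric.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 location


def uids_from_query(urls):
    """Collect the values of the 'uid' query parameter from each URL (hypothetical helper)."""
    uids = []
    for url in urls:
        query = urlparse(url).query  # '' when the URL has no query string
        # parse_qs maps each parameter name to a list of its values
        # and drops blank values by default.
        uids.extend(parse_qs(query).get('uid', []))
    return uids


links = [
    '/home.html',
    'browse.html?',
    'test.html?uid=5415292833',
    'browse.html?params=_F_18_24_GB_0___grid_1',
    '#top',
]
print(uids_from_query(links))  # ['5415292833']
```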
Ford