I have a list of URLs from which I am trying to extract just the id numbers. I am trying to solve this using a combination of urlparse and regular expressions. Here is what my function looks like:

import re
from urllib.parse import urlparse

def url_cleanup(url):
    parsed_url = urlparse(url)
    if parsed_url.query=="fref=ts":
        return 'https://www.facebook.com/'+re.sub('/', '', parsed_url.path)
    else:
        qry =  parsed_url.query
        result = re.search('id=(.*)&fref=ts',qry)
        return 'https://www.facebook.com/'+result.group(1)

However, the regular expression in result = re.search('id=(.*)&fref=ts',qry) fails to match some of the URLs, as shown in the example below.

#1 
id=10001332443221607 #No match

#2 
id=6383662222426&fref=ts #matched

I tried following the suggestion in this answer by rewriting my regular expression as id=(.*).+?(?=&fref=ts), which again matches #2 but not #1 in the examples above.

I am not sure what I am missing here. Any suggestion/hint will be much appreciated.

kingmakerking
  • There are a few online regex testers that use the Python flavor. They are very convenient for crafting patterns; https://regex101.com/ is one. Have you tried `'id=(\d*)'` for a pattern? – wwii Dec 13 '16 at 16:15

2 Answers

Your regexes are indeed wrong.

Using the expression id=(.*)&fref=ts, you will only match ids that are followed by the literal text &fref=ts.

Using id=(.*).+?(?=&fref=ts) does the same thing with a lookahead, which is a zero-width assertion: it checks that &fref=ts follows without including it in the match. The match is therefore only the id=blablabla part, and still only when it is followed by &fref=ts.

Moreover, id=(.*) will match ids made up of digits, letters, symbols, literally anything. Using id=\d+ will match only ids made up of digits.

So, try using

result = re.search(r'id=(\d+)', qry)

This matches just the digits, assuming your ids are always numeric, and captures only those digits (via the parentheses) for later use.
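As a quick check, here is a minimal sketch that runs the pattern against the two sample query strings from the question. The helper name extract_id is made up for illustration and is not part of the original code.

```python
import re

# The two sample query strings from the question.
queries = ["id=10001332443221607", "id=6383662222426&fref=ts"]

def extract_id(query):
    """Return the digits following 'id=', or None if there is no match.

    extract_id is a hypothetical helper used only for this illustration.
    """
    match = re.search(r"id=(\d+)", query)
    return match.group(1) if match else None

for q in queries:
    # The original pattern requires a trailing '&fref=ts'; the new one does not.
    old = re.search(r"id=(.*)&fref=ts", q)
    print(q, "| old pattern matched:", old is not None, "| new id:", extract_id(q))
```

With the original pattern, the first query string yields no match object at all, while the digit-only pattern returns an id for both.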

For further reference, refer to http://www.regular-expressions.info/python.html

Victor Lia Fook

Your regex needs tweaking slightly. Try:

result = re.search(r'id=(\d+)(&fref=ts)?', qry)

id=(\d+) matches one or more digits following id=, and (&fref=ts)? makes the trailing &fref=ts optional. Because that part is captured as its own group, you can add it back in if necessary.

You should also note that re.search returns None when no match is found, so calling result.group(1) on it raises an AttributeError. You might want to change the code slightly to:

result = re.search(r'id=(\d+)(&fref=ts)?', qry)
if result:
    return 'https://www.facebook.com/'+result.group(1)
else:
    # no id found in the query string; handle the error as appropriate
    return None
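Put together, the whole function might look something like the sketch below. It assumes Python 3 (urllib.parse), returns None when no id is found, and the example URLs in the comments are made up for illustration rather than taken from the question.

```python
import re
from urllib.parse import urlparse

BASE = "https://www.facebook.com/"

def url_cleanup(url):
    """Return a cleaned profile URL, or None if no id can be found."""
    parsed_url = urlparse(url)
    if parsed_url.query == "fref=ts":
        # Username-style URL: keep the path with the slashes removed.
        return BASE + parsed_url.path.replace("/", "")
    # The trailing '&fref=ts' is optional, so both query shapes match.
    result = re.search(r"id=(\d+)(&fref=ts)?", parsed_url.query)
    if result:
        return BASE + result.group(1)
    return None  # no id in the query string

# Hypothetical inputs, for illustration only:
# url_cleanup("https://www.facebook.com/profile.php?id=6383662222426&fref=ts")
# url_cleanup("https://www.facebook.com/profile.php?id=10001332443221607")
# url_cleanup("https://www.facebook.com/some.user?fref=ts")
```

Returning None leaves it to the caller to decide whether a missing id should be skipped or reported; raising an exception would be an equally reasonable choice.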
asongtoruin