-1

I'm using Python 3.7. I want to extract the value of the q parameter from a URL's query string, i.e. the part between "q=" and the next "&". I have this code:

    href = span.a['href']
    print("href:" + href)
    matchObj = re.match( r'q=(.*?)\&', href, re.M|re.I)
    if matchObj:
        criteria = matchObj.group(1)

but despite the fact that my href is this

href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB

matchObj is always None, so the subsequent lines aren't evaluated. What do I need to change to fix my regex?

Miss Chanandler Bong
Dave

3 Answers

1

You can use the urllib.parse module, which parses the query string for you and decodes each "+" into a space.

Ex:

import urllib.parse as urlparse
url = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"
data = urlparse.urlparse(url)
print(urlparse.parse_qs(data.query)['q'][0])

Output:

bet i won t get one share
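
Indexing with ['q'] raises a KeyError when the parameter is missing. A minimal sketch of a more defensive variant follows. The helper name get_query_param and the shortened URL are assumptions made for illustration, not part of urllib or of the original href.

```python
from urllib.parse import urlparse, parse_qs

def get_query_param(url, name, default=None):
    """Return the first decoded value of a query parameter, or default if absent."""
    query = urlparse(url).query
    values = parse_qs(query).get(name)
    return values[0] if values else default

# Shortened, hypothetical URL used only for this example
href = "/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch"
print(get_query_param(href, "q"))        # bet i won t get one share
print(get_query_param(href, "missing"))  # None
```

Note that parse_qs drops parameters with empty values by default, so "q=" with nothing after it is treated as absent here.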
Rakesh
0

You're using the wrong function if you want to match in the middle of the string. re.match only matches at the start of the string. From the documentation:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

Use re.search here instead.

import re
href = 'href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB'
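print(href)
# re.search scans the whole string; re.match only tries position 0
matchObj = re.search( r'q=(.*?)\&', href, re.M|re.I)
if matchObj:
    criteria = matchObj.group(1)
    # print inside the if block so criteria is always defined here
    print(criteria)

Output (after the href line):

bet+i+won+t+get+one+share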
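
The difference between the two functions is easy to see on a short string. The sketch below uses a shortened, made-up URL rather than the original href.

```python
import re

# Shortened, hypothetical URL used only for this example
url = "/search?hl=en-US&q=bet+i+won&tbm=isch"
pattern = r"q=(.*?)&"

# re.match only tries position 0, where the string starts with "/search"
print(re.match(pattern, url))            # None

# re.search scans forward until the pattern matches somewhere
print(re.search(pattern, url).group(1))  # bet+i+won
```

Unlike urllib.parse.parse_qs, the regex returns the raw value, so each "+" stays a "+" rather than becoming a space.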
Paritosh Singh
0

Here, we can apply a simple expression with left and right boundaries, such as:

&q=(.+?)&

Demo

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"&q=(.+?)&"

test_str = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
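
One caveat with &q=(.+?)& is that it needs an "&" on both sides, so it misses q when it is the first or the last parameter. A possible variant, offered as a sketch rather than a drop-in replacement, is [?&]q=([^&]*). The name extract_q and the sample URLs are assumptions for illustration.

```python
import re

# "?" or "&" must precede "q=", and the value runs up to the next "&" or the end
Q_PARAM = re.compile(r"[?&]q=([^&]*)")

def extract_q(url):
    """Return the raw (still URL-encoded) value of q, or None if it is absent."""
    m = Q_PARAM.search(url)
    return m.group(1) if m else None

print(extract_q("/search?q=first+param&hl=en"))  # first+param
print(extract_q("/search?hl=en&q=last+param"))   # last+param
print(extract_q("/search?hl=en"))                # None
```

Requiring "?" or "&" before "q=" also avoids matching a longer parameter name that merely ends in q.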

RegEx Circuit

jex.im can be used to visualize regular expressions.

Emma