
I know similar questions like this have already been asked on the platform but I checked them and did not find the help I needed.

I have some strings such as:

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"

path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"

I have a function:

def extract_id(path):
    pattern = re.compile(r"([0-9]+(_[0-9]+)+)", re.IGNORECASE)
    return pattern.match(path)

The expected results are 5438_133195_9917949_1218833 and 2356_15890_9397775. I tested the function online and it seems to produce the expected result, but it's returning None in my app. What am I doing wrong? Thanks.

shadowtalker
Ktrel

2 Answers

1

match only matches at the beginning of the string. Your id is not at the start of the path, so it returns None. What you want is search, which scans the string for the first place the pattern fits. You have to use group to retrieve the matched text from the result. You don't need re.IGNORECASE if you are looking for characters that don't have a case. You should also compile your regex only once. Compiling a pattern that never changes, every time a function is called, is unnecessary work.
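To make the difference concrete, here is a small sketch comparing match, search and fullmatch on the pattern from the question. The sample string is a shortened version of one of the question's paths, used only for illustration.

```python
import re

# Pattern from the question; the sample path is shortened for illustration.
pattern = re.compile(r"[0-9]+(_[0-9]+)+")
path = "activewear/2356_15890_9397775?povid=ApparelNav"

at_start = pattern.match(path)      # anchored at index 0, where "a" is not a digit
anywhere = pattern.search(path)     # scans forward until the pattern fits
whole = pattern.fullmatch(path)     # requires the entire string to fit

print(at_start)           # None
print(anywhere.group())   # 2356_15890_9397775
print(whole)              # None
```

Only search finds the id, because the id sits in the middle of the string.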

You could simplify your expression to ((\d+_?)+)\?, which will find a repeating sequence of one or more digits, each optionally followed by an underscore, ending at a question mark.

example:

import re

#do this once
pathid = re.compile(r'((\d+_?)+)\?') 

def extract_id(path:str) -> str:
    if m := pathid.search(path): #make sure there is a match
        return m.group(1)        #return match from group 1 `((\d+_?)+)`
    return None                  #no match

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833

python regex docs

Your id comes after the last / and before the ?. The below solution will likely be much faster. This doesn't search by pattern, it prunes by position.

def extract_id(path:str) -> str:
    #right of the last / to left of the ?
    return path.split('/')[-1].split('?')[0]

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833
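One thing to keep in mind with the positional approach: it never checks that what it returns looks like an id. The sketch below shows how it behaves when the ? is missing or when the last segment is not an id. The extract_id_checked helper and the sample paths are my own additions for illustration, not part of the original question.

```python
import re

# Hypothetical shape check: digit groups joined by underscores.
ID_SHAPE = re.compile(r"\d+(?:_\d+)+")

def extract_id(path: str) -> str:
    # Right of the last "/" to left of the first "?" in that segment
    return path.split('/')[-1].split('?')[0]

def extract_id_checked(path: str):
    # Same pruning, but return None when the result does not look like an id
    candidate = extract_id(path)
    return candidate if ID_SHAPE.fullmatch(candidate) else None

with_query = extract_id("activewear/2356_15890_9397775?povid=ApparelNav")
without_query = extract_id("activewear/2356_15890_9397775")
not_an_id = extract_id("activewear/shoes?povid=ApparelNav")

print(with_query)      # 2356_15890_9397775
print(without_query)   # 2356_15890_9397775
print(not_an_id)       # shoes
print(extract_id_checked("activewear/shoes?povid=ApparelNav"))  # None
```

So a missing ? is harmless, but a path with no id in the last segment silently returns that segment unless you validate it. A / appearing after the ? would also shift which segment gets picked.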
OneMadGypsy
  • Thanks for the help and especially the explanation. It wasn't made clear in my post but not all ids have '?' after them. But the prune option seems to work. Thanks. – Ktrel Oct 30 '22 at 10:22
1

You don't need any capture groups; you can take the whole match and return it with .group() using re.search:

\b\d+(?:_\d+)+\b
  • \b A word boundary
  • \d+ Match 1+ digits
  • (?:_\d+)+ Repeat 1+ times _ and 1+ digits
  • \b A word boundary

Regex demo

import re

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
pattern = re.compile(r"\b\d+(?:_\d+)+\b")
def extract_id(path):
    return pattern.search(path).group()

print(extract_id(path))

Output

5438_133195_9917949_1218833
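Note that search returns None when nothing matches, so calling .group() directly raises AttributeError on a path without an id. If that can happen in your data, a guarded variant might look like the sketch below. The name extract_id_or_none and the "no id" sample string are assumptions of mine for illustration.

```python
import re
from typing import Optional

pattern = re.compile(r"\b\d+(?:_\d+)+\b")

def extract_id_or_none(path: str) -> Optional[str]:
    # search() returns None when there is no match, so check before .group()
    m = pattern.search(path)
    return m.group() if m else None

paths = [
    "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks",
    "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression",
    "activewear/no id in this one",  # hypothetical path without an id
]
results = [extract_id_or_none(p) for p in paths]
print(results)
# ['5438_133195_9917949_1218833', '2356_15890_9397775', None]
```

This also works when there is no ? after the id, since the pattern does not depend on it.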
The fourth bird