My goal is to extract numbers (both int and float) from this string:
s = 'cs 0 scn /TT0 1 Tf 0.022 Tc -0.022 Tw 11.04 0 0 11.04 108 723.96 Tm (32)Tj 0 Tc 0 Tw 0.946 0 Td ( )Tj 0.021 Tc -0.01 Tw 5.728 0 Td [(I)-1(N)4(TE)3(R)15(M)-2(E)3(D)15(I)-1(A)4(TE)]Tj'
p = r'\s-?\d+(\.\d{1,3})?\s'
Since the decimal point will be followed by 1-3 digits, the \.\d{1,3} part has to be grouped (i.e., placed within parentheses) and followed by a ? since it's optional.
However, using that regex with re.findall(p, s) gives me this:
['', '', '.022', '.022', '.04', '', '', '', '', '.946', '.021', '.01', '.728']
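For reference, here is a self-contained script that reproduces this output (the variable name result is mine, added only so the value can be inspected):

```python
import re

# The input string from above, split across lines only for readability.
s = ('cs 0 scn /TT0 1 Tf 0.022 Tc -0.022 Tw 11.04 0 0 11.04 108 723.96 Tm (32)Tj '
     '0 Tc 0 Tw 0.946 0 Td ( )Tj 0.021 Tc -0.01 Tw 5.728 0 Td '
     '[(I)-1(N)4(TE)3(R)15(M)-2(E)3(D)15(I)-1(A)4(TE)]Tj')

# First attempt: the only parentheses are around the optional decimal part.
p = r'\s-?\d+(\.\d{1,3})?\s'

result = re.findall(p, s)
print(result)
```

Each entry corresponds to one match, but what is returned is only the text captured by the parenthesized part, which is empty whenever the number has no decimal point.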
Only the decimal parts are extracted, and integers come back as empty strings. So I tried placing the entire number, including the optional decimal part, inside parentheses:
p = r'\s(-?\d+(\.\d{1,3})?)\s'
re.findall(p, s)
# Result
>>> [('1', ''), ('0.022', '.022'), ('-0.022', '.022'), ('11.04', '.04'), ('0', ''), ('108', '')]
But it returns a list of tuples, where each tuple contains the entire number and the decimal part separately.
Further, it fails to match 723.96, which comes right after 108 (which is matched). As far as I can tell, both are surrounded by spaces, yet one is matched while the other isn't.
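The skipped number seems to reproduce on a much smaller input, so it does not look specific to my string. Below is a reduced case (the names single_space and double_space are made up for this test). My guess, which I have not confirmed, is that the one space between the two numbers can only belong to a single match:

```python
import re

p = r'\s(-?\d+(\.\d{1,3})?)\s'

# Two numbers separated by ONE space: only the first is found.
single_space = re.findall(p, ' 108 723.96 Tm')
print(single_space)   # [('108', '')]

# Same numbers separated by TWO spaces: both are found.
double_space = re.findall(p, ' 108  723.96 Tm')
print(double_space)   # [('108', ''), ('723.96', '.96')]
```

If that guess is right, the trailing \s of the match for 108 uses up the space that the leading \s of the next match would need.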
Using \b instead of \s matches 723.96, but it also matches the numbers in the [(I)-1(N)4(TE)3(R)15(M)-2(E)3(D)15(I)-1(A)4(TE)] part of the string. I don't want these to be matched.
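A shortened example of the \b behavior, assuming the pattern is simply the first one with each \s swapped for \b (the fragment string and the names p_b and found are illustrative, not from my real code). I use re.finditer with group(0) here only so that the whole match is shown rather than the captured group:

```python
import re

# Shortened stand-in for the real string: two wanted numbers, then a bracketed part.
fragment = '108 723.96 Tm [(I)-1(N)4(TE)3]Tj'

# Assumed \b variant of the original pattern.
p_b = r'\b-?\d+(\.\d{1,3})?\b'

found = [m.group(0) for m in re.finditer(p_b, fragment)]
print(found)   # ['108', '723.96', '1', '4', '3']
```

The last three entries come from inside the brackets, which is exactly what I want to avoid.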
Just in case I haven't been clear about my goal, here's the desired output:
['0', '1', '0.022', '-0.022', '11.04', '0', '0', '11.04', '108', '723.96', '0', '0', '0.946', '0', '0.021', '-0.01', '5.728', '0']
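As a stopgap, splitting on whitespace and checking each token against the number pattern appears to give exactly this list. This is only a sketch of a workaround (the names number and tokens are mine), not the single-regex solution I'm after:

```python
import re

s = ('cs 0 scn /TT0 1 Tf 0.022 Tc -0.022 Tw 11.04 0 0 11.04 108 723.96 Tm (32)Tj '
     '0 Tc 0 Tw 0.946 0 Td ( )Tj 0.021 Tc -0.01 Tw 5.728 0 Td '
     '[(I)-1(N)4(TE)3(R)15(M)-2(E)3(D)15(I)-1(A)4(TE)]Tj')

# Same number pattern as before, but without the surrounding \s.
number = re.compile(r'-?\d+(\.\d{1,3})?')

# Keep only whitespace-separated tokens that are a number in their entirety.
tokens = [t for t in s.split() if number.fullmatch(t)]
print(tokens)
```

Because fullmatch requires the whole token to be a number, pieces such as (32)Tj or the bracketed run are rejected as a unit.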
Apart from getting a solution to this problem, I'd like to understand the behavior of the regex patterns I used.