Detecting alphanumeric/numeric values in python string

Question

I'm trying to extract tokens, or parts of tokens, that contain numeric or alphanumeric characters and have a length greater than 8 from the text.

Example:

text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'

The expected output would be :

59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7

I tried the regular expression ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*), based on the answer to "Python Alphanumeric Regex", and got the following results:

re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)", text)

Output :
[('59800512', '59800512', ''),
 ('510557', '510557', ''),
 ('XXXXXX2302', '', 'XXXXXX2'),
 ('1601371803', '1601371803', ''),
 ('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
 ('69', '69', ''),
 ('i57j0i22i30l8j0i390', '', 'i5'),
 ('4672', '4672', ''),
 ('j0j7', '', 'j0'),
 ('8', '8', '')]

I'm getting a tuple of matching groups for each matching token.

It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.

Could anyone suggest a solution? It need not be based on regular expressions.

Thanks in advance

Edit : I expect alphanumeric values of length equal to or greater than 8

Please do note that when you have a requirement like "*a length greater than 8*", it seems strange that you expect results like "4672j0j7" and "59800512". — JvdV, May 04 '21 at 08:38
@ JvdV - My mistake while stating the question. I expect the alphanumeric values of length equal to or greater than 8. That was exactly the reason for adding the examples: "4672j0j7" and "59800512" in the text string. I have edited the question. — Suneha K S, May 04 '21 at 08:51

The fourth bird · Answer 1 · 2021-05-04T08:30:27.330

You get tuples in the result because re.findall returns the values of the capture groups when the pattern contains more than one group.
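To illustrate the point about capture groups, here is a minimal sketch comparing the question's pattern with a variant that uses a non-capturing group. The `sample` string and the variable names are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample text, shortened from the question's input
sample = "id 59800512 and XXXXXX2302"

# With several capture groups, findall returns one tuple per match,
# holding the text captured by each group.
with_groups = re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)", sample)

# With a non-capturing group (?:...), findall returns the whole match as a string.
without_groups = re.findall(r"(?:\d+|[A-Za-z]+\d[\dA-Za-z]*)", sample)

print(with_groups)
print(without_groups)
```

The second call shows that dropping the capture groups is enough to get plain strings back. It does not yet enforce the minimum length.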

But you can omit the capture groups and change the pattern to a single match: match at least one digit surrounded by optional chars A-Z a-z or digits, and use a positive lookahead to assert a minimum of 8 characters.

\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b

\b A word boundary
(?=[A-Za-z0-9]{8}) Positive lookahead, assert that the next 8 characters are all in the listed ranges
[A-Za-z]* Optionally match chars A-Z a-z (zero or more)
\d Match a single digit
[A-Za-z\d]* Optionally match chars A-Z a-z or digits (zero or more)
\b A word boundary

See a regex demo or a Python demo.

import re
from pprint import pprint

pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"

pprint(re.findall(pattern, s))

Output

['59800512',
 '510557XXXXXX2302',
 '1601371803',
 'NhLw6NlR0EksRWkLddEo7NiEvrg',
 '69i57j0i22i30l8j0i390',
 '4672j0j7']
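As a rough check of how the lookahead and the mandatory digit interact, the sketch below runs the same pattern over a few made-up tokens (they are illustrative and do not come from the question): one that is too short, one without any digit, and three that qualify.

```python
import re

pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"

# Hypothetical tokens chosen to exercise each part of the pattern
tokens = "abc1234 abcdefgh abcdefg1 thisword2H2g2d 12345678"

# "abc1234"  has only 7 characters, so the lookahead fails
# "abcdefgh" has 8 characters but no digit, so \d fails
matches = re.findall(pattern, tokens)
print(matches)
```

Because the length check lives in the lookahead, the position of the first digit inside the token does not matter.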

@ The fourth bird - Thanks a lot for the answer and the explanation. It worked perfectly. — Suneha K S, May 04 '21 at 09:08

score 2 · Accepted Answer · answered May 04 '21 at 08:22

I came up with:

\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b

See an online demo

\b - Word boundary.
[A-Za-z]{,7} - 0 to 7 alphabetic chars.
\d - A single digit.
[A-Za-z\d]{7,} - 7+ times an alphanumeric char.
\b - Word boundary.

Some sample code:

import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)

Prints:

['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']

You could opt to match case-insensitively with:

(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
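A short sketch of the case-insensitive variant, assuming a shortened sample string of my own. It also shows that passing `re.IGNORECASE` is equivalent to the inline `(?i)` flag, which must sit at the start of the pattern in recent Python versions.

```python
import re

# Hypothetical sample, shortened from the question's text
s = "ID NhLw6NlR0EksRWkLddEo7NiEvrg and 4672j0j7"

# Inline flag at the start of the pattern
inline = re.findall(r"(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b", s)

# Same pattern with the flag passed as an argument
flagged = re.findall(r"\b[a-z]{,7}\d[a-z\d]{7,}\b", s, flags=re.IGNORECASE)

print(inline)
print(flagged)
```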

@ JvdV - It is very simple and it worked. I'm going to select this answer as it is quite simple. Thanks a lot for the answer and the explanation. — Suneha K S, May 04 '21 at 09:08

score 0 · Answer 3 · answered Sep 27 '21 at 13:15

Although the selected answer returns the required output, it is not generic: it fails on tokens with more than 7 letters before the first digit (e.g., s = "thisword2H2g2d").

For a more generic regex that works for all combinations of alphanumeric values:

result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)

See the demo here.
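For comparison, here is a hedged sketch that runs the accepted pattern and this answer's pattern on the same made-up string. Two things here are my own assumptions rather than part of either answer: `re.finditer` with `group(0)` is used to get whole matches instead of group tuples, and a separate length filter is added because this pattern does not itself enforce the 8-character minimum.

```python
import re

# Hypothetical sample containing the failing case from this answer
s = "thisword2H2g2d abc1 4672j0j7"

accepted = r"\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b"
generic = r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)"

# The accepted pattern allows at most 7 letters before the first digit
accepted_hits = re.findall(accepted, s)

# group(0) is the whole match, regardless of which capture group matched
generic_hits = [m.group(0) for m in re.finditer(generic, s)]

# Assumed extra step: keep only tokens of length 8 or more
long_hits = [t for t in generic_hits if len(t) >= 8]

print(accepted_hits)
print(generic_hits)
print(long_hits)
```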
