1

I'm trying to extract tokens, or parts of tokens, that consist of numeric or alphanumeric characters and have a length greater than 8 from the text.

Example:

text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'

The expected output would be :

59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7

I have tried the regular expression ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*), based on an answer to the question "Python Alphanumeric Regex", and got the following results:

[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)] 

Output :
[('59800512', '59800512', ''),
 ('510557', '510557', ''),
 ('XXXXXX2302', '', 'XXXXXX2'),
 ('1601371803', '1601371803', ''),
 ('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
 ('69', '69', ''),
 ('i57j0i22i30l8j0i390', '', 'i5'),
 ('4672', '4672', ''),
 ('j0j7', '', 'j0'),
 ('8', '8', '')]

I'm getting a tuple of matching groups for each matching token.

It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.

Could anyone suggest a solution? It need not be based on regular expressions.

Thanks in advance

Edit : I expect alphanumeric values of length equal to or greater than 8

Suneha K S
  • 1
    Please do note that when you have a requirement like "*a length greater than 8*", it seems strange that you expect results like "4672j0j7" and "59800512". – JvdV May 04 '21 at 08:38
  • 1
    @ JvdV - My mistake while stating the question. I expect the alphanumeric values of length equal to or greater than 8. That was exactly the reason for adding the examples: "4672j0j7" and "59800512" in the text string. I have edited the question. – Suneha K S May 04 '21 at 08:51

3 Answers

3

You get the tuples in the result, as re.findall returns the values of the capture groups.
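A minimal sketch of that behaviour, using a made-up sample string and simplified patterns (both are assumptions for illustration, not the data from the question):

```python
import re

# Hypothetical sample string, chosen only to illustrate the behaviour.
sample = "abc123 456"

# Two capture groups: findall returns one tuple per match, one slot per group.
with_groups = re.findall(r"(\d+)|([a-z]+\d+)", sample)
print(with_groups)     # [('', 'abc123'), ('456', '')]

# Non-capturing groups: findall returns the matched text itself.
without_groups = re.findall(r"(?:\d+)|(?:[a-z]+\d+)", sample)
print(without_groups)  # ['abc123', '456']
```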

But you can omit the capture groups and use a single pattern that matches at least one digit among the chars A-Z a-z, and asserts a minimum of 8 characters using a positive lookahead.

\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b
  • \b A word boundary
  • (?=[A-Za-z0-9]{8}) Positive lookahead, assert that the next 8 characters are all in the listed ranges
  • [A-Za-z]* Match zero or more chars A-Z a-z
  • \d Match a digit
  • [A-Za-z\d]* Match zero or more chars A-Z a-z or digits
  • \b A word boundary
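To see which part of the pattern rejects which kind of token, here is a small sketch. The candidate tokens are hypothetical and were picked to exercise the length check and the digit requirement separately:

```python
import re

pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"

# Hypothetical tokens, not taken from the question.
candidates = [
    "ab1",       # too short: the lookahead fails
    "abcdefgh",  # long enough, but there is no digit for \d
    "abcdefg1",  # exactly 8 chars, digit at the end
    "12345678",  # digits only
    "abc1defg",  # digit in the middle
]

# fullmatch checks each token as a whole against the pattern.
matched = [t for t in candidates if re.fullmatch(pattern, t)]
print(matched)  # ['abcdefg1', '12345678', 'abc1defg']
```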

See a regex demo or a Python demo.

import re
from pprint import pprint

pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"

pprint(re.findall(pattern, s))

Output

['59800512',
 '510557XXXXXX2302',
 '1601371803',
 'NhLw6NlR0EksRWkLddEo7NiEvrg',
 '69i57j0i22i30l8j0i390',
 '4672j0j7']
The fourth bird
2

I came up with:

\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b

See an online demo

  • \b - Word boundary.
  • [A-Za-z]{,7} - 0 to 7 alphabetic characters.
  • \d - A single digit.
  • [A-Za-z\d]{7,} - 7+ times an alphanumeric char.
  • \b - Word boundary.

Some sample code:

import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)

Prints:

['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']

You could opt to match case-insensitive with:

(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
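As a sketch, the inline (?i) can also be written with the re.IGNORECASE flag. The sample string below is made up for illustration:

```python
import re

# Hypothetical sample with mixed-case tokens.
s = "NhLw6NlR0EksRWkLddEo7NiEvrg 4672J0J7 words"

# Inline flag at the start of the pattern.
inline = re.findall(r"(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b", s)

# Same pattern with the flag passed as an argument.
flagged = re.findall(r"\b[a-z]{,7}\d[a-z\d]{7,}\b", s, flags=re.IGNORECASE)

print(inline)   # ['NhLw6NlR0EksRWkLddEo7NiEvrg', '4672J0J7']
print(flagged)  # same result
```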
JvdV
  • 1
    @ JvdV - It is very simple and it worked. I'm going to select this answer as it is quite simple. Thanks a lot for the answer and the explanation. – Suneha K S May 04 '21 at 09:08
0

Although the selected answer returns the required output, it is not generic and fails to match some cases (e.g., s = "thisword2H2g2d").
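The limitation can be checked with a short sketch. The selected pattern allows at most 7 letters before the first digit and needs at least 7 more alphanumeric characters after it, so tokens outside that shape are skipped. The lookahead pattern from the other answer on this page is included for comparison, and "abc1defg" is an extra hypothetical token:

```python
import re

selected = r"\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b"
lookahead = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"

# "thisword2H2g2d" has 8 letters before its first digit;
# "abc1defg" has only 4 characters after its first digit.
s = "thisword2H2g2d abc1defg"

selected_hits = re.findall(selected, s)
lookahead_hits = re.findall(lookahead, s)

print(selected_hits)   # []
print(lookahead_hits)  # ['thisword2H2g2d', 'abc1defg']
```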

For a more generic regex that works for all combinations of alphanumeric values:

result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)

See the demo here.
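Note that this pattern has two capture groups, so re.findall returns tuples, and the pattern itself does not enforce the minimum length of 8. One possible way to handle both, sketched here with re.finditer and a length filter (the sample string is hypothetical):

```python
import re

pattern = r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)"

# Hypothetical sample: one long mixed token, one short token, one long number.
s = "thisword2H2g2d ab1 59800512"

# group(0) is the whole match, regardless of which capture group matched.
tokens = [m.group(0) for m in re.finditer(pattern, s)]

# Apply the length requirement from the question as a separate step.
long_tokens = [t for t in tokens if len(t) >= 8]

print(tokens)       # ['thisword2H2g2d', 'ab1', '59800512']
print(long_tokens)  # ['thisword2H2g2d', '59800512']
```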

Bill