Regular Expression to extract specific text from string

Question

I am new to regex and am trying to extract a 16-character piece of text from a list of strings.

Sample list:

myString = ['  pon-3-1    |    UnReg 5A594F4380661123           1234567890               Active',
            '  pon-3-1    |    UnReg 5A594F43805FA456           1234567890               Active',
            '  pon-3-1    |    UnReg 4244434D73B24789           1234567890               Active', 
            '  pon-3-1    |    UnReg 5A594F43805FB000           1234567890               Active',
            'sw-frombananaramatoyourmama-01'
           ]

I cannot use a simple regex like (\w{16}), as this matches any run of 16 word characters. I also tried (\w+A), which, depending on the characters in the string, doesn't return the correct results.

import re

newArr = []
for i in myString:
    # \w{16} matches the first run of 16 word characters on each line
    number = re.search(r'(\w{16})', i)
    newArr.append(number[0])

print(newArr)

Returns:

['5A594F4380661123', '5A594F43805FA456', '4244434D73B24789', '5A594F43805FB000', 'frombananaramato']

I want to extract only:
- 5A594F4380661123
- 5A594F43805FA456
- 4244434D73B24789
- 5A594F43805FB000

Any ideas?

Many thanks in advance

Well, how do *you* distinguish those specific 16-character substrings from others? If they're always uppercase, for example, use that fact. You could also look at the word boundaries; regex supports that. – jonrsharpe Aug 15 '19 at 07:34
@jonrsharpe Yes, that's right. Always in caps. Note, though, that I am new to regex and not yet sure how to combine the digits with the uppercase letters. – Stephan Du Toit Aug 15 '19 at 07:45
Then see e.g. https://stackoverflow.com/questions/4736/learning-regular-expressions – jonrsharpe Aug 15 '19 at 07:53
@politicalscientist. Thanks so much, it seems to work. I prefer to use the re.search method as it will prevent me from making too many additional changes to the rest of the code in the application. But I will use it if I don't come right with another option. – Stephan Du Toit Aug 15 '19 at 07:58

tripleee · Answer 1 · 2019-08-15T08:08:05.843

1

If you want to make sure the 16 characters are not directly adjacent to other word characters, try

re.search(r'\b([0-9A-F]{16})\b', i)

The \b "word boundary" operator matches at a position that has a word character (a letter, digit, or underscore) on one side and a non-word character, or the start or end of the string, on the other.

(If you want to be more specific about exactly which characters may not be adjacent to the match, you can use lookarounds:

re.search(r'(?<![0-9A-F])([0-9A-F]{16})(?![0-9A-F])', i)

where (?<!...) says "cannot be preceded by ..." and (?!...) says "cannot be followed by ...".)

You'll also notice that I tightened up the character class to only match hex digits, which by itself is already sufficient to solve your example problem, and used r'...' raw strings for the regexes, which you should probably always do (at least until you completely understand how backslashes in Python non-raw strings are mangled).
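As a rough sketch, the first pattern could slot into the loop from the question like this. The names HEX16, lines, and serials are illustrative and not from the original code, and skipping lines that don't match is an assumption about the desired behaviour.

```python
import re

# Sample data from the question (runs of spaces shortened for readability).
lines = [
    '  pon-3-1    |    UnReg 5A594F4380661123    1234567890    Active',
    '  pon-3-1    |    UnReg 5A594F43805FA456    1234567890    Active',
    '  pon-3-1    |    UnReg 4244434D73B24789    1234567890    Active',
    '  pon-3-1    |    UnReg 5A594F43805FB000    1234567890    Active',
    'sw-frombananaramatoyourmama-01',
]

# Exactly 16 uppercase hex digits, delimited by word boundaries.
HEX16 = re.compile(r'\b([0-9A-F]{16})\b')

serials = []
for line in lines:
    match = HEX16.search(line)
    if match is not None:  # re.search returns None when nothing matches
        serials.append(match.group(1))

print(serials)
```

The None check matters here: with the tighter character class, the hostname line no longer matches, so indexing the result directly would raise a TypeError.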

edited Aug 15 '19 at 08:08

answered Aug 15 '19 at 07:52

tripleee


Whoa! That's impressive thanks. Both solutions work flawlessly. I also tested @בנימין כהן snippet which works perfectly. If his and your first snippet are to be compared, I assume your snippets provide a more strict and accurate search for the string? Or are they just 2x different ways of achieving the same result. – Stephan Du Toit Aug 15 '19 at 08:04
The other answer doesn't restrict the context of the match, so it will still pick out the first 16 of e.g. a string of 20 characters. With the boundary markers, you don't match a substring of a longer string of hex. – tripleee Aug 15 '19 at 10:08
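To illustrate the difference described in that comment, here is a small sketch using a made-up input line containing 20 hex digits (the string too_long is hypothetical, not from the question's data):

```python
import re

# Hypothetical input: a run of 20 hex digits, which should not count as a 16-digit code.
too_long = 'UnReg 5A594F4380661123ABCD  Active'

# Without context, the pattern picks out the first 16 characters of the longer run.
unbounded = re.search(r'([0-9A-F]{16})', too_long)

# With word boundaries, the 20-character run is rejected as a whole.
bounded = re.search(r'\b([0-9A-F]{16})\b', too_long)

print(unbounded.group(1))  # 5A594F4380661123
print(bounded)             # None
```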

score 0 · Answer 2 · answered Aug 15 '19 at 07:40

Use a character set:

number = re.search(r"([\dABCDEF]{16})", i)

This searches for any run of 16 characters made up of digits (\d) or the letters 'A', 'B', 'C', 'D', 'E', or 'F'.
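A quick sketch of the character set on two lines from the question. The set is written as the range A-F here, which is equivalent to listing the letters; the variable names are only for illustration.

```python
import re

# [\dA-F] is the same set as [\dABCDEF], written with a range.
pattern = re.compile(r'[\dA-F]{16}')

hit = pattern.search('  pon-3-1    |    UnReg 4244434D73B24789    1234567890    Active')
miss = pattern.search('sw-frombananaramatoyourmama-01')

print(hit.group(0))  # 4244434D73B24789
print(miss)          # None, the hostname has no run of 16 hex characters
```

Note that in Python 3, \d in a str pattern also matches non-ASCII decimal digits unless re.ASCII is passed, so [0-9A-F] is the stricter spelling if only ASCII digits are expected.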

בנימין כהן


Wolf · Answer 3 · 2019-08-15T07:58:20.630

0

Be more specific in your regex: tell it what you know!

If you notice that the actual results differ from the expected ones in some specific way, use that difference to your advantage.

\w matches letters ([A-Za-z]), digits ([0-9]), and _, but you seem to be searching for 16 hexadecimal digits. Build a specific character class.

Another observation is that you want the 16-hex-digit blocks enclosed in spaces. This can be expressed by adding context patterns around the capturing part, as in before(capt)after, or by adding anchors/boundaries.
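One possible way to spell out the "enclosed in spaces" idea, as a sketch: the context is described with a non-capturing group and a lookahead, and only the hex block is captured. The exact context patterns are an assumption; the answer does not prescribe them.

```python
import re

# (?:^|\s)  start of string or whitespace before the block (not captured)
# (?=\s|$)  whitespace or end of string after the block (not consumed)
pattern = re.compile(r'(?:^|\s)([0-9A-F]{16})(?=\s|$)')

line = '  pon-3-1    |    UnReg 5A594F43805FB000    1234567890    Active'
match = pattern.search(line)
print(match.group(1) if match else None)  # 5A594F43805FB000

# Unlike \b, this rejects a block that is glued to punctuation such as a hyphen.
print(pattern.search('id-5A594F43805FB000'))  # None
```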

edited Aug 15 '19 at 07:58

answered Aug 15 '19 at 07:42

Wolf


Noted, thanks! I have received a few proposed solutions. Going to test each and then study the method behind it. – Stephan Du Toit Aug 15 '19 at 08:00
@StephanDuToit think also of the non-capturing parts to describe the context. Good luck :) – Wolf Aug 15 '19 at 08:04

score 0 · Answer 4 · answered Aug 15 '19 at 07:47

You can try this, assuming the HEX code is always preceded by UnReg

re.findall(r'UnReg\s+([\dA-F]{16})',';'.join(myString))
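If a per-line re.search fits the existing code better than one re.findall over the joined list, the same pattern could be used roughly like this. The helper extract_serial and the constant UNREG_HEX are hypothetical names introduced for the sketch.

```python
import re

# Same pattern as above: the code must directly follow "UnReg" and some whitespace.
UNREG_HEX = re.compile(r'UnReg\s+([\dA-F]{16})')

def extract_serial(line):
    """Return the 16-character code following 'UnReg', or None if there is none."""
    match = UNREG_HEX.search(line)
    return match.group(1) if match else None

print(extract_serial('  pon-3-1    |    UnReg 5A594F4380661123    1234567890    Active'))
print(extract_serial('sw-frombananaramatoyourmama-01'))
```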

Sarath Sadasivan Pillai


Thanks. Seems to work great. I prefer the re.search method due to the existing structure of function within which the regex expression resides. However, if I don't come right I will use this snippet. – Stephan Du Toit Aug 15 '19 at 08:02

score 0 · Answer 5 · answered Aug 15 '19 at 07:48

Use re.findall to avoid the for loop. I'd include UnReg in the pattern (if it is present in your real data) so that the regex doesn't pick up other 16-character pieces of text.

>>> import re
>>> newArr = re.findall(r'UnReg\s(.{16})', ' '.join(myString))
>>> print(newArr)
['5A594F4380661123', '5A594F43805FA456', '4244434D73B24789', '5A594F43805FB000']

help-ukraine-now

Ah I see what you mean. Makes sense. Ok let me test and get back to you. – Stephan Du Toit Aug 15 '19 at 08:09
@StephanDuToit did it work btw? – help-ukraine-now Aug 16 '19 at 18:50
Sorry for the late response. Yes, it did to an extent. Once I started using very large lists I picked up some small issues like incorrect matching of certain types of numbers. But that's merely because your code was based on the information I provided. I should have provided more detail. Anyway, I ended up using a slightly modified version of @triplee's code: (?<![0-9A-Z])([0-9A-Z]{16})(?![0-9A-Z]) – Stephan Du Toit Aug 18 '19 at 11:10
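For reference, a small sketch of how that final pattern behaves. The sample strings are made up for illustration. One thing worth noticing is that [0-9A-Z] accepts any uppercase letter, so it is broader than the hex-only class [0-9A-F].

```python
import re

# The pattern quoted in the comment above, with lookarounds on both sides.
FINAL = re.compile(r'(?<![0-9A-Z])([0-9A-Z]{16})(?![0-9A-Z])')

# Hypothetical inputs, not taken from the real data.
samples = [
    'UnReg 5A594F4380661123 Active',      # 16 characters: match
    'UnReg 5A594F4380661123ABCD Active',  # 20 characters: no match
    'UnReg ZZZZ4F4380661123 Active',      # not hex, but still accepted by [0-9A-Z]
]

results = []
for sample in samples:
    match = FINAL.search(sample)
    results.append(match.group(1) if match else None)

print(results)
```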
