Extract exact words or set of characters using Regex in Python

Question

Suppose I have a list like this.

List = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209'].

I want to search and return a match where 'PO' is there. Technically I should have RUC_PO-345 as my output, but even RUC_POLO-209 is getting returned as an output along with RUC_PO-345.

And what's your criterion for matching? What have you tried so far? — Shubham Sharma, Apr 28 '20 at 06:18
I am getting the word 'caterpillar' as well for the search 'cat' , 'doggy' and 'doggo' for the search 'dog'. I just want the words 'cat', 'dog', etc and not anything else apart from those to be matched. — Ricky Rick, Apr 28 '20 at 06:21
Does this answer your question? [whole word match in javascript](https://stackoverflow.com/questions/2232934/whole-word-match-in-javascript) — Nick, Apr 28 '20 at 06:22
You need to use word breaks (`\b`) in the regex around the word. That will prevent `cat` matching `caterpillar` — Nick, Apr 28 '20 at 06:23
This is just simply strange, you are searching for the word "cat" within a list and want to return "cat" if it's found? Why `Regex`? Why not simply check if your word is [`in`](https://www.w3schools.com/python/ref_keyword_in.asp) the list? I think you are using the wrong approach here... — JvdV, Apr 28 '20 at 06:29
can you give an example using break? I mean how the expression should be? @Nick — Ricky Rick, Apr 28 '20 at 06:30
@JvdV No, if I search for the word 'cat' from the list, the word 'caterpillar' is also getting involved because the 'cat' portion from the word 'caterpillar' is getting matched as well. — Ricky Rick, Apr 28 '20 at 06:31
@Nick Your solution worked. What if there are noises like integers, special characters involved as well. Will I get the exact match 'cat' with your solution? Just a quick check. — Ricky Rick, Apr 28 '20 at 06:35
@Rick it won't work if you search for `cat` and there is `cat1` in your string, as `cat1` does not have a word boundary between `cat` and `1` as `1` is considered a word character as well. But if that is the case, you need to update your question as none of the answers you have so far will deal with that situation. — Nick, Apr 28 '20 at 06:46
It's good practice in regex to anchor your pattern in some way (i.e. tie the beginning and end of the pattern to some unique aspect of the string you want to match). You can also capture part of a match by surrounding the target with `()` and calling `.group(1)` on the match object, which returns only the captured subset for better regex control. — jameshollisandrew, Apr 30 '20 at 09:59
For example, you could be very strict and use a pattern `'^\w{3}_PO-\d{3}$'` if you knew this string started and finished the line (like in a list of codes). The `^` and `$` anchor the pattern to the beginning and end of a line. If we wanted to return only the 'PO-345' part of this string, we could use a capture group, `'^\w{3}_(PO-\d{3})$'`, and access the capture by calling `.group(1)` on the match object. Then 'RUC_PO-345' is matched and 'PO-345' is returned. (This might not be the case here, but I wanted to comment on anchoring and capturing.) — jameshollisandrew, Apr 30 '20 at 10:07
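
A minimal sketch of the anchoring and capturing idea from the last two comments, run against the list from the question. The variable names (`codes`, `captured`) are only illustrative.

```python
import re

codes = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# Strict, anchored pattern: three word characters, "_", then the
# captured part "PO-" followed by exactly three digits, then end of string.
pattern = re.compile(r'^\w{3}_(PO-\d{3})$')

captured = []
for code in codes:
    m = pattern.search(code)
    if m is not None:
        # m.group(0) is the whole match, m.group(1) only the capture group
        captured.append((code, m.group(1)))

print(captured)  # [('RUC_PO-345', 'PO-345')]
```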

JvdV · Answer 1 · 2020-04-28T08:00:45.793

Before updated question:

As per my comment, I think you are using the wrong approach. To me it seems you can simply use in:

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: yes

words = ['cats', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: no

After updated question:

Now, if your sample data does not actually reflect your needs and you are instead interested in finding a substring within a list element, you could try:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'(?<=_){srch}(?=-)')
print(list(filter(r.findall, words)))

Or using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'^.*(?<=_){srch}(?=-).*$')
print(list(filter(r.match, words)))

This will return a list of the items that follow the pattern (in this case just ['RUC_PO-345']). I used the above pattern to make sure your search value is not matched at the start of the string, but only after an underscore and followed by a -.

Now if you have a list of products you want to find, consider the below:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'(?<=_)({"|".join(srch)})(?=-)')
print(list(filter(r.findall, words)))

Or again using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'^.*(?<=_)({"|".join(srch)})(?=-).*$')
print(list(filter(r.match, words)))

Both would return: ['MX_QW-765', 'RUC_PO-345']

Note that if you don't have f-strings supported you can also concat your variable into the pattern.
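
For reference, a sketch of the concatenated version for when f-strings are not available. The `re.escape` call is my addition, not part of the patterns above. It only matters if the search term could contain regex metacharacters.

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'

# Plain string concatenation instead of an f-string.
# re.escape (an assumption on top of the answer) keeps characters such as
# "." or "+" in the search term from being read as regex syntax.
r = re.compile(r'(?<=_)' + re.escape(srch) + r'(?=-)')

result = [w for w in words if r.search(w)]
print(result)  # ['RUC_PO-345']
```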

Some random query. Suppose I have a list like this List = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']. If I apply the same logic in this and search only for the string containing 'PO', technically I should have 'RUC_PO-345' as my output, but even 'RUC_POLO-209' is getting returned as an output along with 'RUC_PO-345'. — Ricky Rick, Apr 28 '20 at 06:41
@Rick, for me nothing gets returned, simply because `PO` is **not** in the list. But you are now going from full string matches to substring matches... which one is it? It appears your sample data does not reflect your actual needs. — JvdV, Apr 28 '20 at 06:43
can you help me out in this problem? How to work on sub-string matches like this one? — Ricky Rick, Apr 28 '20 at 06:56
@Rick, I have updated my answer with one of possibly multiple ways of dealing with this situation. If you don't have f-strings you can concat your pattern too. — JvdV, Apr 28 '20 at 07:02

score 1 · Answer 2 · answered Apr 28 '20 at 06:20

Try building a regex alternation using the search terms in the list:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
your_text = 'I like cat, dog, rabbit, antelope, and monkey, but not giraffes'
# Join the search terms into one alternation wrapped in word boundaries
regex = r'\b(?:' + '|'.join(words) + r')\b'
print(regex)
matches = re.findall(regex, your_text)
print(matches)

This prints:

\b(?:cat|caterpillar|monkey|monk|doggy|doggo|dog)\b
['cat', 'dog', 'monkey']

You can clearly see the regex alternation which we built to find all matching keywords.

answered Apr 28 '20 at 06:20

Tim Biegeleisen


`words = ['cat', 'monk', 'dog']` `your_list = ['caterpillar', 'dog', 'doggo', 'cat', 'monkey', 'doggy']` I need to find only 'cat' from your_list, but 'caterpillar' is getting involved as well while searching for the word 'cat'. – Ricky Rick Apr 28 '20 at 06:26
@Rick No, that's not happening, because my regex pattern only searches for `\bcat\b`, which cannot match `caterpillar`. – Tim Biegeleisen Apr 28 '20 at 06:31

jameshollisandrew · Answer 3 · 2020-04-28T08:25:19.887

The pattern:

'_PO[^\w]'

should work with a re.search() or re.findall() call. It will not work with re.match(), because re.match() only matches at the beginning of the string and this pattern does not account for the characters before the underscore.

The pattern reads: match one underscore ('_'), followed by one capital P, followed by one capital O, followed by one character that is not a word character. The special sequence '\w' matches [a-zA-Z0-9_].

'_PO\W'

^ This can be used as a shorter version of the first pattern (credit @JvdV in comments).

'_PO[^A-Za-z]'

This pattern uses a negated character set ("any character that is not a letter"), in case the dash interferes with either of the first two patterns. Note that, unlike the first two, it also accepts a digit or an underscore after 'PO'.

To use this to identify the pattern in a list, you can use a loop:

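import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']  # sample data from the question

for thing in my_list:
    if re.search(r'_PO[^\w]', thing) is not None:
        # do something with the matching item
        print(thing)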

This uses the re.search() call as the condition of the if statement. When re.search() finds no match it returns None, hence the `if re.search(...) is not None` check.
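
To make the re.search() versus re.match() point concrete, here is a small sketch using the shorter `_PO\W` pattern on the list from the question. The variable names are only illustrative.

```python
import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']  # sample data from the question

pattern = re.compile(r'_PO\W')

# search() scans the whole string for the pattern
found_with_search = [s for s in my_list if pattern.search(s)]
# match() only tries at position 0, where none of these strings has "_"
found_with_match = [s for s in my_list if pattern.match(s)]

print(found_with_search)  # ['RUC_PO-345']
print(found_with_match)   # []
```

One caveat: `\W` has to consume a character, so a string that ends in '_PO' would not match. If that can happen in your data, a negative lookahead such as `_PO(?!\w)` is one way around it.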

Hope it helps!

edited Apr 28 '20 at 08:25

answered Apr 28 '20 at 07:48

jameshollisandrew


This could be simplified to `_PO\W`, but should be a fine alternative to lookarounds I suppose =). Upvoted – JvdV Apr 28 '20 at 08:13
Added your suggestion in body with @credit to you. Good suggestion! – jameshollisandrew Apr 28 '20 at 08:21
Please no need to mention me. It was positive criticism and our goal is to provide better and more concise answers together. – JvdV Apr 28 '20 at 08:24
No no it’s a great addition I appreciate it! – jameshollisandrew Apr 28 '20 at 08:25

score 0 · Answer 4 · answered Apr 28 '20 at 06:20

You need to add a $ sign, which signifies the end of the string. You can also add a ^, which signifies the start of the string, so that only cat matches:

 ^cat$
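
A quick sketch of how that anchored pattern behaves. The word list here is hypothetical sample data, not from the question.

```python
import re

words = ['cat', 'caterpillar', 'tomcat', 'cat1', 'dog']  # hypothetical sample list

anchored = re.compile(r'^cat$')

# Only the element that is exactly "cat" survives the filter
exact = [w for w in words if anchored.search(w)]
print(exact)  # ['cat']
```

Note that `$` also matches just before a trailing newline, so 'cat\n' would pass. If that matters, `re.fullmatch(r'cat', w)` is stricter. This only helps when each list element is the whole word. It does not find 'PO' inside 'RUC_PO-345'.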

score 0 · Answer 5 · answered Apr 28 '20 at 06:59

We can try matching one of the three exact words 'cat','dog','monk' in our regex string.

Our regex string is going to be "\b(?:cat|dog|monk)\b"

\b defines a word boundary. We use \b so that we search for whole words only (this is the exact problem you were facing). With it, the pattern does not match tomcat or caterpillar, only cat.

Next, (?:...) is called a non-capturing group: it groups the alternatives without capturing them.

Now we need to match either one of cat or dog or monk. So this is expressed as cat|dog|monk. In python 3 this would be:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
regex = r"\b(?:cat|dog|monk)\b"

r=re.compile(regex)
matched = list(filter(r.match, words))

print(matched)

To implement matching regex through an iterable list, we use filter function as mentioned in a Stackoverflow answer here

You can find the runnable Python code here

NOTE: Finally, regex101 is a great online tool to try out different regex strings and get their explanation in real-time. The explanation for our regex string is here

Omnifarious · Answer 6 · 2020-05-01T05:02:27.927

You should be using a regular expression (import re), and this is the regular expression you should be using: r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])'.

I previously recommended the \b special sequence, but it turns out the '_' is considered part of a word, and that isn't the case for you, so it wouldn't work.

This leaves you with the somewhat more complex negative look behind and negative lookahead assertions, which is what (?<!... and (?!... are, respectively. To understand how those work, read the documentation for Python regular expressions.
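
A short sketch comparing the two approaches on the list from the question, to show why `\b` falls short here and the lookarounds do not. The variable names are only illustrative.

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# "_" counts as a word character, so there is no word boundary
# between "_" and "P" and this pattern finds nothing.
boundary = re.compile(r'\bPO\b')

# Negative lookbehind / lookahead: "PO" must not be directly preceded
# or followed by a letter or digit ("_" and "-" are both fine).
lookaround = re.compile(r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])')

with_boundary = [w for w in words if boundary.search(w)]
with_lookaround = [w for w in words if lookaround.search(w)]

print(with_boundary)    # []
print(with_lookaround)  # ['RUC_PO-345']
```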

edited May 01 '20 at 05:02

answered Apr 28 '20 at 07:06

Omnifarious


Be aware that `_` (underscore) is considered a word character (hence it's within the `\w` or `[a-zA-Z0-9_]` range), which is exactly the character in front of the substring OP is interested in... Your proposed solution will not work. Try it [here](https://regex101.com/r/iT1oKD/1) for example. – JvdV Apr 28 '20 at 07:18
`r'_PO\b'` can match if you want to use the word boundary special sequence. – jameshollisandrew Apr 28 '20 at 07:56
@JvdV - Grr. *sigh* Well, then all that's left is negative look behind and negative look ahead assertions. I'll fix my answer. – Omnifarious May 01 '20 at 04:58
