regex to extract a set number of words around a matched word

Question

I was looking around for a way to grab the words around a found match, but the solutions I found were much too complicated for my case. All I need is a regex to grab, let's say, 10 words before and after a matched word. Would anybody be able to help me set up a pattern to do that?

For example, let's take the sentence (won't make sense):

    sentence = "The hairy yellow, stinkin' dog, sat round' the c4mpfir3 and ate the brown/yellow smore's that the kids(*adults) were makin."

and let's say we want to match 3 words before and after smore's (already cleaned to match). The output would be:

   "ate the brown/yellow smore's that the were"

Now let's take the example of wanting one word before and after stinkin':

   "yellow, stinkin' dog"

Another example. "sat":

   "yellow, stinkin' dog, round' the and"

Let's make a new sentence now:

   sentence = "If the problem is still there after 30 minutes. Give up"

If I were trying to match the word there and take 2 words before and after, the output would be:

   "is still there after minutes"

I know it's not 10, but I think you get the example? If not, let me know and I will provide more. As I made this, I realized how much more I want than I originally thought. I'm rather new to regex, but I'm going to give the pattern a shot.

    ('[a-zA-Z\'.,/]{3}(word_to_match)[a-zA-Z\'.,/]{3}')

Thanks

In your particular world, what do you consider to be a word? I assume that `cat` is a word. What about `jksldfj`? What about `123`? What about `:-)`? What about `foo_bar`? What about `don't`? What about `貓`? What about `-'-`? — Mark Byers, Jun 14 '12 at 20:02
I've cleaned the "word" to match in regex, now I just need to be able to grab the words around it. Is that what you mean? — Brandon Lile, Jun 14 '12 at 20:08
he means what your separator(s) will be, what ligatures you will match etcetera — Hedde van der Heide, Jun 14 '12 at 20:14
Tell us what **you** think a word is. Personally I think that "jksldfj" isn't a word, but maybe you live on an alien planet where that sequence of letters happens to be a word. Or maybe you think that every sequence of consecutive letters is a word. Or maybe you think that words consist of one or more letter, numbers and underscores. Your definition of "word" is most likely different from mine, so you're going to have to tell us what your definition is. — Mark Byers, Jun 14 '12 at 20:15
any word you'd find in the dictionary. A word to me really isn't anything that would have a number, underscore, or any of "those" characters. So if I really had to think about it, it would be a sequence of consecutive letters — Brandon Lile, Jun 14 '12 at 20:19
You really *do* have to think about it, if you want a pattern to match them :) — Mitya, Jun 14 '12 at 20:23
@user1443094: OK, at least it now appears that you've understood my question. However you are very far from having clearly defined what a word is. I suggest you think about it a bit more then update your question with a clear description of what you mean by word, and give some good example inputs and show what output you want for those examples. Ten examples might be enough if you choose them carefully and cover the interesting cases. — Mark Byers, Jun 14 '12 at 20:30

score 1 · Answer 1 · answered Jun 14 '12 at 20:54


This regex will get you started:

    ((?:\w*\s*){2})\s*word3\s*((?:\s*\w*){2})

Group 1 will hold the words before your target, and group 2 will hold the words that come after it.

In the example I chose to capture 2 words, but you can adjust this at will.
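As a rough sketch of how the two groups come back in Python (the sample text is made up for illustration, and `word3` is just the placeholder target from the pattern):

```python
import re

# Hypothetical sample text; "word3" stands in for the real target word.
text = "word1 word2 word3 word4 word5"
pattern = r'((?:\w*\s*){2})\s*word3\s*((?:\s*\w*){2})'

m = re.search(pattern, text)
before = m.group(1).split()  # words captured ahead of the target
after = m.group(2).split()   # words captured behind the target
print(before, after)
```

On this input the groups split into `['word1', 'word2']` and `['word4', 'word5']`. Text with punctuation between the words is a different story, since `\s` only skips whitespace.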

Let me know how it goes and if it works on your input.

You can improve your question by reading this short advice http://worksol.be/regex.html


answered Jun 14 '12 at 20:54 by buckley

Why do you need `\s*` before and after `word3` when it is already in the before and after groups? Also, why `\w*` and not `\w+`? – Ωmega Jun 14 '12 at 21:23
I wouldn't have downvoted, it's a good start. But it will only work if there's no punctuation. – alexis Jun 14 '12 at 21:32
@alexis I'm still assuming that the OP is willing to understand what we give him. Tweaking the regex to cater for every edge case in advance is a waste of time. If the OP has trouble tweaking it (s)he can ask again. – buckley Jun 14 '12 at 21:35
That's no excuse for giving him broken code. Especially since it's so easy to use \W instead of \s (hint). – alexis Jun 14 '12 at 21:36
@alexis It's an excuse to make the OP think for himself. Don't see why \W is such an improvement over \s. Can you clarify? – buckley Jun 14 '12 at 21:41
Because it will match everything that's not a letter, not just whitespace. You can use it to count out words in a real sentence and not get stuck on punctuation. – alexis Jun 14 '12 at 21:42
@alexis I would expect the OP to give a counterexample then. It's a learning process for her as well, so she can see how a regex is altered to cater for specific cases. Your suggestion is a good one and the OP will have learned from this in the comments. Also, I'm sure that every possible regex we can think of will have cases that don't meet the requirements of the OP. So OP, if you want it tweaked and don't know how, let us know with a counterexample. – buckley Jun 14 '12 at 21:48
just for the record I didn't down vote any of the examples, so why downvote the question? – Brandon Lile Jun 14 '12 at 22:17
@user1443094 I don't see a reason to downvote as it's a good question – buckley Jun 15 '12 at 09:57
@user144, your question is fine, a couple of the commenters just got carried away with their requests. Welcome to stackoverflow, and don't let them discourage you (or push you around) – alexis Jun 15 '12 at 19:19

score 1 · Answer 2 · answered Jun 14 '12 at 22:03


Here's a likely definition of "word": A string of non-space characters. Here's another: A string of letters and digits, but no punctuation. Python has convenient shortcuts for both.

\w is any "word" character with the second meaning (letters and digits, plus the underscore), and \W is any other character. Use it like this:

    import re
    m = re.search(r'((\w+\W+){0,4}grab(\W+\w+){0,4})', sentence)
    print(m.group(1))  # the whole match: up to four words on each side of "grab"

If you prefer the first definition, just use \S (any character that's not a space) and \s (any space character):

    re.search(r'((\S+\s+){0,4}grab(\s+\S+){0,4})', sentence)

You'll notice I'm matching zero to four words before and after. That way if your word is third in the sentence, you'll still get a match. (Searches are "greedy" so you'll always get four if it's possible).
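To make the word count and the target adjustable, the pattern can be built inside a small helper. This is only a sketch: the name `words_around` is made up, it uses the first definition (a word is a run of non-space characters), and it assumes the target should match a whole whitespace-delimited token, which is why it adds `(?<!\S)` and `(?!\S)` around it. `re.escape` keeps characters in the target from being read as regex syntax.

```python
import re

def words_around(sentence, target, n):
    """Return target plus up to n whitespace-delimited words per side, or None."""
    # Assumption: target must be a whole token, so guard both ends with
    # "no non-space character here" lookarounds.
    pattern = r'(?:\S+\s+){0,%d}(?<!\S)%s(?!\S)(?:\s+\S+){0,%d}' % (
        n, re.escape(target), n)
    m = re.search(pattern, sentence)
    return m.group(0) if m else None

sentence = ("The hairy yellow, stinkin' dog, sat round' the c4mpfir3 and ate "
            "the brown/yellow smore's that the kids(*adults) were makin.")
sentence2 = "If the problem is still there after 30 minutes. Give up"

print(words_around(sentence, "smore's", 3))
print(words_around(sentence, "stinkin'", 1))
print(words_around(sentence2, "there", 2))
print(words_around(sentence2, "If", 2))
```

Because words are split on whitespace only, punctuation stays attached (`dog,`) and tokens such as `30` or `kids(*adults)` count as words, so the results differ from the hand-cleaned outputs in the question. Dropping those tokens would need a separate filtering step.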

answered Jun 14 '12 at 22:03 by alexis

This should be the answer because it works. Just a comment, in order to use a variable `var` instead of fixed words and ensure matches considering word boundaries `\b` I suggest the following modification to the regex: r'((\w+\W+){0,4}\b'+var+r'\b(\W+\w+){0,4})' – alemol Apr 15 '20 at 19:53
Thanks for the suggestion; there are many other ways to interpolate a word into a regex, and I would favor writing `r'((\w+\W+){0,4}\b%s\b(\W+\w+){0,4})' % var` (or one of the "modern" string interpolation syntaxes Python provides). Also `var` should be `re.escape(var)`, in case it contains characters meaningful to regex syntax. But all that goes beyond the question here... – alexis Apr 16 '20 at 21:44
@alexis I know this is old, but is there any way to make this code work with apostrophes? Right now, if you have the word "don't" before or after, it treats it as 2 words instead of one. – Joe Sep 23 '20 at 12:05
You must mean the first regex. To treat the `'` as a "word" character, replace `\w` (only the lowercase one) with `[\w']` in both places: `m = re.search(r'(([\w']+\W+){0,4}grab(\W+[\w']+){0,4})', sentence)`. That's a character class, and fortunately you can use the `\w` alias inside it. – alexis Sep 23 '20 at 12:59
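A small comparison of the two patterns, as a sketch only; the sample sentence is made up and the window is cut to two words to keep it short:

```python
import re

# Hypothetical sample sentence; "grab" is the target word.
sentence = "we don't grab what isn't ours"

# \w alone stops at the apostrophe, so "don't" counts as two words.
plain = re.search(r"((\w+\W+){0,2}grab(\W+\w+){0,2})", sentence)
# [\w'] treats the apostrophe as part of the word.
with_apostrophe = re.search(r"(([\w']+\W+){0,2}grab(\W+[\w']+){0,2})", sentence)

print(plain.group(1))            # don't grab what isn
print(with_apostrophe.group(1))  # we don't grab what isn't
```

With plain `\w`, `don` and `t` use up both slots before the target and `isn't` is cut at the apostrophe. With `[\w']`, each contraction counts once.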
