Regex - Python - Capture everything between a word

Question

Is it possible to capture only the sentences that contain a keyword (here, a time)? Example:

`I want to capture this part (time) and this part. Not this sentence though because it does not contain our keyword. But also this sentence because it contains (time)`

-Note 1: The time is not in parentheses originally; it stands for a clock time, e.g. 12:45, 10:45, etc.

-Note 2: I am looking for a regex that captures every sentence in which this keyword exists. If findall does not find the keyword in a sentence, it should continue to the next sentence.

-Note 3: In the end we should have the set of sentences that contain the keyword.

I have added some additional information: the results of testing the code you have provided against a sample text.

import re

text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)

capture_1 gives me this:

['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']

capture_2 gives me this:

[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]

I want only the following sentences, though: ['The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.']

Is each sentence a line of text? Or do you want the regex to find the endings of sentences, too? — Hellmar Becker, Feb 28 '16 at 15:46
I want the regex to find where the sentence starts and where it stops actually. But only to include the sentences that have the keyword somewhere in them — D1W1TR15, Feb 28 '16 at 15:47
You could do it without regex, like: `if 'keyword' in sentences: #append to list; else: continue`. — Quinn, Feb 28 '16 at 16:12
I actually managed to get part of it working, but not all of it. I can capture everything from the beginning of the sentence with (\..*) (keyword), but I find it hard to capture the rest of the sentence. — D1W1TR15, Feb 28 '16 at 16:13
No, I do not need to capture just the word. I need the whole sentence. I need to do it with regex because I want to capture sentences from big raw texts. — D1W1TR15, Feb 28 '16 at 16:14
*Mr. Smith loves his Ford 2.0 tdci.* You need a [natural language parser](http://stackoverflow.com/a/4576110/5527985). Then you can check sentences for matching a keyword. — bobble bubble, Feb 28 '16 at 16:35
Why make it so complicated? All I need is a regex that captures every sentence that contains a given word. Thank you for the link, but I think nltk is a different and more complicated approach. — D1W1TR15, Feb 28 '16 at 16:41
In the example capture_1 and capture_2, what is the keyword? The time? Better to make it clear. — Quinn, Feb 28 '16 at 16:57
@DimitrisTsoukalas Because it is complicated. [In general you can't rely on one single Great White infallible regex](http://stackoverflow.com/a/25735848). Anyway there's tons of regexs if you google for `regex split text into sentences`. If any is sufficient for you, use it for splitting and match the keyword in each. — bobble bubble, Feb 28 '16 at 16:58
@ccf thank for letting me know and sorry for the trouble! I changed that. — D1W1TR15, Feb 28 '16 at 17:03
@bobble bubble, I think there must be a way somehow, because look at these magnificent regexes down below. They almost solve my question. — D1W1TR15, Feb 28 '16 at 17:03
Much better explanation. check out solution below. it will match anything between given keywords. positive and negative lookaround will be your friends. — Saleem, Feb 28 '16 at 17:20
[See this demo at ideone](http://ideone.com/i7ClFK). The problem is, that it will only cover a minimal part of possible cases and the regex is expensive. — bobble bubble, Feb 28 '16 at 18:56
@bobblebubble Damn, man, this is a masterpiece ;). Bravo and many thanks! It is really detailed, and it even handles an issue that I did not mention: in most cases the hour is followed by A.M., as in 15:32 A.M. Many of the regexes I tried stopped at 15:32 A. because they treated the dot after A as the end of the sentence. Indeed, regex is expensive! — D1W1TR15, Feb 29 '16 at 15:54
@D1W1TR15 Great that helps you, but still leaves tons of cases that could occur : ) you're welcome. — bobble bubble, Feb 29 '16 at 16:02
@bobblebubble Can I ask why you used Mrs in the parentheses? Is it to avoid stopping at a potential "Mrs."? — D1W1TR15, Mar 07 '16 at 11:21
@D1W1TR15 Just to skip period at `Mrs.` abbreviation. The `Mr.` is already covered by `(?<!\b[A-Z][a-z])`. — bobble bubble, Mar 07 '16 at 12:11
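
The split-then-filter approach suggested in the comments above can be sketched as follows. This is only a minimal sketch: the helper name `sentences_with_time` and both patterns are assumptions made for illustration, and this naive splitter will still break on abbreviations such as `Mr.` or `A.M.`.

```python
import re

# Assumed for illustration: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")
# Assumed keyword shape: a clock time such as 9:05 or 23:45.
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")


def sentences_with_time(text):
    """Return the sentences of `text` that contain a clock time."""
    return [s for s in SENTENCE_BREAK.split(text) if TIME.search(s)]


sample = "He was there. He escaped at 23:58 from the balcony. Time of death was 00:14. It was over"
print(sentences_with_time(sample))
```

Splitting first keeps the keyword pattern simple, since it only has to recognise the time and not the sentence boundaries.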

Quinn · Accepted Answer · 2016-02-29T05:09:42.580

UPDATE2 Just figured out a pattern. The demo is HERE. Hope it helps:

(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])

Explanation:

(?:^|\s+)       Non-capturing group,
                match start of string, or 1 or more whitespace characters
(               capturing group starts
[^.!?]*         0 or more times of characters except . ! or ?
(?:\d\d:\d\d)   Non-capturing group,
                match dd:dd time format
[^.!?]*         0 or more times of characters except . ! or ?
[.!?]           sentence ends with . ! or ?
)               capturing group ends
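
The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`. The sketch below assumes nothing beyond the pattern above; the name `TIME_SENTENCE` and the sample string are made up for illustration.

```python
import re

TIME_SENTENCE = re.compile(r"""
    (?:^|\s+)        # start of string, or whitespace before the sentence
    (                # capturing group starts
      [^.!?]*        # anything that is not a sentence terminator
      \d\d:\d\d      # the keyword: a time in dd:dd format
      [^.!?]*        # the rest of the sentence
      [.!?]          # the terminator that ends the sentence
    )                # capturing group ends
""", re.VERBOSE)

sample = "We met at 09:30! Nobody came. Was it 10:15?"
found = TIME_SENTENCE.findall(sample)
print(found)
```

In verbose mode, whitespace and `#` comments outside character classes are ignored, so the pattern behaves the same as the one-line version.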

import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
# findall returns only the captured group, i.e. each matching sentence
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))

Output:

The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.

edited Feb 29 '16 at 05:09

answered Feb 28 '16 at 16:24


hmmm nope. Tried that. I have updated the question above. It seems that it gives me another pattern from that I want. – D1W1TR15 Feb 28 '16 at 16:46
@DimitrisTsoukalas: Please try this updated pattern. – Quinn Feb 29 '16 at 05:10
Perfect. I need to understand it, though. I managed to understand what you are doing in the first part and in the middle part. What about the last part, [^.!?]*[.!?]? I saw the demo, and many thanks for that, but I still did not understand the last part. Many thanks! – D1W1TR15 Feb 29 '16 at 15:14
`[^.!?]*` matches anything but sentence terminating mark. `[.!?]` is the sentence terminating mark. We have to make sure the sentence terminating mark only appears in the end. – Quinn Feb 29 '16 at 15:57

bunji · Answer 2 · 2016-02-28T17:53:12.623

(?:\.|\A)([^.]*\d*:\d*[^.]*)\.

This captures every string between two periods, or between the beginning of the string and a period (so you can capture the first sentence too). If your pattern uses the . wildcard and your string contains line breaks, you will want the re.DOTALL flag so that . also matches newlines; the negated class [^.] used here already matches them.

For example:

re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text, flags=re.DOTALL)

Note that this will get all your sentences that contain your keyword at once so there is no need to go through sentence by sentence.
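
One caveat worth checking: because each match consumes its closing period, `findall` cannot reuse that period as the opening delimiter of the next sentence, so the second of two adjacent matching sentences can be skipped. The sketch below compares the pattern above with a variant that only looks ahead for the period; the lookahead variant is an assumption added for illustration, not part of the original answer.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Pattern from the answer: the closing period is consumed by each match.
consuming = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text)

# Variant (an assumption, not from the answer): only look ahead for the period,
# so it stays available as the opening delimiter of the next sentence.
lookahead = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text)

print(len(consuming), len(lookahead))
print([s.strip() for s in lookahead])
```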

EDIT:

I have changed the regex above to capture every sentence that contains your keyword EXCEPT when the keyword is immediately adjacent to a period.

Alternatively, here is another technique using a list comprehension:

[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]

which for your example returns:

[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']

Note that this will still run into problems if your text contains periods that are not sentence-final. For example, "Mr. Magoo ate beans and toast at 12:34" will capture " Magoo ate beans and toast at 12:34" and will miss the "Mr".

If you run into this problem I would recommend asking it as a separate question.
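
For a handful of known titles, one possible workaround is to refuse to split right after them. This is a rough sketch under stated assumptions: the title list (`Mr.`, `Mrs.`, `Dr.`) and the names `SENTENCE_BREAK` and `TIME` are invented for illustration, and many other abbreviations (for example `A.M.`) are still not handled.

```python
import re

# Hypothetical splitter: break at whitespace that follows ., ! or ?, unless
# the period belongs to one of a few known titles (an assumed, partial list).
SENTENCE_BREAK = re.compile(r"(?<!\bMr\.)(?<!\bMrs\.)(?<!\bDr\.)(?<=[.!?])\s+")
TIME = re.compile(r"\d\d:\d\d")

text = "Mr. Magoo ate beans and toast at 12:34. He left. Mrs. Smith arrived at 13:05."
matches = [s for s in SENTENCE_BREAK.split(text) if TIME.search(s)]
print(matches)
```

Each title needs its own lookbehind because Python's `re` module requires lookbehinds to have a fixed width.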

I have updated my script to make it more clear. I am not sure that this code provides me with what I want though :/... hmmm — D1W1TR15, Feb 28 '16 at 16:46
Yeah it works fine and many thanks about your effort. But can I ask what is exactly the purpose of ":" and "|" at the first part (?:\.|\A)? I cannot understand the meaning of | = or logic there and neither the ":" one. Thank you in advance. — D1W1TR15, Feb 29 '16 at 15:00
(?:\.|\A) is a [non-capturing group](http://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group) that signifies either a period `\.` or the beginning of the string `\A`. The `:` is part of the non-capturing group syntax and the `|` is an "or" operator just like you thought. The intention is to let you capture the first sentence since it will not fall between two periods like the rest of the sentences (hence the `\A`) — bunji, Mar 01 '16 at 03:23

Saleem · Answer 3 · 2016-02-28T19:50:40.593

Well, you can achieve this easily with regex, using a positive lookbehind and lookahead.

Here is an example:

import re


def replace_keyword(start, end, data):
    if start == "":
        start = "^"

    if end == "":
        end = "$"

    rx = "(?<={0}).*(?={1})".format(start, end)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        return match.group() + end
    else:
        return data


data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

# An empty start string means: search from the beginning of the string.
start = ""

# An empty end string would mean: search until the end of the string.
# Here the search stops at the keyword instead.
end = "00:14"

data = replace_keyword(start, end, data)

print(data)

After running the above code, data will contain the text:

He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14

Hopefully, this does what you are expecting.
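
A compact variant of the same idea is sketched below, assuming the keywords should be treated as literal text rather than as regex fragments. The function name `text_between` and its return convention are hypothetical, and it uses a capturing group instead of lookarounds.

```python
import re


def text_between(start, end, data):
    """Return the text between literal `start` and `end`, or None if absent.

    An empty `start` anchors at the beginning of `data`; an empty `end`
    anchors at its end.
    """
    left = re.escape(start) if start else r"\A"
    right = re.escape(end) if end else r"\Z"
    match = re.search(left + r"(.*)" + right, data, re.DOTALL)
    return match.group(1) if match else None


print(text_between("at ", " with", "destroyed the building at 23:45 with a device"))
print(text_between("", "00:14", "Time of death was 00:14. The police came"))
```

Escaping the keywords with `re.escape` keeps characters such as `.` or `(` in a keyword from being read as regex syntax.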
