Removing a sentence from a text in dataframe column

Question

I want to format a text column in a dataframe in the following way:

In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring starting at the character after the last ".", "?" or "!" and ending at that colon.

Example df:

index    text
1        Trump met with Putin. Learn more here:
2        New movie by Christopher Nolan! Watch here:
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

after formatting should look like this:

index    text
1        Trump met with Putin.
2        New movie by Christopher Nolan!
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
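
The rule described above can be sketched in plain Python, without pandas or regex, as a reference for what any solution should produce. The function name `drop_trailing_colon_sentence` is hypothetical, and keeping a text that has no earlier sentence terminator is an assumption, since the question does not cover that case.

```python
def drop_trailing_colon_sentence(text: str) -> str:
    """Remove the last sentence if the text ends with a colon."""
    stripped = text.rstrip()
    if not stripped.endswith(":"):
        return text  # rows that do not end in ":" are left untouched
    # Position of the last sentence terminator before the trailing colon.
    cut = max(stripped.rfind(ch) for ch in ".?!")
    if cut == -1:
        return text  # assumption: no earlier sentence, so keep the text as-is
    return stripped[: cut + 1]


examples = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
]
results = [drop_trailing_colon_sentence(t) for t in examples]
```

On a dataframe this could be applied with something like `df['text'].map(drop_trailing_colon_sentence)`.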

I updated the question to not only consider the case where the sentence ends on a full period — Jan Marczak, Feb 12 '22 at 20:44
regex is brittle especially when you have no control over how the text is generated. You will quickly find your regex case set expanding if you want to parse all your rows successfully. — cs95, Feb 12 '22 at 20:44

cs95 · Answer 1 · 2022-02-12T20:44:24.440

Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:

from nltk.tokenize import sent_tokenize

# Split each text into sentences, drop the ones ending in ':', and re-join the rest
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))

index
1                                Trump met with Putin.
2                      New movie by Christopher Nolan!
3    Campers: Get ready to stop COVID-19 in its tra...
4    London was building a bigger rival to the Eiff...
Name: text, dtype: object

You might have to handle NaNs with a preceding fillna('') call if your column contains any.
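
A minimal sketch of the NaN handling mentioned above. To keep it runnable without NLTK's tokenizer data, `split_sentences` is a hypothetical regex-based stand-in for `sent_tokenize`; it is not what NLTK does internally. The point is only that `fillna('')` comes before the first `map`, since a tokenizer that expects a string would otherwise receive a float NaN.

```python
import re

import pandas as pd


def split_sentences(text):
    # Stand-in for nltk.tokenize.sent_tokenize (assumption, for illustration only):
    # split on whitespace that follows a sentence terminator.
    return re.split(r"(?<=[.!?])\s+", text)


s = pd.Series(["Trump met with Putin. Learn more here:", None], name="text")

cleaned = (
    s.fillna("")           # replace missing values with empty strings first
     .map(split_sentences)
     .map(lambda sents: " ".join([x for x in sents if not x.endswith(":")]))
)
```

With this input the missing row comes out as an empty string rather than raising an error.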

In list form the output looks like this:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']

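Note that NLTK needs to be pip-installed, and sent_tokenize also needs the Punkt tokenizer data, which is downloaded separately (e.g. with nltk.download('punkt'); newer NLTK releases may ask for 'punkt_tab' instead).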

edited Feb 12 '22 at 20:44

answered Feb 12 '22 at 20:32

cs95


your output doesn't show all which kind of takes the fun away – Feb 12 '22 at 20:34
@doesitmatter oh yeah.... something to do with my terminal's display options. There, added a listified version of the output ;-) – cs95 Feb 12 '22 at 20:36
what python version are you on that you are using join([ instead of join( – Feb 12 '22 at 20:38
@doesitmatter it doesn't matter. [It's actually slightly faster to use join with a list comprehension than a generator](https://stackoverflow.com/a/9061024/4909087) – cs95 Feb 12 '22 at 20:40
i prefer join( over that ugly join([ anytime sorry – Feb 12 '22 at 20:41
To each their own. Besides I'd leave that up to the OP as to what to use, this is my personally preferred method. – cs95 Feb 12 '22 at 20:42
I am having some troubles with NLTK library so I can't really test it properly. Thank you very much anyway – Jan Marczak Feb 12 '22 at 20:52
@JanMarczak what are the "some troubles"? – cs95 Feb 12 '22 at 21:16

score 0 · Accepted Answer · 2022-02-13T17:42:10.387

Let's do it with regex, to have more problems:

df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)

Now df.text.tolist() is:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.',
 "I don't want to do a national lockdown again. If #coronavirus continues to 'progress' in the UK."]
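
The same pattern can be checked without pandas using the standard-library `re` module; `Series.str.replace(..., regex=True)` should behave the same way on each element. This is only a sketch: the list `texts` is sample data taken from this thread, plus one extra string with no sentence terminator to show that such a text is left unchanged.

```python
import re

PATTERN = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")

texts = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
    "I don't want to do a national lockdown again. "
    "If #coronavirus continues to 'progress' in the UK. Read more here:",
    "Read more here:",  # no ., ! or ? before the colon: nothing to anchor on
]

# Remove the trailing "...:" sentence wherever the pattern matches.
cleaned = [PATTERN.sub("", t) for t in texts]
```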

lookbehind ftw (a fixed-width one here, a single character, which is what Python's re module requires)

On regex:

(?<=[.!?])

This is a "lookbehind". It doesn't consume any characters; it only asserts that whatever follows it in the pattern must be immediately preceded by something. That something is the character class [.!?] here, which means either . or ! or ?.
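
A small sketch of what "asserts but does not match" means, using the standard-library `re` module. The sample strings are made up for illustration. The last part shows that Python's `re` rejects a lookbehind whose width is not fixed.

```python
import re

# The "!" must sit right before the match, but it is not part of the match.
m = re.search(r"(?<=[.!?])\s*Watch", "Nolan! Watch here:")
matched_text = m.group()   # text consumed by the match
match_start = m.start()    # index just after the "!"

# Preceded by "," instead of one of . ! ? -> the assertion fails, no match.
no_match = re.search(r"(?<=[.!?])\s*Watch", "Nolan, Watch here:")

# Python's re module only accepts fixed-width lookbehinds.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```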

[^.!?]*

Again we have a character class in square brackets, but now with a caret ^ as its first character, which negates it: we want any character except those listed. So any character other than . or ! or ? will do.

The * after the character class is the 0-or-more quantifier, meaning the "any character but .?!" can repeat as many times as possible (including zero times).
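
To illustrate the negated class with the * quantifier on its own, here is a short sketch with made-up sample strings: the match runs as far as it can and stops right before the first ., ! or ?.

```python
import re

# Greedy run of "anything but . ! ?", anchored at the start by re.match.
run = re.match(r"[^.!?]*", " Learn more here: and then. More").group()

# Zero repetitions are allowed, so this still matches (the empty string).
empty_run = re.match(r"[^.!?]*", ". starts with punctuation").group()
```

Note that a colon is an ordinary character for this class, so the run above passes over it.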

So far, the match starts right after a . or ? or !, and continues with a stream of characters that are "anything but .?!". This ensures the match starts after the last sentence terminator, because the "anything but" part can't cross another .?! on its way to the end.

:\s*$

With :, we say that the 0-or-more stream above must end at a : (if there is none, no replacement happens, as desired).

The \s* after it allows for optional whitespace after the : (again 0 or more due to *; \s matches any whitespace character, not only a space). You can remove it if you are certain there will never be whitespace after the :.

Lastly we have $: this matches at the end of the string (a position, not a character). So we are sure that the string ends with : followed optionally by some whitespace.
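
The tail of the pattern can be tried in isolation. In this sketch (sample strings made up for illustration), `:\s*$` only matches when the colon is the last non-whitespace character of the string.

```python
import re

tail = re.compile(r":\s*$")

ends_with_colon = bool(tail.search("Watch here:"))
colon_then_spaces = bool(tail.search("Watch here:  \n"))   # trailing whitespace is allowed
colon_in_middle = bool(tail.search("Campers: Get ready!"))  # colon is not at the end
```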

edited Feb 13 '22 at 17:42

answered Feb 12 '22 at 20:33

I updated the question to keep in mind that the sentence doesn't need to end on a full period all the time. This works for periods only – Jan Marczak Feb 12 '22 at 20:43
i put \s* to consider possible spaces after : at the very end – Feb 12 '22 at 20:49
This works as intended, thank you. – Jan Marczak Feb 12 '22 at 20:53
After some testing it seems like this method doesn't delete 1 sentence but multiple sentences. String like this: "I don't want to do a national lockdown again. If #coronavirus continues to 'progress' in the UK. Read more here:", is replaced by: "I don't want to do a national lockdown again.", whereas it should keep the 2nd sentence – Jan Marczak Feb 13 '22 at 17:01
@JanMarczak ok I made an edit. Now it will replace after the last punctuation character which is . or ? or !. maybe it works – Feb 13 '22 at 17:10
Seems to work for now, thanks. Would you mind explaining a bit the regex you wrote? – Jan Marczak Feb 13 '22 at 17:20
@JanMarczak Sure. I tried to explain. By the way, I had put a link to regex101 from the beginning and that's why I didn't detail the regex, as that website has some explanations. I hope together with those it is somewhat more clear now what the regex is doing – Feb 13 '22 at 17:43
Yes, that helps. Thank you very much – Jan Marczak Feb 15 '22 at 15:26
