Text mining in R: delete first sentence of each document

Question

I have several documents and want to remove the first sentence of each one. I have not found a solution so far.

Here is an example. The structure of the data looks like this:

case_number	text
1	Today is a good day. It is sunny.
2	Today is a bad day. It is rainy.

So the result should look like this:

case_number	text
1	It is sunny.
2	It is rainy.

Here is the example dataset:

case_number <- c(1, 2)

text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")

data <- data.frame(case_number, text)

I suggest looking at some NLP-related libraries in R. They usually have built-in sentence-splitting functionality. Look here: https://stackoverflow.com/questions/18712878/r-break-corpus-into-sentences - Patrick, Aug 10 '23 at 07:11
If your sentences always end with known punctuation marks, this is essentially "remove everything before the punctuation", similar to this one: https://stackoverflow.com/questions/32767164/use-gsub-remove-all-string-before-first-white-space-in-r (i.e. replace the white space in that question with your known punctuation marks) - benson23, Aug 10 '23 at 07:13
Thanks, benson23. You're right. I can solve it by removing everything up to the first punctuation mark. I was overthinking it. That's a very easy way to handle it. - USER12345, Aug 10 '23 at 08:16
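
For readers who want to try the approach from the comments, here is a minimal base R sketch. It assumes every first sentence ends at the first `.`, `!` or `?`; the regular expression is my own illustration and is not taken from the linked question. It would cut too early on abbreviations or decimal numbers such as "avg." or "5.1".

```r
# Example data from the question
data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day. It is sunny.",
           "Today is a bad day. It is rainy.")
)

# Assumption: the first sentence ends at the first '.', '!' or '?'.
# sub() replaces only the first match: everything up to and including
# that punctuation mark, plus any whitespace that follows it.
data$text <- sub("^[^.!?]*[.!?]+\\s*", "", data$text)

data$text
#> [1] "It is sunny." "It is rainy."
```

Rows without any of these punctuation marks are left unchanged, because the pattern does not match them.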

score 1 · Accepted Answer · answered Aug 10 '23 at 09:14

If there's a chance that the first sentence contains punctuation of its own (e.g. abbreviations or decimal numbers), and you are using a text mining library anyway, it makes sense to let that library handle sentence tokenization.

With {tidytext}:

library(dplyr)
library(tidytext)

# example with punctuation in 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences, converting tokens to lowercase is optional
data %>% 
  unnest_sentences(s, text)
#>   case_number                                                        s
#> 1           1 today is a good day, above avg. for sure, by 5.1 points.
#> 2           1                                             it is sunny.
#> 3           2                                      today is a bad day.
#> 4           2                                             it is rainy.

# drop 1st record of every case_number group
data %>% 
  unnest_sentences(s, text) %>% 
  filter(row_number() > 1, .by = case_number)
#>   case_number            s
#> 1           1 it is sunny.
#> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
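
If the original capitalization matters, or a document has more than two sentences, the pipeline above could be extended roughly as follows. This is a sketch, not part of the reprex: it calls `unnest_tokens()` with `token = "sentences"` and `to_lower = FALSE`, then pastes the remaining sentences back into one string per `case_number`. The third sentence in the sample data and the name `result` are made up for illustration, and `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidytext)

# Hypothetical data: the first document has three sentences
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day. It is sunny. I will go out.",
                            "Today is a bad day. It is rainy."))

result <- data %>%
  # split into sentences without lowercasing the text
  unnest_tokens(s, text, token = "sentences", to_lower = FALSE) %>%
  # drop the first sentence of every document
  filter(row_number() > 1, .by = case_number) %>%
  # glue the remaining sentences back together, one row per document
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a document consisting of a single sentence disappears from `result` entirely, since nothing is left after the filter; join back to the original `case_number` values if those rows need to be kept.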
