I have several documents and want to remove the first sentence of each one. I have not found a solution so far.

Here is an example. The data are structured like this:

case_number text
1 Today is a good day. It is sunny.
2 Today is a bad day. It is rainy.

The result should look like this:

case_number text
1 It is sunny.
2 It is rainy.

Here is the example dataset:

case_number <- c(1, 2)

text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")

data <- data.frame(case_number, text)
Patrick
USER12345
  • I suggest looking at some NLP-related libraries in R. They usually have pre-implemented sentence-splitting functionality. Look here: https://stackoverflow.com/questions/18712878/r-break-corpus-into-sentences – Patrick Aug 10 '23 at 07:11
  • If your sentence always ends with known punctuation marks, it is essentially "remove everything before the punctuation", similar to this one: https://stackoverflow.com/questions/32767164/use-gsub-remove-all-string-before-first-white-space-in-r (i.e. replace the white space in that question with your known punctuation marks) – benson23 Aug 10 '23 at 07:13
  • Thanks, benson23. You're right. I can solve it by removing everything up to the first punctuation mark. I was thinking of something more complex. That's a very easy way to handle it. – USER12345 Aug 10 '23 at 08:16
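
For reference, the punctuation-based approach suggested in the comments above could look roughly like this in base R. This is only a sketch: it assumes every first sentence ends with ".", "!" or "?" and contains no such character earlier (no abbreviations or decimal numbers). The pattern itself is my own illustration, not something taken from the linked questions.

```r
# Example data from the question
data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day. It is sunny.",
           "Today is a bad day. It is rainy.")
)

# Remove everything up to and including the first ".", "!" or "?",
# plus any whitespace that follows it. sub() replaces only the first match.
# Assumption: the first sentence contains no earlier ".", "!" or "?".
data$text <- sub("^[^.!?]*[.!?]\\s*", "", data$text)

data
```

Texts without any of these punctuation marks are left unchanged, because the pattern does not match them.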

1 Answer

If there's a chance that sentences contain punctuation other than the final mark (e.g. abbreviations or decimal numbers), and you are using a text mining library anyway, it makes sense to let that library handle sentence tokenization.
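
To illustrate the problem, here is a small sketch of what happens when a plain "cut at the first full stop" pattern meets an abbreviation. The regular expression and the name `naive` are only illustrative assumptions, not part of any library.

```r
# A first sentence that contains an abbreviation ("avg.") and a decimal ("5.1")
text <- "Today is a good day, above avg. for sure, by 5.1 points. It is sunny."

# Naive approach: drop everything up to the first ".", "!" or "?".
# The match stops at "avg.", so the rest of the first sentence survives.
naive <- sub("^[^.!?]*[.!?]\\s*", "", text)

naive
```

The result still starts with "for sure, by 5.1 points.", so the first sentence is only partly removed.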

With {tidytext}:

library(dplyr)
library(tidytext)

# example with punctuation in the 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences, converting tokens to lowercase is optional
data %>% 
  unnest_sentences(s, text)
#>   case_number                                                        s
#> 1           1 today is a good day, above avg. for sure, by 5.1 points.
#> 2           1                                             it is sunny.
#> 3           2                                      today is a bad day.
#> 4           2                                             it is rainy.

# drop 1st record of every case_number group
data %>% 
  unnest_sentences(s, text) %>% 
  filter(row_number() > 1, .by = case_number)
#>   case_number            s
#> 1           1 it is sunny.
#> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
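
If you need the original capitalisation and one row per case again, as in the desired output from the question, something along these lines might work. It is a sketch under a few assumptions: it calls `unnest_tokens()` with `token = "sentences"` and `to_lower = FALSE` instead of the `unnest_sentences()` wrapper, it needs dplyr 1.1.0 or later for `.by`, and the third sentence in case 1 as well as the name `result` are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Case 1 has an extra (made-up) third sentence to show the re-joining step
data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny. It is warm.",
           "Today is a bad day. It is rainy.")
)

result <- data %>%
  # split into sentences and keep the original case
  unnest_tokens(s, text, token = "sentences", to_lower = FALSE) %>%
  # drop the 1st sentence of every case_number group
  filter(row_number() > 1, .by = case_number) %>%
  # paste the remaining sentences back into one string per case
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a case consisting of a single sentence would disappear from `result` entirely, since nothing is left after the filter.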

margusl