I'm looking at Oliver Twist in both English and French. I found the tidytext vignette (https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html), which provides code to assign a chapter number to each row of text. When I apply it to the English text, it works just fine:
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
chaptersEN <- twistEN %>%
  mutate(
    line = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup()
When I then look at chaptersEN, I can see that it has correctly assigned a chapter number to each row. Where I'm running into trouble is with the French text. Here's my code:
twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
chaptersFR <- twistFR %>%
  mutate(
    line = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapitre [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup()
The problem here is that the chapters aren't named Chapter 1 and Chapter 2; they are named Chapitre Premier, Chapitre Deuxieme, and so on. I believe the regex finds each chapter heading by looking for a numeral (a digit or a Roman-numeral letter) right after the word "chapter" (please correct me if I'm wrong), so it doesn't match when the number is written out as a word. Any ideas on how to assign the chapter number?
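To make the suspicion above concrete, here is a minimal sketch that runs the pattern against a few heading lines. The sample strings are made up for illustration and are not copied from the Gutenberg files, and the looser pattern at the end is only one possible assumption about how the French headings could be matched, not a confirmed fix.

```r
library(stringr)

# Made-up heading lines (an assumption, not lines from the downloaded texts).
headings <- c("CHAPTER I", "Chapter 12", "CHAPITRE PREMIER",
              "CHAPITRE DEUXIEME", "CHAPITRE CINQUIEME", "CHAPITRE II")

# Pattern from the English code: "chapter", a space, then a single character
# that is either a digit or one of the letters i, v, x, l, c.
english_pattern <- regex("^chapter [\\divxlc]", ignore_case = TRUE)

# Same character class, but with the French word "chapitre".
french_numeral_pattern <- regex("^chapitre [\\divxlc]", ignore_case = TRUE)

# Looser, hypothetical alternative: "chapitre", whitespace, then any word
# character, so spelled-out ordinals such as "PREMIER" also count.
french_any_pattern <- regex("^chapitre\\s+\\w", ignore_case = TRUE)

english_hits <- str_detect(headings, english_pattern)
french_numeral_hits <- str_detect(headings, french_numeral_pattern)
french_any_hits <- str_detect(headings, french_any_pattern)

# The running count of matches is what becomes the chapter number per row.
chapter <- cumsum(french_any_hits)

print(data.frame(headings, english_hits, french_numeral_hits, french_any_hits, chapter))
```

With the numeral-only character class, a heading like "CHAPITRE CINQUIEME" matches only because it happens to start with the letter c, while "PREMIER" and "DEUXIEME" are skipped. The looser pattern would also count any body line that starts with the word "Chapitre", so it may need tightening (for example, requiring uppercase) depending on how the headings actually appear in the text.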