Regex doesn't capture numbers written out as words

Question

I'm looking at Oliver Twist in both English and French. I found this site (https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) that provides code to apply the chapter number per row of text. When I apply it to the English text, it works just fine:

library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
chaptersEN <- twistEN %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

When I then look at chaptersEN, I can see that it's appropriately applied the chapter number on each row. Where I'm running into trouble is with the French text. Here's my code:

twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
chaptersFR <- twistFR %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^chaptitre [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

The problem here is that the chapters aren't named Chapter 1 and Chapter 2; they are named Chapitre Premier, Chapitre Deuxieme. I believe the regex finds the chapter number by looking at the numeral following the word chapter (please correct me if I'm wrong), so it doesn't know what to do when that numeral is written out as a word. Any ideas on how to apply the chapter number?

Yeah, that's a tough one. Other than enumerating the number words and mapping those to actual numbers, you're kind of out of luck. You could, however, just assume chapter number based on an incrementing index, basically side-step the regex for _that_ piece of info entirely. — Jordan Kasper, Sep 23 '19 at 19:35
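
A minimal sketch of the "enumerate the number words" idea from the comment above. The sample headings and the number_words lookup are made up for illustration and are not taken from the downloaded text. as.roman() comes from base R's utils package.

```r
library(stringr)

# Hypothetical sample headings, modelled on the French edition
headings <- c("CHAPITRE PREMIER.", "CHAPITRE II", "CHAPITRE IV.", "CHAPITRE LIII.")

# Hypothetical lookup for number words; extend it as needed
number_words <- c(PREMIER = 1L, DEUXIEME = 2L)

# Take the token after "CHAPITRE ", dropping an optional trailing period
token <- str_match(headings, "^CHAPITRE ([A-Z]+)\\.?$")[, 2]

# Use the lookup for number words, otherwise read the token as a roman numeral
chapter <- ifelse(token %in% names(number_words),
                  number_words[token],
                  suppressWarnings(as.integer(as.roman(token))))
chapter
```

The lookup is checked before the roman-numeral fallback, so a word that also happens to be a valid roman numeral would still be read as a word.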

score 0 · Answer 1 · answered Sep 23 '19 at 19:59

The short answer: you wrote chaptitre instead of chapitre.

What are you using the [\\divxlc] part of the pattern for?
For example, taking ^chapitre [\\divxlc] piece by piece:
^ anchors the match at the start of a row
chapitre matches the literal word chapitre (in any case, because your call passes ignore_case = TRUE)
the blank matches a single space
and the part [\\divxlc] matches one character: a digit (inside an R string, \\d is the regex \d) or one of 'i', 'v', 'x', 'l', 'c'

So it could match these examples: chapitre 1, chapitre i, or chapitre x

And if you drop ignore_case = TRUE but still want the c at the start of chapitre to be uppercase or lowercase, you could use this:
^[cC]hapitre [\\divxlc]
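
To check what the character class accepts, here is a small sketch on a few made-up lines modelled on the French headings (the sample_lines vector is hypothetical). It suggests that fixing the spelling alone would still miss a heading such as CHAPITRE PREMIER., because P is not in the class.

```r
library(stringr)

# Hypothetical sample lines, modelled on the French edition
sample_lines <- c("CHAPITRE PREMIER.", "CHAPITRE II", "CHAPITRE IV.",
                  "chapitre, Olivier se trouvait")

# Corrected spelling, same character class as in the question
pattern <- regex("^chapitre [\\divxlc]", ignore_case = TRUE)

# TRUE where the line counts as a chapter heading under this pattern
is_heading <- str_detect(sample_lines, pattern)
is_heading
# FALSE TRUE TRUE FALSE
```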

To be honest, I copied the regex from another site. The only thing I can think of is that it intends to look for roman numerals? — Litmon, Sep 24 '19 at 22:36

Ritchie Sacramento · Accepted Answer · 2019-09-23T20:11:51.093

Matching on rows that begin with an upper case 'CHAPITRE' is sufficient in this case.

chaptersFR <- twistFR %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^CHAPITRE")))) %>%
  ungroup()

chaptersFR %>% 
  filter(grepl("^chapitre", text, ignore.case = TRUE)) %>%
  head(5)

# A tibble: 5 x 4
  gutenberg_id text               line chapter
         <int> <chr>             <int>   <int>
1        16023 CHAPITRE PREMIER.     1       1
2        16023 CHAPITRE II         124       2
3        16023 CHAPITRE III        604       3
4        16023 CHAPITRE IV.       1006       4
5        16023 CHAPITRE V.        1333       5

chaptersFR %>% 
  filter(grepl("^chapitre", text, ignore.case = TRUE)) %>%
  tail(5)

# A tibble: 5 x 4
  gutenberg_id text                                                            line chapter
         <int> <chr>                                                          <int>   <int>
1        16023 CHAPITRE L.                                                    18443      50
2        16023 CHAPITRE LI.                                                   18973      51
3        16023 chapitre, Olivier se trouvait, à trois heures de l'après-midi, 18979      51
4        16023 CHAPITRE LII                                                   19580      52
5        16023 CHAPITRE LIII.                                                 19989      53
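
Line 18979 in the output above starts with a lowercase "chapitre," in running text, which is why the case-sensitive match matters. A small sketch on made-up lines (sample_lines is hypothetical) compares the two counters:

```r
library(stringr)

# Hypothetical sample lines, modelled on the tail of the French text
sample_lines <- c("CHAPITRE L.", "some text", "CHAPITRE LI.",
                  "chapitre, Olivier se trouvait", "CHAPITRE LII")

# Case-sensitive: only the uppercase headings bump the counter
strict <- cumsum(str_detect(sample_lines, "^CHAPITRE"))
strict
# 1 1 2 2 3

# Case-insensitive: the running-text line is counted as a heading too
loose <- cumsum(str_detect(sample_lines, regex("^chapitre", ignore_case = TRUE)))
loose
# 1 1 2 3 4
```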
