Is there an easy way to fill in between 2 values in R?

Question

I am in the process of creating a corpus of textbooks. Along with the actual sentences, there are some metadata columns, including the type of text each sentence comes from (for example, the main body of the text, a text box, a figure, a table, or an activity).

Because of how the original text was tagged after it was scanned, I can easily mark where the different text types start and end, but I need to fill in the information for the sentences inside those two tags.

For example, I have (I have replaced the actual sentences with the word Sentences so that it fits the page):

chpt	page_num	paragraph	type	text_type	text
2	9	11	main	text	Sentences
2	9	12	main	text	Sentences
2	9	14	text_box_start	header	Sentences
2	9	15	main	text	Sentences
2	9	16	main	text	Sentences
2	9	17	text_box_end	text	Sentences
2	10	19	main	text	Sentences
2	10	20	main	text	Sentences

I want (only the values in rows 3 to 6 of the type column change):

chpt	page_num	paragraph	type	text_type	text
2	9	11	main	text	Sentences
2	9	12	main	text	Sentences
2	9	14	text_box	header	Sentences
2	9	15	text_box	text	Sentences
2	9	16	text_box	text	Sentences
2	9	17	text_box	text	Sentences
2	10	19	main	text	Sentences
2	10	20	main	text	Sentences

I imagine it would be possible to use a for loop with a couple of if/else statements to iterate over the "type" column, but I was wondering if there is an easier way to replace the value of all the rows from "text_box_start" to "text_box_end" in the df above with "text_box".

I am using R with the Tidyverse packages installed, so if anyone has a suggestion for a solution using base R or one of the Tidyverse packages, that would be greatly appreciated.

score 1 · Answer 1 · answered Aug 24 '22 at 04:37

library(tidyverse)

df %>%
  mutate(type1 = na_if(type, 'main')) %>%
  fill(type1) %>%
  mutate(type1 = coalesce(na_if(type1, 'text_box_end'), type),
         type1 = recode(type1, text_box_end = 'text_box_start'))

  chpt page_num paragraph           type text_type      text          type1
1     2        9        11           main      text Sentences           main
2     2        9        12           main      text Sentences           main
3     2        9        14 text_box_start    header Sentences text_box_start
4     2        9        15           main      text Sentences text_box_start
5     2        9        16           main      text Sentences text_box_start
6     2        9        17   text_box_end      text Sentences text_box_start
7     2       10        19           main      text Sentences           main
8     2       10        20           main      text Sentences           main

Thanks for the suggestion. This works really well with the text_box tags, but was harder than the other solution to scale to include tags for other types of text. Thanks for taking the time to suggest this. — Gavin, Aug 24 '22 at 07:53
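On the point about scaling to other tags: below is a minimal base R sketch of one way this could be generalised. It is not from either answer. It assumes every tag follows a `<label>_start` / `<label>_end` naming pattern, that tags are properly paired, and that tagged regions are never nested. The helper name `fill_between_tags` and the `figure_start` / `figure_end` tags are hypothetical, used only for illustration.

```r
# Sketch only: assumes "<label>_start" / "<label>_end" tags that are
# properly paired and never nested. fill_between_tags() is a hypothetical helper.
fill_between_tags <- function(type) {
  is_start <- grepl("_start$", type)
  is_end   <- grepl("_end$", type)

  # Tag name with the "_start" / "_end" suffix stripped, e.g. "text_box"
  label <- sub("_(start|end)$", "", type)

  # A row is inside a region if more starts than ends have been seen so far;
  # the end row itself is added explicitly
  inside <- (cumsum(is_start) - cumsum(is_end)) > 0 | is_end

  # Row index of the most recent start tag (0 before the first start)
  last_start <- cummax(ifelse(is_start, seq_along(type), 0L))

  out <- type
  out[inside] <- label[last_start[inside]]
  out
}

# "figure_start" / "figure_end" are made-up tags for the demonstration
type <- c("main", "text_box_start", "main", "text_box_end",
          "main", "figure_start", "main", "figure_end", "main")
filled <- fill_between_tags(type)
filled
```

With the data from the question this could be applied as `df$type <- fill_between_tags(df$type)`, or inside `mutate(type = fill_between_tags(type))`. An end tag without a preceding start tag would break the assumption above, so it may be worth checking the tags first.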

jared_mamrot · Accepted Answer · 2022-08-24T05:32:13.373

One option is to count the start/end tags cumulatively: a row lies inside a text box when the running count of tags is odd (i.e. it comes after a 'start' but before the matching 'end'). Change type to "textbox" for those rows and for any row that itself contains "start" or "end", i.e.

library(tidyverse)

df <- read.table(text = "chpt   page_num    paragraph   type    text_type   text
2   9   11  main    text    Sentences
2   9   12  main    text    Sentences
2   9   14  text_box_start  header  Sentences
2   9   15  main    text    Sentences
2   9   16  main    text    Sentences
2   9   17  text_box_end    text    Sentences
2   10  19  main    text    Sentences
2   10  20  main    text    Sentences",
header = TRUE)

# Tidyverse
df %>%
  mutate(type = ifelse(str_detect(type, "start|end") |
                         cumsum(str_detect(type, "start|end")) %% 2 == 1,
                       "textbox", type))
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

# Base R
df$type <- ifelse(grepl("start|end", df$type) |
                    cumsum(grepl("start|end", df$type)) %% 2 == 1,
                  "textbox", df$type)
df
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

Created on 2022-08-24 by the reprex package (v2.0.1)
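To see why the parity check works, it may help to look at the intermediate vectors. The sketch below rebuilds only the `type` column from the question; the variable names `is_tag`, `tag_count` and `inside` are introduced here for illustration and do not appear in the answer's code.

```r
# The type column from the question's sample data
type <- c("main", "main", "text_box_start", "main", "main",
          "text_box_end", "main", "main")

is_tag    <- grepl("start|end", type)      # TRUE on the start and end rows
tag_count <- cumsum(is_tag)                # running number of tags seen so far
inside    <- is_tag | tag_count %% 2 == 1  # odd count = after a start, before its end

data.frame(type, is_tag, tag_count, inside)
#>             type is_tag tag_count inside
#> 1           main  FALSE         0  FALSE
#> 2           main  FALSE         0  FALSE
#> 3 text_box_start   TRUE         1   TRUE
#> 4           main  FALSE         1   TRUE
#> 5           main  FALSE         1   TRUE
#> 6   text_box_end   TRUE         2   TRUE
#> 7           main  FALSE         2  FALSE
#> 8           main  FALSE         2  FALSE
```

The end row has an even count, which is why the condition also includes `is_tag`. Note that the pattern `"start|end"` matches anywhere in the string, so a type such as a hypothetical `"appendix"` would also be counted; anchoring the pattern, e.g. `"_(start|end)$"`, may be safer on real data.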

Thanks, this solution worked perfectly. I liked the out-of-the-box thinking; I hadn't considered solving it this way. — Gavin, Aug 24 '22 at 07:39

jay.sf · Answer 3 · 2022-08-24T07:58:27.797

Assuming all text boxes have both a start and an end (as is usual in valid HTML), you can grep for 'text_box' and arrange the matching row numbers in a 2-row matrix. Each column then holds the first and last row of one sequence to change to 'text_box'.

dat[apply(matrix(grep('text_box', dat$type), 2), 2, \(x) do.call(seq, as.list(x))), 'type'] <- 'text_box'
dat
#    chpt page_num paragraph     type text_type      text
# 1     2        9        11     main      text Sentences
# 2     2        9        12     main      text Sentences
# 3     2        9        14 text_box    header Sentences
# 4     2        9        15 text_box      text Sentences
# 5     2        9        16 text_box      text Sentences
# 6     2        9        17 text_box      text Sentences
# 7     2       10        19     main      text Sentences
# 8     2       10        20     main      text Sentences
# 9     2        9        11     main      text Sentences
# 10    2        9        12     main      text Sentences
# 11    2        9        14 text_box    header Sentences
# 12    2        9        15 text_box      text Sentences
# 13    2        9        16 text_box      text Sentences
# 14    2        9        17 text_box      text Sentences
# 15    2       10        19     main      text Sentences
# 16    2       10        20     main      text Sentences

For demonstration, I rbind-ed your sample data twice.

Data:

dat <- structure(list(chpt = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), page_num = c(9L, 9L, 9L, 9L, 9L, 9L, 
10L, 10L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L), paragraph = c(11L, 
12L, 14L, 15L, 16L, 17L, 19L, 20L, 11L, 12L, 14L, 15L, 16L, 17L, 
19L, 20L), type = c("main", "main", "text_box_start", "main", 
"main", "text_box_end", "main", "main", "main", "main", "text_box_start", 
"main", "main", "text_box_end", "main", "main"), text_type = c("text", 
"text", "header", "text", "text", "text", "text", "text", "text", 
"text", "header", "text", "text", "text", "text", "text"), text = c("Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences"
)), class = "data.frame", row.names = c(NA, -16L))

Thanks for the suggestion. For some reason, it threw an error when I tried it. Error in `[<-.data.frame`(`*tmp*`, apply(matrix(grep("text_box", dat$type), : 'list' object cannot be coerced to type 'integer' I don't have much experience using matrices in this way, so it was probably something that I did. I was able to get the solution below to work. Thanks for taking the time to answer this and I will look at it in more detail when I have time to see if I can figure out what I did wrong. — Gavin, Aug 24 '22 at 07:41
@Gavin I did not have your data available in the [recommended way](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) using `dput()`. I have added the data as I have it so you can compare. — jay.sf, Aug 24 '22 at 08:01
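A possible explanation for the error reported above (an assumption, the thread does not confirm it): `apply()` returns a matrix only when every call returns a vector of the same length. If the text boxes span different numbers of rows, it returns a list instead, and a list cannot be used as a row index. The sketch below is one way around that, using `Map()` and `unlist()`; the small data frame is made up for illustration and is not the asker's data.

```r
# Made-up data: two text boxes of different lengths (rows 2-4 and rows 6-9)
dat2 <- data.frame(
  paragraph = 1:9,
  type = c("main", "text_box_start", "main", "text_box_end", "main",
           "text_box_start", "main", "main", "text_box_end")
)

# Row 1 of `edges` holds the start rows, row 2 the matching end rows
edges <- matrix(grep("text_box", dat2$type), nrow = 2)

# Build one sequence per text box and flatten them into a plain integer vector
rows <- unlist(Map(seq, edges[1, ], edges[2, ]))

dat2[rows, "type"] <- "text_box"
dat2$type
```

Like the original one-liner, this still assumes that every start tag has a matching end tag, so that `grep()` returns an even number of positions.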
