I am in the process of creating a corpus of textbooks. Along with the actual sentences, there are some metadata columns, including the type of text each sentence comes from (for example, the main body of the text, a text box, a figure, a table, or an activity).

Because of how the original text was tagged after it was scanned, I can easily mark where the different text types start and end, but I need to fill in the information for the sentences inside those two tags.

For example, I have (I have replaced the actual sentences with the word Sentences so that it fits the page):

chpt  page_num  paragraph  type            text_type  text
2     9         11         main            text       Sentences
2     9         12         main            text       Sentences
2     9         14         text_box_start  header     Sentences
2     9         15         main            text       Sentences
2     9         16         main            text       Sentences
2     9         17         text_box_end    text       Sentences
2     10        19         main            text       Sentences
2     10        20         main            text       Sentences

I want (the only thing that changes is the data in rows 3 to 6 of the type column):

chpt  page_num  paragraph  type      text_type  text
2     9         11         main      text       Sentences
2     9         12         main      text       Sentences
2     9         14         text_box  header     Sentences
2     9         15         text_box  text       Sentences
2     9         16         text_box  text       Sentences
2     9         17         text_box  text       Sentences
2     10        19         main      text       Sentences
2     10        20         main      text       Sentences

I imagine it would be possible to use a for loop and a couple of if/else statements to iterate over the "type" column, but I was wondering if there is an easier way to replace the value of all the rows between "text_box_start" and "text_box_end" in the df above with "text_box".

I am using R with the Tidyverse packages installed, so if anyone has a suggestion for a solution using base R or one of the Tidyverse packages, that would be greatly appreciated.

Gavin

3 Answers

library(tidyverse)

df %>%
  mutate(type1 = na_if(type, 'main')) %>%
  fill(type1) %>%
  mutate(type1 = coalesce(na_if(type1, 'text_box_end'), type),
         type1 = recode(type1, text_box_end = 'text_box_start'))

 chpt page_num paragraph           type text_type      text          type1
1     2        9        11           main      text Sentences           main
2     2        9        12           main      text Sentences           main
3     2        9        14 text_box_start    header Sentences text_box_start
4     2        9        15           main      text Sentences text_box_start
5     2        9        16           main      text Sentences text_box_start
6     2        9        17   text_box_end      text Sentences text_box_start
7     2       10        19           main      text Sentences           main
8     2       10        20           main      text Sentences           main
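
If the final label should read `text_box`, as in the question's desired output, one possible follow-up step is to strip the `_start` suffix. This is a minimal base R sketch, assuming `type1` holds the values printed above (the vector is retyped here so the snippet runs on its own, and `type_clean` is a made-up name):

```r
# Values of the type1 column as printed in the output above
type1 <- c("main", "main", "text_box_start", "text_box_start",
           "text_box_start", "text_box_start", "main", "main")

# Drop a trailing "_start" so the label becomes "text_box"
type_clean <- sub("_start$", "", type1)
print(type_clean)
```

Inside the pipeline, the same `sub()` call could presumably go into one more `mutate()` step.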
Onyambu
  • Thanks for the suggestion. This works really well with the text_box tags, but was harder than the other solution to scale to include tags for other types of text. Thanks for taking the time to suggest this. – Gavin Aug 24 '22 at 07:53

One option is to keep a running count of the start/end tags with cumsum(). The count is odd for every row from a 'start' tag up to, but not including, the matching 'end' tag. So set type to "textbox" whenever the row is itself a start/end tag or the running count is odd, i.e.

library(tidyverse)

df <- read.table(text = "chpt   page_num    paragraph   type    text_type   text
2   9   11  main    text    Sentences
2   9   12  main    text    Sentences
2   9   14  text_box_start  header  Sentences
2   9   15  main    text    Sentences
2   9   16  main    text    Sentences
2   9   17  text_box_end    text    Sentences
2   10  19  main    text    Sentences
2   10  20  main    text    Sentences",
header = TRUE)

# Tidyverse
df %>%
  mutate(type = ifelse(str_detect(type, "start|end") |
                         cumsum(str_detect(type, "start|end")) %% 2 == 1,
                       "textbox", type))
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

# Base R
df$type <- ifelse(grepl("start|end", df$type) |
                    cumsum(grepl("start|end", df$type)) %% 2 == 1,
                  "textbox", df$type)
df
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

Created on 2022-08-24 by the reprex package (v2.0.1)
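
The parity idea above might also be stretched to cover several tag types at once, which one of the comments in this thread mentions as a need. The following is only a sketch under assumptions: every tag ends in `_start` or `_end`, regions are properly paired and never nested, and the `figure_start`/`figure_end` tags and all variable names are invented for illustration.

```r
# Hypothetical type column: one text box and one figure region
type <- c("main", "text_box_start", "main", "text_box_end",
          "main", "figure_start", "main", "main", "figure_end", "main")

# TRUE on rows that carry a start or end tag
is_tag <- grepl("_(start|end)$", type)

# Odd running count: the row lies between a start tag and its end tag
inside <- cumsum(is_tag) %% 2 == 1

# Label of each tag row, e.g. "text_box_start" -> "text_box"
label <- sub("_(start|end)$", "", type)

# Position of the most recent tag row; pmax() avoids a zero index
# for rows that come before the first tag
last_tag <- pmax(cummax(ifelse(is_tag, seq_along(type), 0L)), 1L)

# Tag rows take their own label, inner rows take the label of the opening tag
result <- ifelse(is_tag, label, ifelse(inside, label[last_tag], type))
print(result)
```

Rows outside any region keep their original value, so `main` passes through untouched.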

jared_mamrot
  • Thanks, this solution worked perfectly. I liked the out-of-the-box thinking; I hadn't considered solving it this way. – Gavin Aug 24 '22 at 07:39

Assuming all text boxes have both a start and an end tag (as is usual in valid HTML), you can grep for 'text_box' and arrange the matching row indices in a 2-row matrix. Each column then holds the start and end of one text box, i.e. the edges of the row sequence to change to 'text_box'.

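# unlist() keeps the index a plain vector when boxes differ in length (apply() then returns a list)
dat[unlist(apply(matrix(grep('text_box', dat$type), 2), 2, \(x) do.call(seq, as.list(x)))), 'type'] <- 'text_box'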
dat
#    chpt page_num paragraph     type text_type      text
# 1     2        9        11     main      text Sentences
# 2     2        9        12     main      text Sentences
# 3     2        9        14 text_box    header Sentences
# 4     2        9        15 text_box      text Sentences
# 5     2        9        16 text_box      text Sentences
# 6     2        9        17 text_box      text Sentences
# 7     2       10        19     main      text Sentences
# 8     2       10        20     main      text Sentences
# 9     2        9        11     main      text Sentences
# 10    2        9        12     main      text Sentences
# 11    2        9        14 text_box    header Sentences
# 12    2        9        15 text_box      text Sentences
# 13    2        9        16 text_box      text Sentences
# 14    2        9        17 text_box      text Sentences
# 15    2       10        19     main      text Sentences
# 16    2       10        20     main      text Sentences

For demonstration, I stacked your sample data twice with rbind().
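
To unpack the matrix approach step by step, here is a small sketch on a made-up `type` column (`dat2`, `edges` and `rows` are illustrative names, not part of the answer's data). It builds the row index with `Map()` and `unlist()`, which gives a plain integer vector even when the text boxes span different numbers of rows. In that case `apply()` on its own returns a list, which would be consistent with the error reported in the comments, though that cause is an assumption.

```r
# Hypothetical data: two text boxes of different lengths (3 rows and 4 rows)
dat2 <- data.frame(
  type = c("main", "text_box_start", "main", "text_box_end",
           "main", "text_box_start", "main", "main", "text_box_end", "main"),
  stringsAsFactors = FALSE
)

# Positions of all start/end tags, arranged so each column is one (start, end) pair
edges <- matrix(grep("text_box", dat2$type), nrow = 2)

# One sequence of row numbers per text box, flattened into a single vector
rows <- unlist(Map(seq, edges[1, ], edges[2, ]))

# Overwrite the type of every row inside a text box, tags included
dat2[rows, "type"] <- "text_box"
print(dat2$type)
```

As in the answer, this relies on every start tag having a matching end tag. An unpaired tag would shift the pairs in `edges`.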


Data:

dat <- structure(list(chpt = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), page_num = c(9L, 9L, 9L, 9L, 9L, 9L, 
10L, 10L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L), paragraph = c(11L, 
12L, 14L, 15L, 16L, 17L, 19L, 20L, 11L, 12L, 14L, 15L, 16L, 17L, 
19L, 20L), type = c("main", "main", "text_box_start", "main", 
"main", "text_box_end", "main", "main", "main", "main", "text_box_start", 
"main", "main", "text_box_end", "main", "main"), text_type = c("text", 
"text", "header", "text", "text", "text", "text", "text", "text", 
"text", "header", "text", "text", "text", "text", "text"), text = c("Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences", 
"Sentences", "Sentences", "Sentences", "Sentences", "Sentences"
)), class = "data.frame", row.names = c(NA, -16L))
jay.sf
  • Thanks for the suggestion. For some reason, it threw an error when I tried it. Error in `[<-.data.frame`(`*tmp*`, apply(matrix(grep("text_box", dat$type), : 'list' object cannot be coerced to type 'integer' I don't have much experience using matrices in this way, so it was probably something that I did. I was able to get the solution below to work. Thanks for taking the time to answer this and I will look at it in more detail when I have time to see if I can figure out what I did wrong. – Gavin Aug 24 '22 at 07:41
  • @Gavin I did not have your data available in the [recommended way](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) using `dput()`. I have added the data as I have it so you can compare. – jay.sf Aug 24 '22 at 08:01