I am in the process of creating a corpus of textbooks. Along with the actual sentences, there are some metadata columns, including the type of text each sentence belongs to (for example, the main body of the text, a text box, a figure, a table, or an activity).
Because of how the original text was tagged after it was scanned, I can easily mark where the different text types start and end, but I need to fill in the information for the sentences inside those two tags.
For example, I have (I have replaced the actual sentences with the word Sentences so that it fits the page):
chpt | page_num | paragraph | type | text_type | text |
---|---|---|---|---|---|
2 | 9 | 11 | main | text | Sentences |
2 | 9 | 12 | main | text | Sentences |
2 | 9 | 14 | text_box_start | header | Sentences |
2 | 9 | 15 | main | text | Sentences |
2 | 9 | 16 | main | text | Sentences |
2 | 9 | 17 | text_box_end | text | Sentences |
2 | 10 | 19 | main | text | Sentences |
2 | 10 | 20 | main | text | Sentences |
I want (the only change is to the values in rows 3 to 6 of the type column):
chpt | page_num | paragraph | type | text_type | text |
---|---|---|---|---|---|
2 | 9 | 11 | main | text | Sentences |
2 | 9 | 12 | main | text | Sentences |
2 | 9 | 14 | text_box | header | Sentences |
2 | 9 | 15 | text_box | text | Sentences |
2 | 9 | 16 | text_box | text | Sentences |
2 | 9 | 17 | text_box | text | Sentences |
2 | 10 | 19 | main | text | Sentences |
2 | 10 | 20 | main | text | Sentences |
I imagine it would be possible to use a for loop and a couple of if/else statements to iterate over the "type" column, but I was wondering if there is an easier way to replace the value of all the rows from "text_box_start" to "text_box_end" (inclusive) in the df above with "text_box".
I am using R with the Tidyverse packages installed, so if anyone has a suggestion for a solution using base R or one of the Tidyverse packages, that would be greatly appreciated.
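
For reference, here is a reproducible version of the example data together with one possible vectorised approach, sketched in base R. It is only a sketch under two assumptions that may not hold for the full corpus: every "text_box_start" has a matching "text_box_end" later in the column, and text boxes are never nested. The helper names `starts`, `ends`, and `inside` are made up for illustration.

```r
# Example data mirroring the table above
df <- data.frame(
  chpt      = 2,
  page_num  = c(9, 9, 9, 9, 9, 9, 10, 10),
  paragraph = c(11, 12, 14, 15, 16, 17, 19, 20),
  type      = c("main", "main", "text_box_start", "main", "main",
                "text_box_end", "main", "main"),
  text_type = c("text", "text", "header", "text", "text",
                "text", "text", "text"),
  text      = "Sentences",
  stringsAsFactors = FALSE
)

# Running counts of start and end tags seen up to and including each row
starts <- cumsum(df$type == "text_box_start")
ends   <- cumsum(df$type == "text_box_end")

# A row is inside a box if more boxes have been opened than closed,
# or if the row is itself the closing tag
# (assumption: tags are properly paired and not nested)
inside <- starts > ends | df$type == "text_box_end"

# Overwrite only the rows that fall inside a box
df$type[inside] <- "text_box"

print(df)
```

The same logical vector could presumably be used in a Tidyverse pipeline as well, for example inside `dplyr::mutate()` with `if_else(inside, "text_box", type)`.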