To help, I first need some sample data to work with.
library(tidyverse)
articles = read.table(
header = TRUE,sep = ",",text="
content,publisher,date
======,NA,NA
Daily News Dec 27,Daily News,Dec 27
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 28,Daily News,Dec 28
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 30,Daily News,Dec 30
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily Bugle Dec 31,Daily Bugle,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Weekly News Dec 31,Weekly News,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA") %>%
  as_tibble() %>%
  # Safeguard only: by default read.table() already reads the string "NA" as missing.
  mutate(publisher = ifelse(publisher == "NA", NA, publisher),
         date = ifelse(date == "NA", NA, date))
articles
Output:
# A tibble: 52 x 3
content publisher date
<chr> <chr> <chr>
1 ====== NA NA
2 Daily News Dec 27 Daily News Dec 27
3 Wookies are NA NA
4 . recreationally NA NA
5 using drugs at a NA NA
6 higher rate than NA NA
7 using drugs at a NA NA
8 higher rate than NA NA
9 using drugs at a NA NA
10 higher rate than NA NA
# ... with 42 more rows
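I hope this matches the format of your data. As I read it, these are five articles, each wrapped in ====== separator lines and starting with a header line that carries the publisher and date.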
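A quick way to test that reading is to count header rows and separator lines: each article should contribute one header and two ====== lines. Here is a minimal sketch on a made-up two-article miniature (`mini`, `n_headers` and `n_separators` are names I invented for illustration; on the real data you would use `articles` instead):

```r
# Hypothetical miniature of the layout above: two articles between ====== lines.
mini <- data.frame(
  content   = c("======", "Daily News Dec 27", "Wookies are", "======",
                "======", "Daily Bugle Dec 31", "Wookies are", "ever before.", "======"),
  publisher = c(NA, "Daily News", NA, NA, NA, "Daily Bugle", NA, NA, NA),
  date      = c(NA, "Dec 27", NA, NA, NA, "Dec 31", NA, NA, NA)
)

n_headers    <- sum(!is.na(mini$publisher))    # one header row per article
n_separators <- sum(mini$content == "======")  # two separator rows per article

n_headers
n_separators
```

If the separator count is not exactly twice the header count, the file probably does not follow the layout I assumed, and the approach below would need adjusting.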
Now let's add a conversion function and a simple mutate() step that assigns an article id to every line.
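# Collapse the rows of one article into a single row.
# Row 1 is the opening ====== line, row 2 the header, the last row the closing ======.
fConvert = function(data) tibble(
  publisher = data$publisher[2],
  date = data$date[2],
  content = data %>% slice(3:(nrow(.)-1)) %>%
    pull(content) %>% paste(collapse = " ")
)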
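articles %>% mutate(
  # A running count of header rows numbers the articles; lead() shifts the ids
  # up one row so each opening ====== line joins the article that follows it.
  idArticle = ifelse(!is.na(publisher), 1, 0) %>%
    cumsum() %>% lead(default = .[length(.)])
) %>% group_by(idArticle) %>%
  nest() %>%
  group_modify(~fConvert(.x$data[[1]]))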
Output:
# A tibble: 5 x 4
# Groups: idArticle [5]
idArticle publisher date content
<dbl> <chr> <chr> <chr>
1 1 Daily News Dec 27 Wookies are . recreationally using drugs at a higher rate than using drugs at a higher rate than u~
2 2 Daily News Dec 28 Wookies are . recreationally using drugs at a higher rate than ever before. ever before. ever befo~
3 3 Daily News Dec 30 Wookies are . recreationally using drugs at a higher rate than ever before. ever before.
4 4 Daily Bugle Dec 31 Wookies are . recreationally using drugs at a higher rate than ever before.
5 5 Weekly News Dec 31 Wookies are . recreationally higher rate than ever before.
As you can see, this extracts all five articles, despite their different lengths, and glues each article's lines together into a single content string. I hope that's what you meant.
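In case the idArticle step looks cryptic, it can be traced on a toy vector. This is only a sketch: `is_header`, `running` and `id_article` are hypothetical names, and the flags stand in for `!is.na(publisher)` on a made-up two-article layout.

```r
library(dplyr)

# Hypothetical toy flags: TRUE marks a header row, i.e. !is.na(publisher).
# Layout: ====== header text text ====== | ====== header text ======
is_header <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Running count of headers seen so far: 0 1 1 1 1 1 2 2 2
running <- cumsum(as.integer(is_header))

# Shift up by one row, repeating the last value at the end: 1 1 1 1 1 2 2 2 2
id_article <- lead(running, default = running[length(running)])

data.frame(is_header, running, id_article)
```

Without the lead() shift, the first ====== line would get id 0 and every later opening ====== line would be grouped with the previous article, so row 2 of a group would no longer be the header row that fConvert expects.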