Since you say that you must use separate
, here is a way. We can use str_count
to get the maximum number of splits with _
separator, and then make our into
argument for separate
based on that. Combined with fill = "left"
, this means that we know the last split (the president.txt) will be in the last column. You can then remove .txt
and the other columns as needed.
However, I think it is much simpler to just directly mutate
the president name into a column with str_extract
, as in the second example. This uses lookarounds to match letters preceded by _
and followed by .txt
.
library(tidyverse)
tbl <- tibble(
document = c(
"1994_post_elections_Mandela.txt",
"1994_pre_elections_deKlerk.txt",
"1995_Mandela.txt",
"1996_Mandela.txt",
"1997_Mandela.txt",
"1998_Mandela.txt",
"1999_post_elections_Mandela.txt",
"1999_pre_elections_Mandela.txt",
"2000_Mbeki.txt"
)
)
tbl %>%
separate(
col = document,
into = str_c(
"col",
1:(as.integer(max(str_count(.$document, "_"))) + 1)
),
sep = "_",
fill = "left"
)
#> # A tibble: 9 x 4
#> col1 col2 col3 col4
#> <chr> <chr> <chr> <chr>
#> 1 1994 post elections Mandela.txt
#> 2 1994 pre elections deKlerk.txt
#> 3 <NA> <NA> 1995 Mandela.txt
#> 4 <NA> <NA> 1996 Mandela.txt
#> 5 <NA> <NA> 1997 Mandela.txt
#> 6 <NA> <NA> 1998 Mandela.txt
#> 7 1999 post elections Mandela.txt
#> 8 1999 pre elections Mandela.txt
#> 9 <NA> <NA> 2000 Mbeki.txt
tbl %>%
mutate(president = str_extract(document, "(?<=_)[:alpha:]*?(?=\\.txt)"))
#> # A tibble: 9 x 2
#> document president
#> <chr> <chr>
#> 1 1994_post_elections_Mandela.txt Mandela
#> 2 1994_pre_elections_deKlerk.txt deKlerk
#> 3 1995_Mandela.txt Mandela
#> 4 1996_Mandela.txt Mandela
#> 5 1997_Mandela.txt Mandela
#> 6 1998_Mandela.txt Mandela
#> 7 1999_post_elections_Mandela.txt Mandela
#> 8 1999_pre_elections_Mandela.txt Mandela
#> 9 2000_Mbeki.txt Mbeki
Created on 2018-09-10 by the reprex package (v0.2.0).