Separating a column based on regex

Question

I have a column, named as "document" which has the following structure

1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt

What I'd like to do is extract the president's name, which is always just before ".txt" and pop it into a new column - I don't mind about the other characters/numbers going into another column altogether. For various reasons that I won't explain here, I need to use the separate function from the tidyr package.

I tried to follow the answer from here but my attempt failed miserably...

speech_gamma_exp<-speech_gamma %>%
separate(document, into=c("col1", "col2"), sep = "(\\_)(?!_*\\_)")

Question with `tidyr::separate` [Separate string after last underscore](https://stackoverflow.com/questions/49900323/separate-string-after-last-underscore) — pogibas, Sep 10 '18 at 19:42
Sorry, yes, I just automatically assumed it was from dplyr due to the >%> operator, but I have corrected myself. — The Statistician Magician, Sep 10 '18 at 19:43
I think you need `extract(document, into=c("col1", "col2"), "^(.*)_(.*)\\.txt$")`. With `separate`, you will always have the `.txt`s inside the strings. — Wiktor Stribiżew, Sep 10 '18 at 20:33

score 1 · Answer 1 · answered Sep 10 '18 at 20:22

We cab use R base gsub:

> df1$President <- gsub(".*_(\\w+)\\.txt$", "\\1", df1$V1)
> df1
                               V1 President
1 1994_post_elections_Mandela.txt   Mandela
2  1994_pre_elections_deKlerk.txt   deKlerk
3                1995_Mandela.txt   Mandela
4                1996_Mandela.txt   Mandela
5                1997_Mandela.txt   Mandela
6                1998_Mandela.txt   Mandela
7 1999_post_elections_Mandela.txt   Mandela
8  1999_pre_elections_Mandela.txt   Mandela
9                  2000_Mbeki.txt     Mbeki

Assume your data.frame is:

df1 <- read.table(text="1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt", header=FALSE, stringsAsFactors=FALSE)

score 1 · Answer 2 · answered Sep 10 '18 at 20:37

Since you say that you must use separate, here is a way. We can use str_count to get the maximum number of splits with _ separator, and then make our into argument for separate based on that. Combined with fill = "left", this means that we know the last split (the president.txt) will be in the last column. You can then remove .txt and the other columns as needed.

However, I think it is much simpler to just directly mutate the president name into a column with str_extract, as in the second example. This uses lookarounds to match letters preceded by _ and followed by .txt.

library(tidyverse)
tbl <- tibble(
  document = c(
    "1994_post_elections_Mandela.txt",
    "1994_pre_elections_deKlerk.txt",
    "1995_Mandela.txt",
    "1996_Mandela.txt",
    "1997_Mandela.txt",
    "1998_Mandela.txt",
    "1999_post_elections_Mandela.txt",
    "1999_pre_elections_Mandela.txt",
    "2000_Mbeki.txt"
  )
)

tbl %>%
  separate(
    col = document,
    into = str_c(
      "col",
      1:(as.integer(max(str_count(.$document, "_"))) + 1)
    ),
    sep = "_",
    fill = "left"
  )
#> # A tibble: 9 x 4
#>   col1  col2  col3      col4       
#>   <chr> <chr> <chr>     <chr>      
#> 1 1994  post  elections Mandela.txt
#> 2 1994  pre   elections deKlerk.txt
#> 3 <NA>  <NA>  1995      Mandela.txt
#> 4 <NA>  <NA>  1996      Mandela.txt
#> 5 <NA>  <NA>  1997      Mandela.txt
#> 6 <NA>  <NA>  1998      Mandela.txt
#> 7 1999  post  elections Mandela.txt
#> 8 1999  pre   elections Mandela.txt
#> 9 <NA>  <NA>  2000      Mbeki.txt

tbl %>%
  mutate(president = str_extract(document, "(?<=_)[:alpha:]*?(?=\\.txt)"))
#> # A tibble: 9 x 2
#>   document                        president
#>   <chr>                           <chr>    
#> 1 1994_post_elections_Mandela.txt Mandela  
#> 2 1994_pre_elections_deKlerk.txt  deKlerk  
#> 3 1995_Mandela.txt                Mandela  
#> 4 1996_Mandela.txt                Mandela  
#> 5 1997_Mandela.txt                Mandela  
#> 6 1998_Mandela.txt                Mandela  
#> 7 1999_post_elections_Mandela.txt Mandela  
#> 8 1999_pre_elections_Mandela.txt  Mandela  
#> 9 2000_Mbeki.txt                  Mbeki

Created on 2018-09-10 by the reprex package (v0.2.0).

Chaos · Answer 3 · 2018-09-10T20:48:38.580

This is quite straightforward using gsub or stringr/stringi. I could only come up with a tidyr::separate based solution after some jumping through hoops:

#### Create Data ####
pres_vector <- c("1994_post_elections_Mandela.txt", "1994_pre_elections_deKlerk.txt",
     "1995_Mandela.txt", "1996_Mandela.txt", "1997_Mandela.txt", "1998_Mandela.txt",
     "1999_post_elections_Mandela.txt", "1999_pre_elections_Mandela.txt", "2000_Mbeki.txt")

#### Libraries ####
library(stringi)
library(tidyr)

#### Solution ####    
pres_vector %>% stri_reverse %>% data.frame(x = .) %>% 
    separate(x, c("file_ext", "pres")) %>% { .[["pres"]] } %>% stri_reverse -> pres_names

print(pres_names)
[1] "Mandela" "deKlerk" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mbeki"

This works because of the pattern of the strings. Separate will split on alphanumeric characters by default. The last part of the string is the file extension, and the 2nd last part of the string is the President's name.

Thus, reversing the string puts the (reversed) file extension first and (reversed) President's name second. Separate allows us to extract these first 2 parts and subset to keep only the President's name. And finally, reversing this substring (the President's reversed name) gives us the President's name (without reversal).

score 0 · Answer 4 · answered Sep 10 '18 at 20:44

0

I prefer to use stringr for this kind of tasks (gsub is fine too)

library(stringr)
pattern <- ".*_(\\w+)\\.txt$"    
data$president <- str_extract(data$document, "(?<=_)[^_]+(?=\\.txt)")

Regex Demo

answered Sep 10 '18 at 20:44

wp78de

18,207
7
43
71

Separating a column based on regex

4 Answers4