
I have a large list of file names from which I need to extract information using R. The info is delimited by multiple dashes and underscores, and I'm having trouble finding a method that accommodates the fact that the number of characters between delimiters is not consistent (the order of the information will remain constant, as will the delimiters used (hopefully)).

For example:

 library(stringr)

 f <- data.frame(c("EI-SM4-AMW11_20160614_082800.wav",
                   "PA-RF-A50_20160614_082800.wav"),
                 stringsAsFactors = FALSE)
 colnames(f) <- "filename"
 f$area <- str_sub(f$filename, 1, 2)   # characters 1-2
 f$rec  <- str_sub(f$filename, 4, 6)   # characters 4-6
 f$site <- str_sub(f$filename, 8, 12)  # characters 8-12

This produces correct results for the first file, but incorrect results for the second.

I've tried using the "stringr" and "stringi" packages, and know that hard-coding the positions doesn't work, so I've come up with awkward solutions using both packages, such as:

f$site <- str_sub(f$filename, 
                  stri_locate_last(f$filename, fixed="-")[,1]+1, 
                  stri_locate_first(f$filename, fixed="_")[,1]-1)

I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).

I've looked at other examples (Extract part of string (till the first semicolon) in R, R: Find the last dot in a string, Split string using regular expressions and store it into data frame).

Any suggestions/pointers would be very much appreciated.

JMDR
    There is no description in natural language that matches either of those efforts. You have two instances of dashes and two instances of underscores, and you only want two or three items. _Describe_ what you want rather than presenting failing code. – IRTFM Sep 13 '16 at 22:12

2 Answers


Try this, from the `tidyr` package:

library(tidyr)

f %>% separate(filename, c('area', 'rec', 'site'), sep = '-')

You can also split on multiple different delimiters, like so:

f %>% separate(filename, c('area', 'rec', 'site', 'date', 'don_know_what_this_is', 'file_extension'), sep = '-|_|\\.')

and then keep only the columns you want using dplyr's select function:

 library(dplyr)
 library(tidyr)

 f %>% 
   separate(filename,
            c('area', 'rec', 'site', 'date',
              'don_know_what_this_is', 'file_extension'), 
            sep = '-|_|\\.') %>%
   select(area, rec, site)
RoyalTS
    `separate` splits by any delimiter by default (note what happens if you don't define one in your second example), and drops any "extra" pieces beyond the number of new columns you define so you don't need to do a second `select` step. To keep the original column, see `remove = FALSE`. – aosmith Sep 13 '16 at 22:05
  • Thank you RoyalTS—I hadn't thought of using tidyr for this task, but this solution worked perfectly. I did use aosmith's suggestion to not define the delimiters, which makes the solution even more robust to people putting weird things in the file prefixes. (P.S. the last chunk before the file extension is the time in hhmmss) – JMDR Sep 14 '16 at 14:51
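A minimal sketch of the behavior aosmith describes, assuming the example data from the question (`extra = "drop"` just silences the warning that `separate` would otherwise emit when it discards the surplus pieces):

```r
library(tidyr)

f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav",
                             "PA-RF-A50_20160614_082800.wav"),
                stringsAsFactors = FALSE)

# No sep argument: separate() splits on any run of non-alphanumeric
# characters, so dashes, underscores, and the dot are all handled.
# extra = "drop" discards the pieces beyond the three named columns,
# and remove = FALSE keeps the original filename column.
out <- separate(f, filename, c("area", "rec", "site"),
                extra = "drop", remove = FALSE)
```

This gives `area = c("EI", "PA")`, `rec = c("SM4", "RF")`, and `site = c("AMW11", "A50")` without naming columns for the date, time, and extension pieces.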

Something like this:

library(stringr)
library(dplyr)

f$area <- word(f$filename, 1, sep = "-")
f$rec  <- word(f$filename, 2, sep = "-")
f$site <- word(f$filename, 3, sep = "-") %>%
  word(1, sep = "_")

dplyr is not strictly necessary here, but its pipe makes the chaining cleaner. The `word` function belongs to stringr.
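For completeness, the same extraction can be sketched in base R with `strsplit`, with no package dependencies (this assumes the only delimiters are dashes, underscores, and the dot before the extension):

```r
f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav",
                             "PA-RF-A50_20160614_082800.wav"),
                stringsAsFactors = FALSE)

# Split each filename on "-", "_", or "." (all literal inside the class),
# then pull out the first three pieces per file.
parts <- strsplit(f$filename, "[-_.]")
f$area <- sapply(parts, `[`, 1)
f$rec  <- sapply(parts, `[`, 2)
f$site <- sapply(parts, `[`, 3)
```

Like the tidyr solution, this is robust to pieces of varying length because it splits on the delimiters rather than on character positions.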

thepule