I have a large list of file names that I need to extract information from using R. The fields are separated by dashes and underscores. I am having trouble finding a method that copes with the fact that the number of characters between delimiters is not consistent; the order of the fields stays constant, as (hopefully) do the delimiters.
For example:
library(stringr)  # provides str_sub()
f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav", "PA-RF-A50_20160614_082800.wav"), stringsAsFactors = FALSE)
# Extract each field by hard-coded character positions
f$area <- str_sub(f$filename, 1, 2)
f$rec <- str_sub(f$filename, 4, 6)
f$site <- str_sub(f$filename, 8, 12)
This produces correct results for the first file, but incorrect results for the second, because its fields are shorter and the fixed positions no longer line up with the delimiters.
I've tried the "stringr" and "stringi" packages. Hard-coding the positions clearly doesn't work, so I've come up with awkward workarounds that combine both packages, such as:
library(stringi)  # provides stri_locate_last() and stri_locate_first()
# Take everything between the last "-" and the first "_"
f$site <- str_sub(f$filename,
                  stri_locate_last(f$filename, fixed = "-")[, 1] + 1,
                  stri_locate_first(f$filename, fixed = "_")[, 1] - 1)
I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).
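To illustrate the kind of delimiter-based approach I have in mind, here is a minimal sketch in base R that splits on the delimiters instead of counting characters. It assumes every file name has the same number of fields and that dashes, underscores, and dots only ever appear as delimiters; `parts` is just an illustrative name, and I don't know whether this is the most robust way to do it.

```r
f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav",
                             "PA-RF-A50_20160614_082800.wav"),
                stringsAsFactors = FALSE)

# Split each file name on any single dash, underscore, or dot.
# strsplit() returns a list with one character vector per file name.
pieces <- strsplit(f$filename, "[-_.]")

# Stack the pieces into a character matrix: one row per file,
# one column per field (assumes all names have the same field count).
parts <- do.call(rbind, pieces)

# The first three fields are the ones extracted above by position.
f$area <- parts[, 1]
f$rec  <- parts[, 2]
f$site <- parts[, 3]
```

Because the split is driven by the delimiters, the width of each field no longer matters; the remaining columns of `parts` hold the other fields of the file name in the same order.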
I've looked at other examples ("Extract part of string (till the first semicolon) in R", "R: Find the last dot in a string", and "Split string using regular expressions and store it into data frame").
Any suggestions/pointers would be very much appreciated.