grep or pmatch?

Question

I am trying to import a series of files from a directory and convert each of them into a data frame. I would also like to use the file name to create two new columns with name-dependent values. Input files have the format xx_yy.out, where xx can currently be one of three values and yy currently has two possible values. In the future these numbers will go up.

Edit: solution based on the comments (see below for the original question), edited again to reflect the suggestions of @JoshO'Brien.

filelist <- dir(pattern = "\\.out$")                 # files whose names end in ".out"

for (i in filelist) {

    tempdata  <- read.table(i)                       # read the table
    filelistshort <- sub("\\.out$", "", i)           # remove the ".out" extension
    tempsplit <- strsplit(filelistshort, "_")        # split the name at the underscore
    xx <- sapply(tempsplit, "[", 1)                  # get xx
    yy <- sapply(tempsplit, "[", 2)                  # get yy
    tempdata$XX <- xx                                # add XX column
    tempdata$YY <- yy                                # add YY column
    assign(filelistshort, tempdata)                  # name the data frame after the file

}

Below is the original code, showing that I wanted some way to get the XX and YY values but wasn't sure of the best approach:

My outline (after @romanlustrik's post) is as follows:

filelist <- as.list(dir(pattern = "\\.out$"))
lapply(filelist, FUN = function(x) {
    xx <- NA    # placeholder: lookup via grep() or pmatch()
    yy <- NA    # placeholder: lookup via grep() or pmatch()
    x <- read.table(x)    # read.table() already returns a data frame
    x$colx <- xx
    x$coly <- yy
    return(x)
})

where the xx <- and yy <- lines would be a lookup based on either pmatch or grep. I am playing around to make either one work but would welcome any suggestions.

score 2 · Accepted Answer · answered Oct 16 '11 at 20:46


If we can assume that your file names will contain only a single "_", I wouldn't use grep() or pmatch() at all.

strsplit() seems to provide a cleaner and simpler solution:

filelist <- c("aa_mm.out", "bb_mm.out", "cc_nn.out")

# Remove the trailing ".out"
rootNames <- gsub("\\.out$", "", filelist)   # "\\." matches a literal dot

# Split string at the "_"
rootParts <- strsplit(rootNames, "_")

# Extract the first and second parts into character vectors
xx <- sapply(rootParts, "[", 1)
yy <- sapply(rootParts, "[", 2)

xx
# [1] "aa" "bb" "cc"
yy
# [1] "mm" "mm" "nn"
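The steps above could be wrapped in a small helper, sketched here under the same assumption of exactly one "_" per file name. The function name `split_out_names` and the column names `XX` and `YY` are made up for illustration and are not part of the answer.

```r
# Hypothetical helper: split names of the form "xx_yy.out" into their two parts.
split_out_names <- function(files) {
  roots <- sub("\\.out$", "", basename(files))   # drop directory and extension
  parts <- strsplit(roots, "_", fixed = TRUE)    # split at the literal "_"
  # Assumption carried over from the answer: exactly one "_" per name
  stopifnot(all(lengths(parts) == 2))
  data.frame(file = files,
             XX = vapply(parts, `[`, character(1), 1),
             YY = vapply(parts, `[`, character(1), 2),
             stringsAsFactors = FALSE)
}

parts <- split_out_names(c("aa_mm.out", "bb_mm.out", "cc_nn.out"))
parts$XX
# [1] "aa" "bb" "cc"
parts$YY
# [1] "mm" "mm" "nn"
```

The `stopifnot()` check makes the helper fail loudly if a file name ever contains more or fewer underscores than expected, rather than silently returning `NA`.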

Josh O'Brien


@zach -- No problem. For compactness and reliability, you might want to put all of your calculations into a single `for()` loop or call to `lapply()`. If you go the `lapply()` route, you'll need to ensure that assignment takes place into the global environment, by specifying `assign("objectName", object, envir=.GlobalEnv)`. – Josh O'Brien Oct 16 '11 at 22:49
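A minimal sketch of the `lapply()` route described in the comment above, assuming the files sit in one directory. The temporary directory and the two tiny example files exist only to make the sketch self-contained; they are not from the original question.

```r
# Create two small example files in a temporary directory (illustrative data only)
tmp <- tempfile("outfiles")
dir.create(tmp)
write.table(data.frame(a = 1:2), file.path(tmp, "aa_mm.out"))
write.table(data.frame(a = 3:4), file.path(tmp, "bb_nn.out"))

files <- dir(tmp, pattern = "\\.out$", full.names = TRUE)

invisible(lapply(files, function(f) {
  root  <- sub("\\.out$", "", basename(f))   # e.g. "aa_mm"
  parts <- strsplit(root, "_")[[1]]          # e.g. c("aa", "mm")
  dat   <- read.table(f)
  dat$XX <- parts[1]
  dat$YY <- parts[2]
  # Inside a function, assign() targets the function's own environment by
  # default, so the global environment is named explicitly
  assign(root, dat, envir = .GlobalEnv)
}))

aa_mm$XX
# [1] "aa" "aa"
```

Without `envir = .GlobalEnv`, each data frame would be created inside the anonymous function and discarded when it returns.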

score 0 · Answer 2 · answered Oct 16 '11 at 20:40

This is an ugly hack, but it gets the job done.

fl <- c("12_34.out", "ab_23.out", "02_rk.out")
# Locate the two characters before the "_"; the match itself includes the "_"
xx <- regexpr(pattern = ".._", text = fl)
XX <- substr(fl, start = xx, stop = xx + attr(xx, "match.length") - 2)
XX
# [1] "12" "ab" "02"
# Locate the "_" and the two characters after it
yy <- regexpr(pattern = "_..", text = fl)
YY <- substr(fl, start = yy + 1, stop = yy + attr(yy, "match.length") - 1)
YY
# [1] "34" "23" "rk"
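As a variation on this answer (not the author's code), the same extraction can be sketched with `sub()` and capture groups, which avoids the position arithmetic and does not require the parts to be exactly two characters long. The pattern assumes a single "_" and a ".out" suffix.

```r
fl <- c("12_34.out", "ab_23.out", "02_rk.out")

# Two capture groups: everything before the "_" and everything between
# the "_" and the ".out" suffix
pattern <- "^([^_]+)_([^_]+)\\.out$"

XX <- sub(pattern, "\\1", fl)   # keep only the first group
YY <- sub(pattern, "\\2", fl)   # keep only the second group

XX
# [1] "12" "ab" "02"
YY
# [1] "34" "23" "rk"
```

Note that `sub()` returns a name unchanged when the pattern does not match, so names in a different format would need a separate check.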
