
There appear to be similar questions to this in other languages but I can't find one in R.

I have a number of text files in the subdirectories of a directory; they all have the extension (.log) and they contain a mixture of text and data. I want to extract a couple of lines from these relatively large files.

For example, one file goes as follows ...

blahblahblah

NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210

blahblahblah

 ----------------------------------------
 CPU timing information for all processes
 ========================================
 0: 8853.469 + 133.948 = 8987.417
 1: 8850.817 + 126.587 = 8977.405
 2: 8851.925 + 128.576 = 8980.501
 3: 8847.992 + 125.871 = 8973.864
 ----------------------------------------
 ddikick.x: exited gracefully.

blahblahblah

I want to harvest the number of basis functions (210 in this example) and the total CPU time, i.e. the sum of the last column of the timing table.

The line "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =" occurs exactly once in each file; i.e., if I open the file in a text editor and search for this string, I find only this one line. The same is true of "CPU timing information for all processes" and "exited gracefully".

I appreciate that it appears that I haven't done a lot to help myself but I just don't know where to start. If someone could point me in the right direction, I hope to be able to fill in the rest.

After the help given to me by @Ben (see below), here is the code that I ended up using:

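library(stringr)  # provides str_extract() and str_extract_all()

filesearch <- function (x) {
  f <- readLines(x)
  # Basis-function count: the trailing integer on the unique matching line
  cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f,
                value=TRUE)
  val <- as.numeric(str_extract(cline, "[0-9]+$"))
  # Line number of the CPU timing header; the four data rows start two lines below it
  coline <- grep("^ +CPU timing information", f)
  numstr <- sapply(str_extract_all(f[coline+2:5], "[0-9.]+"), as.numeric)
  # Row 4 holds the per-process totals; sum them and divide by 60
  cline1 <- sum(numstr[4,])/60
  output <- c(val, cline1)
  cat(output, "\n")   # print the two values
  invisible(output)   # and also return them so they can be captured
}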

I sourced this function and keyed in the file that I needed each time, then I transferred the two results to another file by hand. Not as elegant as I'd like but it saved me a lot of time doing it this way. Thanks again to @Ben.

BenMorel
DarrenRhodes

1 Answer


maybe

library(stringr)
f <- readLines("datafile.txt")
cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS",f,
                    value=TRUE)
val <- as.numeric(str_extract(cline,"[0-9]+$"))

will work?
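As a small illustration of why the value=TRUE argument matters here, the sketch below runs grep both ways on a three-line stand-in for the file. The vector f and the names pat, idx, and line are made up for the example, and it uses base R's regmatches()/regexpr() in place of str_extract() only so that it runs without stringr:

```r
# Stand-in for readLines() output; the middle line is copied from the question.
f <- c("blahblahblah",
       "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210",
       "blahblahblah")
pat <- "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS"

idx  <- grep(pat, f)                 # without value=TRUE: the line NUMBER of the match
line <- grep(pat, f, value = TRUE)   # with value=TRUE: the matching line itself

# Pull the trailing run of digits off the matched line and convert it.
val <- as.numeric(regmatches(line, regexpr("[0-9]+$", line)))
```

Without value=TRUE there are no trailing digits to extract from the file's text, because what you get back is just the position of the line.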

To get the other values, try

cline <- grep("^ +CPU timing information",f)
(numstr <- sapply(str_extract_all(f[cline+2:5],"[0-9.]+"),as.numeric))
##         [,1]     [,2]     [,3]     [,4]
## [1,]    0.000    1.000    2.000    3.000
## [2,] 8853.469 8850.817 8851.925 8847.992
## [3,]  133.948  126.587  128.576  125.871
## [4,] 8987.417 8977.405 8980.501 8973.864

The sapply has transposed the matrix of values, so the last row is the bit we want (corresponds to the last column in the file). Extract it using numstr[4,] or numstr[nrow(numstr),] or tail(numstr,1).
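To see the whole step end to end, here is a sketch that rebuilds the same matrix from the timing lines in the question and sums the totals. The vector f is just those lines typed in by hand, cpu_total is a name invented for the example, and regmatches()/gregexpr() stand in for str_extract_all() so the snippet needs no packages:

```r
# Timing block copied from the question; `f` stands in for readLines() output.
f <- c(
  " ----------------------------------------",
  " CPU timing information for all processes",
  " ========================================",
  " 0: 8853.469 + 133.948 = 8987.417",
  " 1: 8850.817 + 126.587 = 8977.405",
  " 2: 8851.925 + 128.576 = 8980.501",
  " 3: 8847.992 + 125.871 = 8973.864",
  " ----------------------------------------"
)

cline <- grep("^ +CPU timing information", f)  # line number of the header
rows  <- f[cline + 2:5]  # `:` binds tighter than `+`, so this is cline + c(2, 3, 4, 5)

# One numeric vector per line, then one COLUMN per line (the transposed layout).
nums   <- lapply(regmatches(rows, gregexpr("[0-9.]+", rows)), as.numeric)
numstr <- do.call(cbind, nums)

totals    <- numstr[nrow(numstr), ]  # last row = last column of the file
cpu_total <- sum(totals)             # sum of the four per-process totals
```

Note that the index picks up the four lines after the "====" rule, not the header line itself, which is why cline on its own only ever returns a line number.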

(edit: allow spaces before the "CPU timing" string) (edit: do it right!)

(To do this for all the log files, package it in a function and use list.files(pattern="\\.log$") in combination with sapply ...)
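A rough sketch of what that could look like is below. The names extract_log and collect_logs are hypothetical, not from any package. It assumes every log has exactly four timing rows as in the example, uses recursive=TRUE because the question says the files sit in subdirectories, and sticks to base R (sub() instead of stringr) so it runs as is:

```r
# Hypothetical helper: pull both values out of one log file.
extract_log <- function(path) {
  f <- readLines(path, warn = FALSE)

  # Basis-function count: the integer after "=" on the unique matching line.
  bline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  basis <- as.numeric(sub(".*=[[:space:]]*([0-9]+)[[:space:]]*$", "\\1", bline[1]))

  # CPU totals: the number after "=" on each of the four rows below the header.
  # ASSUMPTION: four processes per file, as in the question's example.
  cline  <- grep("^ *CPU timing information", f)[1]
  rows   <- f[cline + 2:5]
  totals <- as.numeric(sub(".*=[[:space:]]*([0-9.]+)[[:space:]]*$", "\\1", rows))

  data.frame(file = path, basis = basis, cpu_total = sum(totals),
             stringsAsFactors = FALSE)
}

# Hypothetical wrapper: run the helper over every .log file under `dir`.
collect_logs <- function(dir = ".") {
  logs <- list.files(dir, pattern = "\\.log$", recursive = TRUE, full.names = TRUE)
  do.call(rbind, lapply(logs, extract_log))
}

# Usage (not run here):
# results <- collect_logs("path/to/top/directory")
# write.csv(results, "summary.csv", row.names = FALSE)
```

Returning a one-row data frame per file, rather than printing, means the results can be bound together and written out in one go instead of being copied by hand.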

Ben Bolker
  • Thanks, I'll do this in a couple of hrs and report back tomorrow. – DarrenRhodes Jan 10 '13 at 17:01
  • Hi @Ben, if I store the result of readLines as follows, a <- readLines("datafile.txt"), and then put the value 'a' into your code, str_extract(grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS",a),"[0-9]+$"), I get the position number but in quotes. Taking this number 'x' and using it as follows, a[x], returns the line I want. So, how can I get an output without the quotes? – DarrenRhodes Jan 10 '13 at 21:16
  • you forgot the `value=TRUE` argument to `grep`, which is critical. I edited my answer to try to make it clearer. – Ben Bolker Jan 10 '13 at 21:44
  • Hi @Ben - the other part of the code, numstr, returns list(). Going up one line, cline returns, integer(0). Will keep doodling and report back. – DarrenRhodes Jan 11 '13 at 09:38
  • If you really have a line that begins "CPU timing information" in your file, then `numstr` should find it ... ? – Ben Bolker Jan 11 '13 at 14:53
  • Hi @Ben, "CPU timing information" is in the file, it's on line 6689 which I can see by looking at what's behind the value f. Your list.files code is looking good, too. – DarrenRhodes Jan 11 '13 at 17:47
  • I see the problem: I was using "^" to insist that "CPU timing" occurred at the beginning of the line; I didn't notice that there was a space preceding the string. I've edited my solution. – Ben Bolker Jan 11 '13 at 18:52
  • Hi @Ben, will apply your changed code on Monday and report back. (My question formatting is still a bit poor). Thanks for sticking with the problem. – DarrenRhodes Jan 11 '13 at 23:50
  • Hi @Ben, going through the code line by line; it works well until I get up to, "To get the other values, try" in this case 'cline' returns 41954. I've interrogated the file by other means and these numbers occur once as part of a coordinate, 0.0419548629. – DarrenRhodes Jan 14 '13 at 11:49
  • 41954 is the line number; `f[cline]` should be the line that contains your desired values – Ben Bolker Jan 14 '13 at 12:55
  • Hi @Ben, just pushed my keypad to one side as I banged my head on the desk a couple of times. Line number 41954 is the line that reads " CPU timing information for all processes". Thanks, again. I just need to puzzle out how to get the sum of four numbers of the last column of the table below this line. – DarrenRhodes Jan 14 '13 at 13:38
  • something like `sum(numstr[,ncol(numstr)])` ? – Ben Bolker Jan 14 '13 at 14:10
  • Hi @Ben, does the str_extract_all extract the numbers from all the lines subsequent to the "CPU timing" line? It appears to be extracting only from that line (which we all now know is line No 41954). – DarrenRhodes Jan 16 '13 at 14:02
  • cheers @Ben. Thanks for this ... I'll try and wrap this into a function and if successful, I'll post it back here; if not, I'll be back asking for more help. – DarrenRhodes Jan 16 '13 at 14:37