I am trying to extract tables from text files and have found several earlier posts here that address similar questions. However, none seem to work efficiently with my problem. The most helpful answer I have found is to one of my earlier questions here: R: removing header, footer and sporadic column headings when reading csv file

An example dummy text file contains:

> 
> 
> ###############################################################################
> 
> # Display AICc Table for the models above
> 
> 
> collect.models(, adjust = FALSE)
      model npar  AICc  DeltaAICc weight  Deviance
13      P1   19    94      0.00     0.78      9
12      P2   21    94      2.64     0.20      9
10      P3   15    94      9.44     0.02      9
2       P4   11    94    619.26     0.00      9
> 
> 
> ###############################################################################
> 
> # the three lines below count the number of errors in the code above
> 
> cat("ERROR COUNT:", .error.count, "\n")
ERROR COUNT: 0 
> options(error = old.error.fun)
> rm(.error.count, old.error.fun, new.error.fun)
> 
> ##########
> 
> 

I have written the following code to extract the desired table:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
my.data <- my.data[c(1:(length(my.data)-4))]
aa      <- as.data.frame(my.data)
aa

write.table(my.data, 'c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', quote = FALSE, col.names = FALSE, row.names = FALSE)
my.data2 <- read.table('c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', header = TRUE, row.names = c(1))
my.data2
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9

I would prefer to avoid having to write and then read my.data to obtain the desired data frame. Prior to that step the current code returns a vector of strings for my.data:

[1] "      model npar  AICc  DeltaAICc weight  Deviance" "13      P1   19    94      0.00     0.78      9"   
[3] "12      P2   21    94      2.64     0.20      9"    "10      P3   15    94      9.44     0.02      9"   
[5] "2       P4   11    94    619.26     0.00      9"

Is there some way I can convert the above vector of strings into a data frame like that in dummy.log.extraction.txt without writing and then reading my.data?

The line:

aa <- as.data.frame(my.data)

returns the following, which looks like what I want:

#                                              my.data
# 1       model npar  AICc  DeltaAICc weight  Deviance
# 2    13      P1   19    94      0.00     0.78      9
# 3    12      P2   21    94      2.64     0.20      9
# 4    10      P3   15    94      9.44     0.02      9
# 5    2       P4   11    94    619.26     0.00      9

However:

dim(aa)
# [1] 5 1

If I can split aa into columns then I think I will have what I want without having to write and then read my.data.

I found the post "Extracting Data from Text Files". However, in the posted answer the table in question seems to have a fixed number of rows. In my case the number of rows can vary between 1 and 20. Also, I would prefer to use base R. In my case I think the number of lines between the last row of the table and bottom is constant (here 4).

I also found the post "How to extract data from a text file using R or PowerShell?". However, in my case the column widths are not fixed and I do not know how to split the strings (or rows) so that there are only seven columns.

Given all of the above perhaps my question is really how to split the object aa into columns. Thank you for any advice or assistance.

EDIT:

The actual logs are produced by a supercomputer and contain up to 90,000 lines. However, the number of lines varies greatly among logs. That is why I was making use of top and bottom.

oguz ismail
Mark Miller
  • Your data looks like console output from an R session. One wonders why the table hasn't been exported or why you can't run the R code to get it. – Roland Jul 04 '13 at 07:54
  • The R file is run on a supercomputer and the table is taken from the log returned by that machine. I do not know how to ask the supercomputer to export a table for me. – Mark Miller Jul 04 '13 at 08:03

4 Answers

3

read.table and its family now have a text argument, so they can read from strings instead of a file:

> df <- read.table(text = paste(my.data, collapse = "\n"))
> df
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9
> summary(df)
 model       npar           AICc      DeltaAICc          weight         Deviance
 P1:1   Min.   :11.0   Min.   :94   Min.   :  0.00   Min.   :0.000   Min.   :9  
 P2:1   1st Qu.:14.0   1st Qu.:94   1st Qu.:  1.98   1st Qu.:0.015   1st Qu.:9  
 P3:1   Median :17.0   Median :94   Median :  6.04   Median :0.110   Median :9  
 P4:1   Mean   :16.5   Mean   :94   Mean   :157.84   Mean   :0.250   Mean   :9  
        3rd Qu.:19.5   3rd Qu.:94   3rd Qu.:161.90   3rd Qu.:0.345   3rd Qu.:9  
        Max.   :21.0   Max.   :94   Max.   :619.26   Max.   :0.780   Max.   :9  
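
For a version that can be run without the log file, here is a minimal sketch, assuming my.data is the five-element character vector shown in the question (typed in by hand below). It passes the vector straight to text =, where each element is treated as one line. The header and the row names are picked up automatically because the first line has one fewer field than the lines below it. stringsAsFactors = FALSE is my own addition to keep model as character.

```r
# The vector of strings from the question, entered by hand for illustration
my.data <- c(
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "10      P3   15    94      9.44     0.02      9",
  "2       P4   11    94    619.26     0.00      9"
)

# Parse the strings in memory; no intermediate file is written
df <- read.table(text = my.data, stringsAsFactors = FALSE)

dim(df)   # 4 rows, 6 columns
str(df)
```

The leading numbers (13, 12, 10, 2) end up as row names rather than as a seventh column.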
kohske
  • Thank you. I should have mentioned that the log file contains approximately 20,000 lines, which is why I was making use of top and bottom. However, your answer may help. – Mark Miller Jul 04 '13 at 08:15
3

Maybe your real log file is quite different and more complex, but with this one you can use read.table directly; you just have to choose the right parameters.

data <- read.table("c:/users/mmiller21/simple R programs/dummy.log",
                   comment.char = ">",
                   nrows = 4,
                   skip = 1,
                   header = TRUE,
                   row.names = 1)

str(data)
## 'data.frame':    4 obs. of  6 variables:
##  $ model    : Factor w/ 4 levels "P1","P2","P3",..: 1 2 3 4
##  $ npar     : int  19 21 15 11
##  $ AICc     : int  94 94 94 94
##  $ DeltaAICc: num  0 2.64 9.44 619.26
##  $ weight   : num  0.78 0.2 0.02 0
##  $ Deviance : int  9 9 9 9

data
##    model npar AICc DeltaAICc weight Deviance
## 13    P1   19   94      0.00   0.78        9
## 12    P2   21   94      2.64   0.20        9
## 10    P3   15   94      9.44   0.02        9
## 2     P4   11   94    619.26   0.00        9
dickoa
  • Thank you. I should have mentioned that the log file contains approximately 20,000 lines, which is why I was making use of top and bottom. However, your answer may help. – Mark Miller Jul 04 '13 at 08:14
1

It seems strange that you have to read R console output. In any case, you can use the fact that your table lines begin with a number and extract the interesting lines with a pattern like ^[0-9]+. Then read.table, as shown by @kohske, does the rest.

ll  <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
idx <- grep('^[0-9]+', ll)       ## lines that start with a digit (table rows)
idx <- c(min(idx) - 1, idx)      ## add the header line just above the first row
read.table(text = ll[idx])
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9
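
The same idea as a self-contained sketch, using a shortened in-memory stand-in for the log (the vector ll below is made up from lines in the question). It assumes that only table rows start with a digit and that the header sits directly above the first row. In a log of tens of thousands of lines other output may also start with a digit, so the pattern might need tightening.

```r
# Shortened stand-in for the log file, entered by hand for illustration
ll <- c(
  "> ",
  "> collect.models(, adjust = FALSE)",
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "2       P4   11    94    619.26     0.00      9",
  "> ",
  "ERROR COUNT: 0 "
)

idx <- grep("^[0-9]+", ll)        # positions of lines starting with a digit
idx <- c(min(idx) - 1, idx)       # prepend the header line
tab <- read.table(text = ll[idx])
tab
```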
agstudy
  • Thank you. I should have mentioned that the log file contains approximately 20,000 lines, which is why I was making use of top and bottom. However, your answer may help. – Mark Miller Jul 04 '13 at 08:14
0

Thank you to those who posted answers. Because of the size, complexity and variability of the actual log files I think I need to continue to make use of the variables top and bottom. However, I used elements of dickoa's answer to come up with the following.

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]

x <- read.table(text=my.data, comment.char = ">")
x

#    model npar AICc DeltaAICc weight Deviance
# 13    P1   19   94      0.00   0.78        9
# 12    P2   21   94      2.64   0.20        9
# 10    P3   15   94      9.44   0.02        9
# 2     P4   11   94    619.26   0.00        9

Here is even simpler code:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data  <- my.data[grep(top, my.data):grep(bottom, my.data)]

x <- read.table(text=my.data, comment.char = ">")
x
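
Since the logs vary a lot, the top/bottom approach can be wrapped in a small function that fails loudly when a marker is missing or appears more than once. This is only a sketch: extract_table is a hypothetical name, log_lines is a shortened stand-in for the real log, and it assumes, as in the dummy log, that every non-table line between the markers starts with ">". Using fixed = TRUE matches the markers literally, so the parenthesis in top needs no escaping.

```r
# Hypothetical helper: read the table that sits between two marker lines
extract_table <- function(lines, top, bottom) {
  # fixed = TRUE treats the markers as plain strings, not regular expressions
  start <- grep(top, lines, fixed = TRUE)
  end   <- grep(bottom, lines, fixed = TRUE)
  if (length(start) != 1L || length(end) != 1L || start >= end) {
    stop("expected exactly one top marker followed by one bottom marker")
  }
  # Lines starting with ">" (including both markers) are dropped as comments
  read.table(text = lines[start:end], comment.char = ">")
}

# Shortened stand-in for the log in the question
log_lines <- c(
  "> # Display AICc Table for the models above",
  "> collect.models(, adjust = FALSE)",
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "> ",
  "> ###############",
  "> # the three lines below count the number of errors in the code above",
  "ERROR COUNT: 0 "
)

x <- extract_table(
  log_lines,
  top    = "> collect.models(, adjust = FALSE)",
  bottom = "> # the three lines below count the number of errors in the code above"
)
x
```

With a real log, lines would come from readLines() on the log file instead of the hand-typed vector.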
Mark Miller