
I have a file, "prf003.out", that looks like this:

[~150 lines of other tables that report other things in this text file, omitted...]



     Aboveground Live    Belowground                        Forest             Total    Total     Carbon
     ----------------- -----------------    Stand  -------------------------    Stand  Removed   Released
YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire
--------------------------------------------------------------------------------------------------------------
2000     15.6     15.6      6.0      0.5      0.0      4.5      2.6      0.0     29.1      0.0        0.0
2001     15.6     15.6      6.0      0.4      0.0      4.2      2.5      0.0     28.7      0.0        0.0
2002     15.6     15.6      6.0      0.4      0.0      3.9      2.5      0.0     28.4      0.0        0.0
2003     15.6     15.6      6.0      0.4      0.0      3.7      2.5      0.0     28.1      0.0        0.0
2004     15.6     15.6      6.0      0.4      0.0      3.5      2.5      0.0     27.9      0.0        0.0
2005     16.6     16.6      6.0      1.0      1.3      3.6      2.5      0.0     30.9      0.0        0.0
2006     16.6     16.6      6.0      0.9      0.8      3.8      2.4      0.0     30.6      0.0        0.0
2007     16.6     16.6      6.0      0.9      0.6      3.8      2.4      0.0     30.3      0.0        0.0
2008     16.6     16.6      6.0      0.9      0.4      3.7      2.4      0.0     30.0      0.0        0.0
2009     16.6     16.6      6.0      0.8      0.2      3.7      2.4      0.0     29.8      0.0        0.0
2010     18.1     18.1      6.3      1.2      1.0      3.8      2.4      0.0     32.8      0.0        0.0
2011     18.1     18.1      6.3      1.1      0.6      4.0      2.4      0.0     32.5      0.0        0.0
2012     18.1     18.1      6.3      1.1      0.4      3.9      2.4      0.0     32.2      0.0        0.0
2013     18.1     18.1      6.3      1.0      0.3      3.9      2.4      0.0     31.9      0.0        0.0
2014     18.1     18.1      6.3      1.0      0.2      3.8      2.4      0.0     31.7      0.0        0.0
2015     19.1     19.1      6.5      1.4      1.1      3.9      2.4      0.0     34.3      0.0        0.0
2016     19.1     19.1      6.5      1.3      0.7      4.1      2.4      0.0     34.0      0.0        0.0
2017     19.1     19.1      6.5      1.3      0.5      4.0      2.4      0.0     33.8      0.0        0.0
2018     19.1     19.1      6.5      1.2      0.3      4.0      2.4      0.0     33.5      0.0        0.0
2019     19.1     19.1      6.5      1.2      0.2      3.9      2.4      0.0     33.2      0.0        0.0
2020     19.0     19.0      6.3      1.9      1.8      4.2      2.4      0.0     35.6      0.0        0.0
2021     19.0     19.0      6.3      1.8      1.3      4.5      2.4      0.0     35.3      0.0        0.0
2022     19.0     19.0      6.3      1.7      1.0      4.6      2.4      0.0     35.0      0.0        0.0
2023     19.0     19.0      6.3      1.6      0.7      4.6      2.4      0.0     34.7      0.0        0.0
2024     19.0     19.0      6.3      1.6      0.5      4.6      2.4      0.0     34.4      0.0        0.0
2025     19.0     19.0      6.3      2.2      2.0      4.9      2.4      0.0     36.7      0.0        0.0
2026     19.0     19.0      6.3      2.1      1.3      5.3      2.4      0.0     36.4      0.0        0.0
2027     19.0     19.0      6.3      2.0      1.0      5.4      2.4      0.0     36.0      0.0        0.0
2028     19.0     19.0      6.3      1.9      0.7      5.4      2.4      0.0     35.7      0.0        0.0
2029     19.0     19.0      6.3      1.9      0.5      5.4      2.4      0.0     35.4      0.0        0.0
2030     19.4     19.4      6.5      2.2      1.4      5.6      2.4      0.0     37.5      0.0        0.0
2031     19.4     19.4      6.5      2.1      0.8      5.9      2.4      0.0     37.2      0.0        0.0
2032     19.4     19.4      6.5      2.0      0.6      5.9      2.4      0.0     36.8      0.0        0.0
2033     19.4     19.4      6.5      1.9      0.4      5.8      2.4      0.0     36.5      0.0        0.0
2034     19.4     19.4      6.5      1.9      0.3      5.7      2.4      0.0     36.1      0.0        0.0
2035     18.6     18.6      6.3      2.6      2.1      6.0      2.4      0.0     38.0      0.0        0.0
2036     18.6     18.6      6.3      2.5      1.5      6.4      2.4      0.0     37.6      0.0        0.0
2037     18.6     18.6      6.3      2.4      1.1      6.4      2.4      0.0     37.2      0.0        0.0
2038     18.6     18.6      6.3      2.3      0.8      6.5      2.4      0.0     36.9      0.0        0.0
2039     18.6     18.6      6.3      2.2      0.6      6.5      2.4      0.0     36.5      0.0        0.0
2040     19.4     19.4      6.7      2.3      1.0      6.6      2.4      0.0     38.3      0.0        0.0
2041     19.4     19.4      6.7      2.2      0.6      6.6      2.4      0.0     38.0      0.0        0.0
2042     19.4     19.4      6.7      2.1      0.5      6.5      2.4      0.0     37.6      0.0        0.0
2043     19.4     19.4      6.7      2.0      0.4      6.4      2.4      0.0     37.3      0.0        0.0
2044     19.4     19.4      6.7      2.0      0.3      6.3      2.4      0.0     36.9      0.0        0.0
2045     17.9     17.9      6.3      2.8      2.5      6.6      2.4      0.0     38.5      0.0        0.0
2046     17.9     17.9      6.3      2.7      1.8      7.0      2.4      0.0     38.1      0.0        0.0
2047     17.9     17.9      6.3      2.6      1.4      7.1      2.4      0.0     37.7      0.0        0.0
2048     17.9     17.9      6.3      2.5      1.0      7.2      2.4      0.0     37.3      0.0        0.0
2049     17.9     17.9      6.3      2.4      0.7      7.2      2.4      0.0     36.9      0.0        0.0


[...a few more tables follow...]

I am trying to extract this particular table from that file. As you can see, the placeholder at the top stands in for a whole bunch of other tables generated in this .txt file, and a whole bunch more tables are written to the same file after this one.

What I am trying to do is similar to this question, but now I am stuck: Extracting Data from Text Files

Here is what I did:

library(stringr)  # for str_extract_all()

data <- readLines("prf003.out")
data

# value = FALSE makes grep() return the line number of the match, not the matching text
cline <- grep("YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire", data, value= FALSE)
cline

# don't use str_extract, use str_extract_all
numstr <- sapply(str_extract_all(data[cline+1:51],"[0-9]"),as.numeric)
numstr

However, the output I get is wonky and doesn't format my data the way I want (i.e., it doesn't just give me a copy of the original table that I can then process in R):

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
 [1,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2
 [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 [3,]    0    0    0    0    0    0    0    0    0     0     1     1     1     1     1     1     1     1     1     1     2     2     2     2     2     2     2     2     2     2     3     3     3     3     3     3     3     3     3
 [4,]    0    1    2    3    4    5    6    7    8     9     0     1     2     3     4     5     6     7     8     9     0     1     2     3     4     5     6     7     8     9     0     1     2     3     4     5     6     7     8
 [5,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 [6,]    5    5    5    5    5    6    6    6    6     6     8     8     8     8     8     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     8     8     8     8
 [7,]    6    6    6    6    6    6    6    6    6     6     1     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     4     4     4     4     4     6     6     6     6
 [8,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 [9,]    5    5    5    5    5    6    6    6    6     6     8     8     8     8     8     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     8     8     8     8
[10,]    6    6    6    6    6    6    6    6    6     6     1     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     4     4     4     4     4     6     6     6     6
[11,]    6    6    6    6    6    6    6    6    6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6
[12,]    0    0    0    0    0    0    0    0    0     0     3     3     3     3     3     5     5     5     5     5     3     3     3     3     3     3     3     3     3     3     5     5     5     5     5     3     3     3     3
[13,]    0    0    0    0    0    1    0    0    0     0     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     2     2     2     1     1     2     2     2     1     1     2     2     2     2
[14,]    5    4    4    4    4    0    9    9    9     8     2     1     1     0     0     4     3     3     2     2     9     8     7     6     6     2     1     0     9     9     2     1     0     9     9     6     5     4     3
[15,]    0    0    0    0    0    1    0    0    0     0     1     0     0     0     0     1     0     0     0     0     1     1     1     0     0     2     1     1     0     0     1     0     0     0     0     2     1     1     0
[16,]    0    0    0    0    0    3    8    6    4     2     0     6     4     3     2     1     7     5     3     2     8     3     0     7     5     0     3     0     7     5     4     8     6     4     3     1     5     1     8
[17,]    4    4    3    3    3    3    3    3    3     3     3     4     3     3     3     3     4     4     4     3     4     4     4     4     4     4     5     5     5     5     5     5     5     5     5     6     6     6     6
[18,]    5    2    9    7    5    6    8    8    7     7     8     0     9     9     8     9     1     0     0     9     2     5     6     6     6     9     3     4     4     4     6     9     9     8     7     0     4     4     5
[19,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2
[20,]    6    5    5    5    5    5    4    4    4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4
[21,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[22,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[23,]    2    2    2    2    2    3    3    3    3     2     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3
[24,]    9    8    8    8    7    0    0    0    0     9     2     2     2     1     1     4     4     3     3     3     5     5     5     4     4     6     6     6     5     5     7     7     6     6     6     8     7     7     6
[25,]    1    7    4    1    9    9    6    3    0     8     8     5     2     9     7     3     0     8     5     2     6     3     0     7     4     7     4     0     7     4     5     2     8     5     1     0     6     2     9
[26,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[27,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[28,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[29,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

As you can see, it takes each digit of each number and puts it in its own cell. I just want the original table.
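For what it's worth, I suspect the digit-by-digit splitting happens because the pattern "[0-9]" matches a single digit at a time. As a rough, untested check (it still isn't the table itself, just whole numbers per row), a pattern that matches full decimal numbers keeps each value together; the +2 offset below assumes the dashed separator line sits right after the header line:

# rough check: extract whole numbers (e.g. "2000", "15.6") instead of single digits
numcheck <- str_extract_all(data[cline + 2:51], "[0-9]+\\.?[0-9]*")
# each list element should now look like c("2000", "15.6", "15.6", "6.0", ...)
numcheck[[1]]
# and a numeric matrix can be rebuilt from it
tab <- do.call(rbind, lapply(numcheck, as.numeric))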

Paulina
    what about using `grep("YEAR")`, and then using this value to skip in `read.table(..., skip=number_from_grep)` – user20650 Dec 12 '18 at 22:10
  • @user20650 I don't think that would work given that I still have many tables after this one that also contain YEAR as a header. Also, it gives me an error, `> read.table("prf003.out", skip= 1073) Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 12 elements` – Paulina Dec 12 '18 at 22:20
  • ah okay: you may also be able to use `nrows` as this will be the difference between the multiple row indices returned by the grep. – user20650 Dec 12 '18 at 22:25
  • Hey! I actually got it when I moved the skip down past the headers (i.e. I started importing at the values, not at the table name) and added a max number of rows! Thanks for your help @user20650 – Paulina Dec 12 '18 at 22:26
  • The only thing I have to do now is figure out how to change the value at which I import the file...I have 900 files and the line at which this table occurs slightly varies for each one...any ideas @user20650? – Paulina Dec 12 '18 at 22:28
  • yes, just noted another tricky bit: some of your headers have multiple words on the same line. Are the column header & number of columns always the same? Ps are there any blank rows after a table or is it straight to text -- is this consistent? – user20650 Dec 12 '18 at 22:32
  • Yes, but for the purposes of my research I only need the values in the first column anyway (total carbon). I don't need the column names. So the final solution was `cline <- 2+ grep("YEAR Total Merch Live Dead Dead DDW Floor Shb/Hrb Carbon Carbon from Fire", data, value= FALSE)` followed by `read.table("prf003.out", skip= cline, nrow= 49)`. I will be able to get a data frame and just extract that, rinse and repeat for all my tables, and finish my thesis! haha. – Paulina Dec 12 '18 at 22:37
  • if there is a consistent structure in terms of how the data is recorded after the numerical data it should be possible to automate the number of rows. – user20650 Dec 12 '18 at 22:39
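A rough, untested sketch of the approach worked out in these comments (grep for the header line, then read.table() with skip and nrows), wrapped in a function so it can be applied to all 900 files; the header pattern, the offsets, the column names and the directory path are assumptions that would need checking against the real files:

# Sketch: locate this table's header in each file, then read the data rows
# that follow the header line and the dashed separator line.
read_carbon_table <- function(path) {
  lines  <- readLines(path)
  header <- grep("YEAR\\s+Total\\s+Merch\\s+Live", lines)[1]  # assumed to match only this table
  read.table(path,
             skip  = header + 1,  # skip everything up to and including the dashed line
             nrows = 50,          # 2000-2049 in the example file; adjust if runs differ
             col.names = c("Year", "AG_Total", "AG_Merch", "BG_Live", "BG_Dead",
                           "Stand_Dead", "DDW", "Floor", "Shb_Hrb",
                           "Stand_Carbon", "Removed_Carbon", "Fire_Carbon"))
}

# hypothetical usage over a directory of .out files
files  <- list.files("path/to/outputs", pattern = "\\.out$", full.names = TRUE)
tables <- lapply(files, read_carbon_table)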

1 Answer


What about something like this:

# figure out where the headers are & where the data starts
dataHeader1 <- which(grepl("Aboveground", txtFile))
dataHeader2 <- dataHeader1 + 2
dataStart <- dataHeader2 + 2
# extract the data
txtDat <- txtFile[dataStart:length(txtFile)]
txtDat <- do.call(rbind, strsplit(txtDat, split = "\\s{1,}", perl = TRUE))
class(txtDat) <- "numeric"
txtDat
# returns
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,] 2000 15.6 15.6  6.0  0.5  0.0  4.5  2.6    0  29.1     0     0
 [2,] 2001 15.6 15.6  6.0  0.4  0.0  4.2  2.5    0  28.7     0     0
 [3,] 2002 15.6 15.6  6.0  0.4  0.0  3.9  2.5    0  28.4     0     0
 [4,] 2003 15.6 15.6  6.0  0.4  0.0  3.7  2.5    0  28.1     0     0
 [5,] 2004 15.6 15.6  6.0  0.4  0.0  3.5  2.5    0  27.9     0     0
 [6,] 2005 16.6 16.6  6.0  1.0  1.3  3.6  2.5    0  30.9     0     0
 ....

Note that one can sharpen the regex in order to determine where the data starts, e.g.

dataHeader1 <- which(grepl("(?=.*Aboveground)(?=.*Carbon)", txtFile, perl = TRUE))
# this can be pursued arbitrarily

I read the data via txtFile <- readLines("Path/To/test.txt"), and the raw data itself looks like this:

 [1] "asdsalkjdaskldas+"                                                                                             
 [2] "jsafhnjadfnhdjkasfafdajfbnjasbfjads.kbnjdasnfadsnf"                                                            
 [3] "45453342542542kj             ijholijfkqaef45435314"                                                            
 [4] ""                                                                                                              
 [5] "dasfjasikedfnha4454 "                                                                                          
 [6] "a"                                                                                                             
 [7] "a"                                                                                                             
 [8] "fdgfd"                                                                                                         
 [9] "\t\t6546346343"                                                                                                  
[10] ""                                                                                                              
[11] ""                                                                                                              
[12] "     Aboveground Live    Belowground                        Forest              Total    Total     Carbon"      
[13] "     ----------------- -----------------    Stand  -------------------------    Stand  Removed   Released"     
[14] "YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire"     
[15] "--------------------------------------------------------------------------------------------------------------"
[16] "2000     15.6     15.6      6.0      0.5      0.0      4.5      2.6      0.0     29.1      0.0        0.0"  
...
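In the asker's real file more tables follow this one, so instead of taking everything up to length(txtFile), one could stop at the first blank line after the data starts. A sketch building on the variables above, assuming a blank line does separate the table from whatever comes next:

# stop at the first blank line after the data begins, so trailing tables
# in the same file are not swept into the matrix
blanks  <- which(txtFile == "")
dataEnd <- min(blanks[blanks > dataStart]) - 1
txtDat  <- txtFile[dataStart:dataEnd]
txtDat  <- do.call(rbind, strsplit(txtDat, split = "\\s{1,}", perl = TRUE))
class(txtDat) <- "numeric"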
niko
  • That might work too, but I am using the other user's solution for now. `cline <- 1+ grep("YEAR Total Merch Live Dead Dead DDW Floor Shb/Hrb Carbon Carbon from Fire", data, value= FALSE)` then `table <- read.table("prf007.out", skip= cline, nrow= 49)` – Paulina Dec 12 '18 at 23:26