
I have a text file with tens of thousands of rows, with timestamps such as 2010 5 3 0 0 interspersed in between. The timestamp rows don't appear at regular intervals, but the two-column data rows are consistent.

How can I import the two columns (the Trial label and the number) while ignoring the rows that contain these timestamps?

a <- read.table('test.txt')

Currently, I get this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 5 did not have 2 elements

Data

 Trial  0.214526266019124
 Trial  0.213914388985549
 Trial  0.213886659329060
 Trial  0.213886587273578
2010  5  3  0  0
 Trial  0.213886587273578
 Trial  0.213256610071994
 Trial  0.213232963405967
 Trial  0.213232928149832
2011  2  3  0  0
 Trial  0.213886587273578
 Trial  0.213256610071994
 Trial  0.213232963405967
 Trial  0.213232928149832
 Trial  0.213886587273578
 Trial  0.213256610071994
 Trial  0.213232963405967
2011  2  6  0  0
maximusdooku
  • I would use `readLines()` to read it in and filter the timestamp lines out afterwards using a regex. You can export the resulting character vector using `sink()` and `cat()` to write it back into a text file. The smarter way of doing this is to use the command line to delete those lines from the text file directly; that would make this a `UNIX`-related question about filtering your data. – InfiniteFlash Jan 17 '18 at 20:30
  • I am thinking a) readLines b) ignore lines that don't have `Trial`... Not sure if it will work. Trying... – maximusdooku Jan 17 '18 at 20:31
  • See this post; it answers your question. https://stackoverflow.com/a/25682303/5874001 – InfiniteFlash Jan 17 '18 at 20:34
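
A minimal sketch of the pre-filtering workflow suggested in the comments above: read the raw lines, keep only the Trial rows, and write the cleaned file back out so it can be read with read.table(). The file names are illustrative, and writeLines() stands in here for the sink()/cat() route mentioned in the comment.

# keep only lines that start with (optional whitespace and) "Trial"
raw <- readLines("test.txt")
trial_lines <- grep("^\\s*Trial", raw, value = TRUE)
writeLines(trial_lines, "test_clean.txt")

a <- read.table("test_clean.txt", col.names = c("trial", "value"))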

3 Answers


You can use read.table (or another import function) in combination with grep:

read.table(text=grep("Trial", readLines(path_to_your_file), value=TRUE))

Does this solve your problem?

Vincent Bonhomme
  • This is the same answer (basically) as in the link I posted. You could argue the OP has posted a duplicate question. – InfiniteFlash Jan 17 '18 at 20:36
  • I was preparing the answer and hadn't seen your comment, and we have something much simpler here, don't we? – Vincent Bonhomme Jan 17 '18 at 20:39
  • Yeah, it is I guess. It's the same idea really, a good one! If the OP doesn't get marked as a duplicate, yours should be the right answer. – InfiniteFlash Jan 17 '18 at 20:43
  • Thanks! Succinct. I tried it over a subset of the data, and it works, but it's really slow for thousands of lines... – maximusdooku Jan 17 '18 at 20:44
  • There's not really a workaround using readLines here, I think. If you're concerned about speed, JeanVuda's suggestion might be quicker, but pre-processing your data as I recommend in the comments on the question is what one would normally do. – InfiniteFlash Jan 17 '18 at 20:51
  • You might consider using `fread` from the `data.table` package for speed. It will actually accept a shell command, such as `"grep Trial test.txt"`, for input. – Nathan Werth Jan 17 '18 at 21:32
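
A rough sketch of the fread() suggestion in the last comment. Assumptions: a Unix-like shell with awk available, a data.table version recent enough to have the cmd argument, and the file name test.txt from the question. awk keeps only the Trial rows and collapses the variable whitespace so the two columns parse cleanly.

# filter in the shell, parse with data.table::fread
library(data.table)

a <- fread(cmd = "awk '/Trial/ {print $1, $2}' test.txt",
           header = FALSE, sep = " ",
           col.names = c("trial", "value"))

Pushing the filtering into the shell like this is usually much faster on large files than reading everything into R and filtering afterwards.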

If you have Perl, you can do the data cleaning with it and capture the output without leaving R by using pipe. Having to escape the regex and quotes in the Perl "one-liner" makes it a little awkward, and it would probably be better as its own script.

The pipe to Perl here is maybe more complicated than you need; perl -lne 'print $1 if m/Trial (.*)/' would probably suffice. The version below also captures each timestamp and appends it to the following lines until the next timestamp is found. \W+ matches one or more non-word characters (here, the whitespace between the fields), but it needs an extra backslash so that it survives R's parser and reaches Perl: \\W+. \" is used to keep R from thinking the string we are giving it has ended, while still allowing string delimiters inside the Perl code (qq(..) could be used instead of "..." in Perl).

a <- read.table(
  pipe("perl -lne '
       BEGIN{ $ts = \"0 0 0 0 0\" }
       chomp;
       if (/Trial\\W+(.*)/) {
         print \"$1 $ts\"
       } else {
         $ts = $_
       }' test.txt"))

For the example data, the output would be:

         V1   V2 V3 V4 V5 V6
1 0.2145263    0  0  0  0  0
2 0.2139144    0  0  0  0  0
3 0.2138867    0  0  0  0  0
4 0.2138866    0  0  0  0  0
5 0.2138866 2010  5  3  0  0
6 0.2132566 2010  5  3  0  0
7 0.2132330 2010  5  3  0  0
8 0.2132329 2010  5  3  0  0
Will
# Read the raw lines, keep only the rows that contain "Trial", then split
# each kept line on runs of whitespace and bind the pieces into a matrix.
txt <- readLines("C:\\Users\\abc\\Desktop\\new2.txt")
trials <- strsplit(trimws(txt[grepl("Trial", txt)]), split = "\\s+")
tab <- do.call("rbind", trials)   # 'tab' avoids masking base::table
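
The result of do.call("rbind", ...) is a character matrix. As a hypothetical follow-up (the column names are assumed, not part of the original answer), it can be converted to a data frame with a numeric value column:

df <- data.frame(trial = tab[, 1],
                 value = as.numeric(tab[, 2]),
                 stringsAsFactors = FALSE)
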
JeanVuda