R: a grep alternative for a file without using readLines?

Question

Is there a function in any package that can search a text file with a regex and return the line numbers of the matches? For example, read.pattern from gsubfn can find and extract a pattern but can't return line numbers, and grep can't read files directly. Example:

file:

  .122448110000D+06  .400000000000D+01                                      
 3 15  3 23 10  0  0.0  .267305411398D-03  .161435309564D-10  .000000000000D+01
  .510000000000D+02  .625000000000D-01  .440982654411D-08  .306376855997D+00
 5 15  3 23 11 59 44.0 -.263226218521D-03  .488853402202D-11  .000000000000D+01

pattern: reg="^ *\\d+ +(?:[0-9]+ +){5}[.0-9]+.*$", which matches the 2nd and 4th lines. So what I want is something like:

> file.grep(file, reg)
[1] 2 4

Is there anything of the sort? I understand that the usual approach is to call readLines and then get creative with grep, which is fine when the files are not that big. But I have read here about many people having problems with large data sets that are not table-structured, problems that could be solved with such a tool (or with a readLines that supported a regex-based skip parameter), and I wonder whether anyone has written something like that.

score 2 · Answer 1 · edited May 23 '17 at 10:27


EDITED1

I just found another post relating to this question with an alternative solution: grep while reading file

ORIGINAL POST

Is this what you are looking for?

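library(gsubfn)

# Write the sample data from the question to a file
# (the trailing newline avoids readLines' "incomplete final line" warning)
cat(" .122448110000D+06  .400000000000D+01
 3 15  3 23 10  0  0.0  .267305411398D-03  .161435309564D-10  .000000000000D+01
 .510000000000D+02  .625000000000D-01  .440982654411D-08  .306376855997D+00
 5 15  3 23 11 59 44.0 -.263226218521D-03  .488853402202D-11  .000000000000D+01
", file = "test.txt")

# Extract the lines that match the pattern (note: this still goes through readLines)
read.pattern(text = readLines("test.txt"), pattern = "^ *\\d+ +(?:[0-9]+ +){5}[.0-9]+.*$")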
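
If the line numbers are needed rather than the matching lines, one option is to read the file through a connection in fixed-size chunks and run grep on each chunk, so that the whole file is never in memory at once. Below is a minimal sketch under that assumption. file_grep and chunk_size are hypothetical names, not functions or arguments from base R or gsubfn.

```r
# Hypothetical helper (not part of base R or gsubfn): returns the line
# numbers of all lines in `path` that match `pattern`, reading at most
# `chunk_size` lines into memory at a time.
file_grep <- function(path, pattern, chunk_size = 10000L, perl = TRUE) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- integer(0)
  offset <- 0L  # number of lines consumed before the current chunk
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break  # end of file
    # grep() gives positions within the chunk; shift them to file positions
    hits <- c(hits, offset + grep(pattern, chunk, perl = perl))
    offset <- offset + length(chunk)
  }
  hits
}

# Usage with the sample data from the question
writeLines(c(
  "  .122448110000D+06  .400000000000D+01",
  " 3 15  3 23 10  0  0.0  .267305411398D-03  .161435309564D-10  .000000000000D+01",
  "  .510000000000D+02  .625000000000D-01  .440982654411D-08  .306376855997D+00",
  " 5 15  3 23 11 59 44.0 -.263226218521D-03  .488853402202D-11  .000000000000D+01"
), "test.txt")

reg <- "^ *\\d+ +(?:[0-9]+ +){5}[.0-9]+.*$"
file_grep("test.txt", reg)
# [1] 2 4
```

This still uses readLines internally, but only on one chunk at a time, so memory use is bounded by chunk_size rather than by the file size.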


answered Jan 17 '16 at 19:59

tomtom


A very neat trick indeed, but unfortunately, no. The thing is, I need the line numbers so that I can extract what lies between them, not the matching lines themselves. I want to pull out a block between two specific matching lines (e.g. below the 2nd and up to the 4th). This would work if read.pattern didn't read one line at a time; then I could get fancy with the regex and extract the data between matches. But it does, so I can't. – ephemeris Jan 17 '16 at 20:43
That other question won't work either: the matching lines carry additional data about the blocks between them, so I can't just mash all the blocks together. The blocks may have different numbers of lines, as specified in the matching lines. – ephemeris Jan 17 '16 at 20:51
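
For the block extraction described in these comments, once the matching line numbers are known, the lines between two of them can be read through a connection without holding the rest of the file. This is only a rough sketch, assuming the line numbers (2 and 4 for the sample data) have already been found. read_block is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: returns the lines after line `from`, up to and
# including line `to`.
read_block <- function(path, from, to) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Skip the first `from` lines; they are read and discarded.
  if (from > 0L) invisible(readLines(con, n = from, warn = FALSE))
  # Read lines from + 1 through `to`.
  readLines(con, n = to - from, warn = FALSE)
}

# Usage with the sample data from the question
writeLines(c(
  "  .122448110000D+06  .400000000000D+01",
  " 3 15  3 23 10  0  0.0  .267305411398D-03  .161435309564D-10  .000000000000D+01",
  "  .510000000000D+02  .625000000000D-01  .440982654411D-08  .306376855997D+00",
  " 5 15  3 23 11 59 44.0 -.263226218521D-03  .488853402202D-11  .000000000000D+01"
), "test.txt")

block <- read_block("test.txt", from = 2L, to = 4L)  # lines 3 and 4
```

For a very large `from`, the skip could itself be done in chunks, so that the discarded lines never occupy much memory at once.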
