2

I am trying to replicate the -A and -B argument functionality from GNU grep in Rust. This prints out lines of text before and after a matched line read from a file or stdin. Example:

$ printf '1\n2\n3\nfoo\n4\n5\n6\n7\nfoo\n8\n9' | grep -A 1 -B 1 foo
3
foo
4
--
--
7
foo
8

My desired output would return n lines preceding and/or following a pattern match.

Using just stdin as an example, the simple case of just returning the matched line is easy to implement like this:

use std::io::{self, BufRead, BufReader, Result};
fn main() {
    for line in BufReader::new(io::stdin()).lines() {
        match line {
            Ok(l) => {
                if l.contains("foo"){
                    println!("{}", l);
                }
            }
            Err(e) => println!("error parsing line: {:?}", e),
        }
    }
}

output:

$ printf '1\n2\n3\nfoo\n4\n5\n6\n7\nfoo\n8\n9' | cargo run
foo
foo

However, returning the surrounding lines in an iterator like this does not seem possible, since the previous lines are no longer accessible upon each iteration of the loop.

I saw the Windows type, but it only works on a slice.

This sliding_windows crate appears to be the kind of functionality I want, but I cannot figure out how to get it to work with an iterator over lines from a file.

I also looked at itertools but did not see any kind of window iterator function that did this.

Before I get too deep in rolling my own form of line-buffer object in order to cache n previously seen lines (maybe some kind of ring buffer?), I was hoping that there might already be some method available in Rust that could easily accomplish this.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
user5359531
  • 3,217
  • 6
  • 30
  • 55
  • [Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?](https://stackoverflow.com/q/42134874/155423) – Shepmaster Aug 13 '19 at 00:52
  • Thanks for the link but "The best way to have chunks and windows on an arbitrary iterator/collection is to first collect it into a Vec and iterate over that." -> this is exactly what I am trying to avoid. I am trying to replicate the exact functionality of `grep` in Rust, so having to read the entire file into memory defeats the purpose of the program. I only want to hold `n` + 1 lines in memory at a time; the line being iterated over, and its previous & following lines in the input stream. – user5359531 Aug 13 '19 at 00:59
  • Keep reading to the part of that answer that has a complete, ready-to-go code snippet that creates the chunking thing you want. "If dynamic allocations are accepted, then it is possible to use `Vec` as the Item of the chunking iterator." – Shepmaster Aug 13 '19 at 01:02
  • Thanks, I got a simple version of the answer there to work and output the line following a match. It even has the exact same memory footprint as GNU `grep`, ~900KB while iterating over a 3GB file, though it takes 4x as long (4min vs 1min). However I am not sure how to extend this to also include the lines preceeding a match, might still need a buffer of some kind? – user5359531 Aug 13 '19 at 01:49
  • [Rust file I/O is very slow compared with C. Is something wrong?](https://stackoverflow.com/q/43028653/155423); [Why is my Rust program slower than the equivalent Java program?](https://stackoverflow.com/q/25255736/155423). – Shepmaster Aug 13 '19 at 01:50
  • Thanks, compiling with `cargo --release` has dropped the execution time down to ~15-20s, which is even faster than the equivalent GNU `grep -A`. – user5359531 Aug 13 '19 at 13:50

2 Answers2

3

Implementing this efficiently is pretty tricky, and your instinct to use a roll buffer is pretty much on the money. This is what both GNU grep and ripgrep do. If you're willing to incur some dependencies, then you can pretty much achieve what you want by relying on some of ripgrep's internal libraries. For example, here's a program that does what you want which makes use of the grep-searcher crate:

use std::error::Error;
use std::io;

use grep_regex::RegexMatcher;
use grep_searcher::{Searcher, SearcherBuilder, Sink, SinkContext, SinkMatch};

fn main() -> Result<(), Box<dyn Error>> {
    let re = RegexMatcher::new(r"foo")?;
    let mut searcher = SearcherBuilder::new()
        .before_context(1)
        .after_context(1)
        .build();
    searcher.search_reader(
        &re,
        io::stdin().lock(),
        MySink(io::stdout().lock()),
    )?;
    Ok(())
}

struct MySink<W>(W);

impl<W: io::Write> Sink for MySink<W> {
    type Error = io::Error;

    fn matched(
        &mut self,
        _: &Searcher,
        mat: &SinkMatch,
    ) -> Result<bool, io::Error> {
        self.0.write_all(mat.bytes())?;
        Ok(true)
    }

    fn context(
        &mut self,
        _: &Searcher,
        ctx: &SinkContext,
    ) -> Result<bool, io::Error> {
        self.0.write_all(ctx.bytes())?;
        Ok(true)
    }

    fn context_break(
        &mut self,
        _: &Searcher,
    ) -> Result<bool, io::Error> {
        self.0.write_all(b"--\n")?;
        Ok(true)
    }
}

With the following dependencies:

[dependencies]
grep-regex = "0.1.5"
grep-searcher = "0.1.6"

Its output is:

$ printf '1\n2\n3\nfoo\n4\n5\n6\n7\nfoo\n8\n9' | ./target/release/grepex
3
foo
4
--
7
foo
8

The grep-regex dependency could be dropped if you don't need regexes, but it will require writing a bit more code to provide your own implementation of grep-matcher::Matcher (which is easier than it looks, if all you need is simple substring search).

If you did want to implement this yourself, then you could try reading the implementation inside grep-searcher. Effectively, it's all built on top of a roll buffer.

If performance is less of a concern, then you could do a line-by-line loop, and keep enough buffer available to store the previous N lines (where N is the size of your "before" window). When you find a match, print the previous N lines from your buffer. At the same time, start a counter at N that decreases by 1 with each successive line. While the counter is above 0, print the line, which will correspond to the "after" window. (This is not that dissimilar to what you'd do with a roll buffer.)

BurntSushi5
  • 13,917
  • 7
  • 52
  • 45
1

One approach is to develop a state machine that keeps a window of the last 2N+1 lines at all times. When a matching line is found, it adds a new entry to a list of to-be-printed pending matches, associated with a future line number. When that future line number is reached, the entries with that line number are printed, along with their lines of context pulled from the context window, keeping in mind that matches close to the beginnig of the input have fewer than N leading context lines. When the end of input is reached, any entries still pending are printed, keeping in mind these have fewer than N trailing context lines.

Good luck fighting Rust with all this mutation!

Kaz
  • 55,781
  • 9
  • 100
  • 149
  • yeah this is roughly what I was working on for my own implementation, basically a `Vec` that gets `push`'d each line then truncated to `n` size every iteration, when a match is found print the Vec contents. This would work for the lines preceeding the match. Have not figured out how to handle the lines after the match yet though. I was thinking that maybe [`itertools::multipeek`](https://docs.rs/itertools/0.8.0/itertools/fn.multipeek.html) might help for the lines following the match, but I cant find a working example of how to use it – user5359531 Aug 13 '19 at 01:04