
While learning Node.js I found this sample code in a book (Node.js in Practice) which uses a writable stream to count matches in data coming from another stream.

var Writable = require('stream').Writable;
var util = require('util');
module.exports = CountStream;
util.inherits(CountStream, Writable);

function CountStream(matchText, options) {
    Writable.call(this, options);
    this.count = 0;
    this.matcher = new RegExp(matchText, 'ig');
}

CountStream.prototype._write = function(chunk, encoding, cb) {
    var matches = chunk.toString().match(this.matcher);
    if (matches) {
        this.count += matches.length;
    }
    cb();
};

CountStream.prototype.end = function() {
    this.emit('total', this.count);
};

And the code which uses the stream:

var CountStream = require('./countstream');
var countStream = new CountStream('book');
var http = require('http');

http.get('http://www.manning.com', function(res) {
    res.pipe(countStream);
});

countStream.on('total', function(count) {
    console.log('Total matches:', count);
});

Isn't it possible to lose some matches if a match is split across two chunks of data?

For example, the first chunk contains 'This is a bo' and the next chunk contains 'ok of mine.'. Neither chunk contains 'book' on its own, but the whole text does.

What would be the best solution to find all matches?
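The problem is easy to reproduce without any stream machinery. This small sketch (with a hypothetical chunk split at the boundary described above) matches each chunk independently and then the concatenated text:

```javascript
// Demonstrate that per-chunk matching can miss a match that
// straddles a chunk boundary.
var chunks = ['This is a bo', 'ok of mine.'];
var matcher = /book/ig;

// Matching each chunk on its own finds nothing...
var perChunk = chunks.reduce(function (count, chunk) {
    var matches = chunk.match(matcher);
    return count + (matches ? matches.length : 0);
}, 0);
console.log(perChunk); // 0

// ...while the concatenated text contains one match.
var whole = chunks.join('').match(matcher);
console.log(whole.length); // 1
```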

  • Well spotted. Yes I'd say it can lose matches. Probably not very often, because I'd guess chunks would be large, which makes the bug intermittent - the worst sort of bug. – James Jun 28 '15 at 13:38
  • Indeed it is. Also if the pattern size is greater than the chunk size (which is usually not so much of a problem for most use cases). One way to avoid this - if you just need to find substring matches - would be to use [KMP](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) or any other algorithm which can be made to work in a stream-based manner. @James: Beat me to it! I'm a slow typist. Sigh. – galactocalypse Jun 28 '15 at 13:42
  • Do you actually need regex, or are you looking for plain matches? – Bergi Jun 28 '15 at 13:58
  • @Bergi I think searching regex is more challenging. Not using regex makes it easier because the length of searching string will be predictable? – mehrandvd Jun 28 '15 at 14:06
  • @galactocalypse KMP seems a good idea. – mehrandvd Jun 28 '15 at 14:11
  • @mehrandvd: Yes, fixed strings would be much more easy, because it simplifies testing whether the end of a chunk might be the start of a new match – Bergi Jun 28 '15 at 14:15
  • @mehrandvd: Of course you can evaluate regex matches on a streamed input as well, but you will have to write the algorithm yourself and cannot use the builtin `RegExp` methods. Writing a string matching algorithm is much more simple :-) – Bergi Jun 28 '15 at 14:17
  • @Bergi Agree. I think an answer that just matches strings would be sufficient in the situation. – mehrandvd Jun 29 '15 at 04:56
  • If your RegExp pattern has a maximum length (say L), you could handle this with a little cache: in your _write function, keep the last (L-1) characters of the previous chunk and concatenate them with the new incoming chunk. Once this is done, you can match your RegExp without losing any match. But, as James spotted, your chunks need to be larger than your max length L. – Sébastien Doncker Jul 01 '15 at 14:46
  • @SébastienDoncker Good suggestion. But is there anyway to find out the maximum length of the strings that match a Regex, say MaxLength(regex)? Maybe I should move it to another question. – mehrandvd Jul 01 '15 at 18:44
  • @SébastienDoncker I posted another question for that: http://bit.ly/1U9JDwK But this question remains open. Why don't you post your suggested code as an answer? – mehrandvd Jul 01 '15 at 21:44
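The stream-friendly matching discussed in the comments can be sketched roughly like this (a hypothetical `StreamMatcher`, assuming a fixed search string rather than a regex, following the KMP suggestion; only the automaton state is kept between chunks, so boundary-straddling matches are still counted):

```javascript
// KMP-style matcher that can be fed one chunk at a time.
// Fixed search strings only, no regex support.
function StreamMatcher(pattern) {
    this.pattern = pattern;
    this.state = 0; // how many pattern characters are currently matched
    this.count = 0;

    // Precompute the KMP failure table: fail[i] is the length of the
    // longest proper prefix of pattern[0..i] that is also its suffix.
    this.fail = new Array(pattern.length);
    this.fail[0] = 0;
    for (var i = 1, k = 0; i < pattern.length; i++) {
        while (k > 0 && pattern[i] !== pattern[k]) k = this.fail[k - 1];
        if (pattern[i] === pattern[k]) k++;
        this.fail[i] = k;
    }
}

StreamMatcher.prototype.push = function (chunk) {
    for (var i = 0; i < chunk.length; i++) {
        while (this.state > 0 && chunk[i] !== this.pattern[this.state]) {
            this.state = this.fail[this.state - 1];
        }
        if (chunk[i] === this.pattern[this.state]) this.state++;
        if (this.state === this.pattern.length) {
            this.count++;
            this.state = this.fail[this.state - 1];
        }
    }
};

var m = new StreamMatcher('book');
m.push('This is a bo');
m.push('ok of mine.');
console.log(m.count); // 1
```

Because no text is buffered, this works regardless of how small the chunks are relative to the pattern.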

1 Answer


So, like I explained in my comments, if you know the maximum length of the strings matched by your regex (to compute it, see the very good answer at https://stackoverflow.com/a/31173778/4114922), you can cache the end of the previous chunk and prepend it to the new chunk. With this method, I don't think you're going to lose any match.

var Writable = require('stream').Writable;
var util = require('util');
module.exports = CountStream;
util.inherits(CountStream, Writable);

function CountStream(matchText, maxPatternLength, options) {
    Writable.call(this, options);
    this.count = 0;
    this.matcher = new RegExp(matchText, 'ig');

    this.previousCache = undefined;
    this.maxPatternLength = maxPatternLength;
}

CountStream.prototype._write = function(chunk, encoding, cb) {
    var text;
    if(this.previousCache === undefined) {
        text = chunk.toString();
    }
    else {
        text = this.previousCache + chunk.toString();
    }
    var matches = text.match(this.matcher);
    if (matches) {
        this.count += matches.length;
    }

    // keep only (maxPatternLength - 1) trailing characters, per the comments:
    // a full match can then never sit entirely inside the cache and be
    // counted a second time on the next chunk
    this.previousCache = text.substring(Math.max(0, text.length - this.maxPatternLength + 1));

    cb();
};

CountStream.prototype.end = function() {
    this.emit('total', this.count);
};
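The caching step can be exercised in isolation, without the stream plumbing. This standalone sketch (a hypothetical `countMatches` helper) keeps `maxPatternLength - 1` trailing characters, matching the (L-1) figure from the comments, so a completed match cannot be recounted:

```javascript
// Count regex matches across chunk boundaries by carrying the last
// (maxPatternLength - 1) characters of each chunk into the next one.
function countMatches(chunks, matchText, maxPatternLength) {
    var matcher = new RegExp(matchText, 'ig');
    var cache = '';
    var count = 0;
    chunks.forEach(function (chunk) {
        var text = cache + chunk;
        var matches = text.match(matcher);
        if (matches) count += matches.length;
        // one character fewer than the longest possible match, so a match
        // already counted cannot reappear entirely inside the cache
        cache = text.substring(Math.max(0, text.length - maxPatternLength + 1));
    });
    return count;
}

console.log(countMatches(['This is a bo', 'ok of mine.'], 'book', 4)); // 1
console.log(countMatches(['a book', ' and a book'], 'book', 4));       // 2
```

As noted in the comments, this only works when each chunk is at least as long as the maximum match length.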