Parsing a stream without clipping

Question

I'm reading a stream, which is tested with a regex:

var deviceReadStream = fs.createReadStream("/path/to/stream");

deviceReadStream.on('data',function(data){
  if( data.match(aRegex) ) {
    //do something
  }
});

But because the stream is split into several chunks, a cut can fall in the middle of a match and make me miss it. Is there a better pattern for continuously testing a stream with a regex?

More details

The stream is the content of a crashed filesystem. I am searching for an ext2 signature (0xef53). As I do not know how the chunks are split, the signature could be split across two chunks and go undetected.

So I used a loop instead, which lets me decide myself how the chunks are delimited, i.e. one filesystem block at a time.

But using streams seems to be the better pattern, so how can I use streams while defining the chunk size myself?

What kind of data do you have coming in; What is the regular expression that you're checking with? — d0nut, Aug 21 '15 at 13:29
If the chunks you get from the stream are multiples of what is expected to be matched - there probably won't be a problem. However, this seems incredibly unlikely if we have a random regex and random chunks. Therefore, what are the regex and the chunks? — ndnenkov, Aug 21 '15 at 14:35

score 6 · Answer 1 · edited May 23 '17 at 12:31

Assuming your code just needs to search for the signature 0xef53 (as specified in the "more details" part of your question):

One way to do this and keep using a regex is to keep a reference to the previous data buffer, concatenate it with the current data buffer, and run the regex on the result. It's a bit heavy on CPU usage, since it effectively scans each data buffer twice (and there's a lot of memory allocation due to the concatenation). It is relatively easy to read, though, so it should be maintainable in the future.

Here's an example of what the code would look like:

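var fs = require('fs');

var deviceReadStream = fs.createReadStream("/path/to/stream");
var prevData = '';

deviceReadStream.on('data',function(data){
  // Join the previous chunk to the current one, so that a match split
  // across the chunk boundary is still visible to the regex.
  var buffer = prevData + data;
  if( buffer.match(aRegex) ) {
    //do something
  }

  // Remember this chunk for the next 'data' event.
  prevData = data;
});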

Another option would be to do the character comparisons more manually, so the code can catch a signature that is split across data buffers. You can see a solution to that in this related question: Efficient way to search a stream for a string. According to the blog post of the top answer, the Haxe code he wrote can be built to produce JavaScript, which you can then use. Or you could write your own custom code to do the search, since the signature that you're looking for is only 4 characters long.
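As a rough sketch of what such custom code could look like (this is not the Haxe solution from the linked question): keep only the last `signature.length - 1` bytes of what has been seen so far, prepend them to the next chunk, and search with `Buffer.indexOf`. The names `createSignatureScanner` and `scanFile` are made up for this illustration, and the signature is assumed to be a fixed byte sequence rather than a regex.

```javascript
var fs = require('fs');

// Hypothetical helper: returns a function that accepts one chunk at a time
// and returns the absolute byte offsets of `signature` found in it,
// including matches that straddle the boundary with the previous chunk.
function createSignatureScanner(signature) {
  var tail = Buffer.alloc(0); // last (signature.length - 1) bytes seen so far
  var consumed = 0;           // number of stream bytes that precede `tail`

  return function scan(chunk) {
    var haystack = Buffer.concat([tail, chunk]);
    var offsets = [];
    var pos = haystack.indexOf(signature);
    while (pos !== -1) {
      offsets.push(consumed + pos);
      pos = haystack.indexOf(signature, pos + 1);
    }
    // Carry over just enough bytes to complete a split match next time.
    // The tail is shorter than the signature, so nothing is reported twice.
    var keep = Math.min(signature.length - 1, haystack.length);
    consumed += haystack.length - keep;
    tail = haystack.subarray(haystack.length - keep);
    return offsets;
  };
}

// Hypothetical wiring to a read stream; onMatch receives each byte offset.
function scanFile(path, signature, onMatch) {
  var scan = createSignatureScanner(signature);
  return fs.createReadStream(path).on('data', function (chunk) {
    scan(chunk).forEach(onMatch);
  });
}
```

You would call it with something like `scanFile("/path/to/stream", Buffer.from([0xef, 0x53]), console.log)`, with the two bytes swapped if the signature is stored in the other byte order. Compared with concatenating whole chunks, each byte is scanned only once apart from the few carried-over bytes.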

answered Aug 22 '15 at 15:23

Ed Ballot


Your solution implies that the data is processed twice. And if the searched sequence is longer than two chunks, the problem is still there. I think the solution in your linked post, with a partial search, is a good way to go. But then it is no longer possible to use regexes. – Gaël Barbin Aug 22 '15 at 17:48
The solution works for more than two chunks. For each chunk, it uses the previous and current chunk. So it should always find the signature, unless the chunks are somehow only 1 or 2 characters long. It seems unlikely the chunks would be that small. – Ed Ballot Aug 22 '15 at 19:35
It works for my example, as the searched sequence is likely to be smaller than the chunks, but with a sequence of undefined length (for example /aaa.*bbb/), the problem persists. – Gaël Barbin Aug 22 '15 at 21:00
@Gael: There is no way you can confidently tell whether a regex matches something without seeing at least as many characters from the current position to the end of the match. One implementation of regex on stream is Java Scanner, but it still has to keep as much content of the stream in the memory as needed to be certain that more input will not change the match result. – nhahtdh Aug 24 '15 at 05:53

score 2 · Accepted Answer · answered Aug 25 '15 at 21:10

First, if you are determined to use a regex with Node.js, give PCRE a try. A node wrapper for PCRE is available. PCRE can be configured to do partial matches that can resume across buffer boundaries.

You might, though, just grep (or fgrep for multiple static strings) for a byte offset from the terminal. You can then follow it up with xxd and less to view it or dd to extract a portion.

For example, to get offsets with grep:

grep --text --byte-offset --only-matching --perl-regexp "\xef\x53" recovery.img

Note that grep command line options can vary depending on your distro.

You could also look at bgrep though I haven't used it.

I have had good luck doing recovery using various shell tools and scripts.

A couple of other tangential comments:

Keep in mind the endianness of whatever you are searching.
Take an image since you are doing a recovery, if you have not already. Among other perils, if a device is starting to fail, further access can make it worse.
Look into data carving tools for reference.
As you mentioned, files may be fragmented. Still, I would expect that partitions and files start on sector boundaries. As far as I know, the magic would not typically be split.
Be careful not to inadvertently write to the device you are recovering.
As you may know, if you reconstruct the image you may be able to mount the image using a loopback driver.
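To illustrate the endianness point in Node terms, here is a small sketch of how the 16-bit value 0xEF53 is laid out in each byte order. As far as I know, ext2 stores its on-disk fields little-endian, so the low byte would come first in the image, but check that against your own data before relying on it.

```javascript
// The same 16-bit value, written in both byte orders.
var littleEndian = Buffer.alloc(2);
littleEndian.writeUInt16LE(0xef53, 0);

var bigEndian = Buffer.alloc(2);
bigEndian.writeUInt16BE(0xef53, 0);

console.log(littleEndian.toString('hex')); // 53ef
console.log(bigEndian.toString('hex'));    // ef53

// Reading the bytes back with the matching byte order gives the value again.
console.log(littleEndian.readUInt16LE(0) === 0xef53); // true
```

So a byte-level search pattern written as "\xef\x53" only matches the big-endian layout; for the little-endian layout the two bytes are swapped.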

Thank you for your answer, partial matches seem to be interesting, could you provide an example of use? — Gaël Barbin, Aug 27 '15 at 18:43
@Gael It looks like the node pcre module does not currently expose the DFA interface which supports resume. The exposed API can tell you that the end of the string triggered a partial match, but it cannot resume on the next buffer. It looks straightforward to extend the library, but in the meantime do you want an example in another language? — user650881, Sep 02 '15 at 19:12

surui · Answer 3 · 2015-08-27T22:49:45.223

I would go with looking at the data stream as a moving window of size 6 bytes.

For example, if you have the following file (in bytes): 23, 34, 45, 67, 76

A moving window of 2 passing over the data will be:

[23, 34]
[34, 45]
[45, 67]
[67, 76]

I propose going over these windows looking for your string.

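var Stream = require('stream');
var fs = require('fs');

var exampleStream = fs.createReadStream("./dump.dmp");
var matchCounter = 0;
// Note: this compares each window against the six ASCII characters "0xEF53".
// To match the raw signature bytes instead, compare the window against a
// Buffer holding those bytes and use a window of that length.
windowStream(exampleStream, 6).on('window', function(buffer){
    if (buffer.toString() === '0xEF53') {
        ++matchCounter;
    }
}).on('end', function(){
    console.log('done scanning the file, found', matchCounter);
});

// Emits a 'window' event for every run of windowSize consecutive bytes,
// regardless of how the input stream happens to be chunked.
function windowStream(inputStream, windowSize) {
    var outStream = new Stream();
    var soFar = [];
    inputStream.on('data', function(data){
        Array.prototype.slice.call(data).forEach(function(byte){
            soFar.push(byte);
            if (soFar.length === windowSize) {
                // Buffer.from replaces the deprecated new Buffer(array).
                outStream.emit('window', Buffer.from(soFar));
                // Drop the oldest byte so the window slides forward by one.
                soFar.shift();
            }
        });
    });
    inputStream.on('end', function(){
        outStream.emit('end');
    });
    return outStream;
}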

Usually I'm not a fan of going over bytes when you actually need the underlying string. In UTF-8 there are cases where it might cause some issues, but assuming everything is in English it should be fine. The example can be improved to support these cases by using a string decoder.
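To show the kind of issue I mean, here is a small sketch using Node's `string_decoder` module. It cuts the three-byte UTF-8 encoding of "€" after its first byte, the way a chunk boundary might, and decodes the two pieces both naively and through a `StringDecoder`. The split point is picked by hand for the example.

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

var euro = Buffer.from('€', 'utf8'); // three bytes: e2 82 ac
var firstPiece = euro.subarray(0, 1); // pretend the chunk boundary falls here
var secondPiece = euro.subarray(1);

// Decoding each piece on its own yields replacement characters.
var naive = firstPiece.toString('utf8') + secondPiece.toString('utf8');

// StringDecoder holds back the incomplete sequence until the rest arrives.
var decoder = new StringDecoder('utf8');
var decoded = decoder.write(firstPiece) + decoder.write(secondPiece);

console.log(naive === '€');   // false
console.log(decoded === '€'); // true
```

As far as I know, passing an encoding option to fs.createReadStream makes the stream do this decoding for you, so the 'data' events then deliver whole characters.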

EDIT

Here is a UTF8 version

var Stream = require('stream');
var fs = require('fs');

var exampleStream = fs.createReadStream("./dump.dmp", {encoding: 'utf8'});
var matchCounter = 0;

windowStream(exampleStream, 6).on('window', function(windowStr){
    if (windowStr === '0xEF53') {
        ++matchCounter;
    }
}).on('end', function(){
    console.log('done scanning the file, found', matchCounter);
});
function windowStream(inputStream, windowSize) {
    var outStream = new Stream();
    var soFar = "";
    inputStream.on('data', function(data){
        Array.prototype.slice.call(data).forEach(function(char){
            soFar += char;
            if (soFar.length === windowSize) {
                outStream.emit('window', soFar);
                soFar = soFar.slice(1);
            }
        });
    });
    inputStream.on('end', function(){
        outStream.emit('end');
    });
    return outStream;
}
