If I run my regex on the data as a string, I have no issues: all three lines get matched.

https://regex101.com/r/pHsTvV/1

const regex = /(?<email>((?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-|\ |\+|\(|\)|\.|\/){7,})/gm;
const str = `john.doe@gmail.test| John Doe| 160 Boston Rd| Chelmsford MA 11824| United States| 00088782000
jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000
alicia.random123@gmail.test| Alicia Random| BLK 8, City Point| No.58 Wing Shun Street| Tsuen Wan| Not in U.S.| +00092262000`;

const lines = str.split('\n')
lines.forEach(line => {
    const test = regex.exec(str)
    if (test && test.groups) {
        console.dir(test.groups)
    } else {
        console.log('could not match')
    }
});

However, when I load my data from a txt file, JavaScript always fails to match one line out of two:

const regex = /(?<email>((?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-|\ |\+|\(|\)|\.|\/){7,})/gm;
import * as fs from 'fs';
import * as path from 'path';
import * as es from 'event-stream';
const filePath = path.join(process.cwd(), 'data/test.txt')
var s = fs.createReadStream(filePath)
    .pipe(es.split())
    .pipe(es.mapSync(function (line: string) {
        let values = regex.exec(line.trim())
        if (values && values.groups) {
            console.dir(values.groups)
        } else {
            console.log(`COULD NOT MATCH`)
            console.log(line)
        }
    }).on('error', function (err) {
        console.log('Error while reading file.', err);
    })
        .on('end', function () {
            console.log('Read entire file.')
        })
    )

The test.txt file is as follows:

john.doe@gmail.test| John Doe| 160 Boston Rd| Chelmsford MA 11824| United States| 00088782000
jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000
alicia.random123@gmail.test| Alicia Random| BLK 8, City Point| No.58 Wing Shun Street| Tsuen Wan| Not in U.S.| +00092262000

Even on a file with 100 lines, it is always one line out of two that does not get matched. When I read the file above, the jane.doe@aol.test line does not get matched.

I have tried the following to see if it's line-specific:

const regex = /(?<email>((?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-|\ |\+|\(|\)|\.|\/){7,})/gm;
const uniqueStr = `jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000`

const test = regex.exec(uniqueStr)
if (test && test.groups) {
    console.dir(test.groups)
} else {
    console.log('could not match')
    console.log(uniqueStr)
}

This does not match, but if I try the regex on regex101 it matches without any issue.

https://regex101.com/r/52kpRD/1

Mederic
  • Maybe it has something to do with the line terminators? Maybe also with the `.*` together with the `m` flag. And "one line out of two that do not get matched": Is it 50% get not matched or exactly every second line does not match? – miile7 Sep 17 '21 at 05:45
  • When I run it on 100 lines I get 48 non-matches. I don't think it's the `.*`, because if I copy-paste the lines from the txt file into regex101 it matches every line. If I log line by line it shows all the lines, so the pipe works fine. – Mederic Sep 17 '21 at 05:48
  • Can you compare the exact strings what is read from the file? Maybe some encoding issues? Can you `console.log` the byte sequences and then check if they are the exact same if you copy the files contents into a string? – miile7 Sep 20 '21 at 06:26
  • @miile7 The bytes were an exact match. I have found that if I try the string by itself it says no match, but regex101 says it is a match (updated question). – Mederic Sep 20 '21 at 06:56
  • I tried the last code block in the browser (Firefox) and it matches too. So maybe it has something to do with the regexp engine? Is it different in Node (I thought it's standardized)? – miile7 Sep 20 '21 at 07:01
  • @miile7 i guess it must come from the node regex engine but at this point I have no idea on what or how it could be – Mederic Sep 20 '21 at 07:03
  • why would you do that with regex if you know data is separated by `|`? – Vulwsztyn Sep 20 '21 at 07:09
  • @Vulwsztyn because the order of the data is not fixed. Hence using groups to identify what I need. Also various documents take various delimiters so I have a dynamic regex builder for the fields needed. – Mederic Sep 20 '21 at 07:10
  • Node is based on the [Chrome V8 engine](https://v8.dev/), so I assume that the regexp engine is also taken from there. Maybe you can find out some differences from that? – miile7 Sep 20 '21 at 07:12
  • I still think it is not the best approach. IMO you should first split the data into arrays of strings and then identify which string contains which field. Parsing structured data with regex is always a bad idea https://stackoverflow.com/a/1732454/7195666 – Vulwsztyn Sep 20 '21 at 07:14
  • @Vulwsztyn I don't see how that relates to an issue in the regex engine of Node.js. I am also not parsing HTML or XHTML; I'm parsing simple CSVs. Splitting also doesn't work, as some regexes matched several fields and the data can be very vague. Furthermore, when benchmarked, splitting fields, running all the regexes, and having multiple positive matches on a value took far longer to analyse than regex groups. The first POC used splits, but the data is too irregular for it to work. – Mederic Sep 20 '21 at 07:22
  • Can you please post the file you are testing this on? What you are likely suffering from is a difference in line breaks, but this cannot be confirmed without said file – Jacob Sep 20 '21 at 16:45

1 Answer

See the accepted answer to this question: RegExp is Stateful

Essentially, your regex is an object that remembers the index at which its last match ended (regex.lastIndex). Because it has the g flag, the next exec call continues from that index instead of looking for a match from the start of the new line.
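The alternating match/no-match pattern can be reproduced in isolation. The snippet below is a minimal sketch using a deliberately simple pattern and input (both are illustrative and not taken from the question); it only shows how exec and lastIndex interact on a regex with the g flag.

```javascript
// Illustrative pattern with the g flag; the same object is reused for every call.
const pattern = /\d+/g;
const input = 'abc 123';

const first = pattern.exec(input);           // matches "123"
const indexAfterFirst = pattern.lastIndex;   // position just past the match

const second = pattern.exec(input);          // search starts at lastIndex, so nothing is found
const indexAfterSecond = pattern.lastIndex;  // a failed exec resets lastIndex to 0

const third = pattern.exec(input);           // searching from 0 again, so it matches

console.log(first && first[0], indexAfterFirst);
console.log(second, indexAfterSecond);
console.log(third && third[0]);
```

The second call fails even though the input is unchanged, which is the same effect seen when a shared regex is applied to successive lines of a file.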

So one solution is to reset regex.lastIndex manually at the start of every es.mapSync callback, like this:

let s = fs.createReadStream(filePath)
    .pipe(es.split())
    .pipe(es.mapSync(function (line) {
            regex.lastIndex = 0; //Reset the RegExp index
            let values = regex.exec(line.trim())
            if (values && values.groups) {
                console.dir(values.groups)
            } else {
                console.log(`COULD NOT MATCH`)
                console.log(line)
            }
        }).on('error', function (err) {
            console.log('Error while reading file.', err);
        })
            .on('end', function () {
                console.log('Read entire file.')
            })
    )

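Mind you, this only happens because the regex has the g flag and a single regex object, defined outside the callback, is shared across all calls. Creating the regex inside the mapSync() callback would have the same effect as the reset, since each call would get a fresh object with lastIndex at 0. However, resetting lastIndex is simpler and probably more performant.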
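If each line only needs a single match, a third option may be to test lines with a copy of the regex that lacks the g flag, since exec on a regex without g (or y) always starts at index 0. The sketch below assumes a much shorter stand-in pattern and inline sample lines instead of the question's regex and file stream; the names globalPattern and stateless are hypothetical.

```javascript
// Hypothetical simplified pattern standing in for the long regex in the question.
const globalPattern = /(?<email>[^|\s]+@[^|\s]+)\s*\|\s*(?<name>[^|]+)/gm;

// Copy of the pattern without the g flag: exec() on it ignores lastIndex
// and always searches from the start of the string.
const stateless = new RegExp(
  globalPattern.source,
  globalPattern.flags.replace('g', '')
);

// Sample lines in place of the file stream.
const lines = [
  'john.doe@gmail.test| John Doe',
  'jane.doe@aol.test| Jane Doe',
  'alicia.random123@gmail.test| Alicia Random',
];

// Every line is matched independently of the previous call.
const emails = lines.map((line) => {
  const match = stateless.exec(line);
  return match && match.groups ? match.groups.email : null;
});

console.log(emails);
```

Whether this is preferable to resetting lastIndex depends on whether the g flag is needed elsewhere, for example when the same regex is also run over a whole multi-line string.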