
I'm trying to build a regex for my NodeJS (12.8.0) project that extracts the plaintext body from the .eml files of spam emails (I'm building a simple spam filter for fun).

For this, I have written this regex:

[-]{14}[0-9]*\s.+[\s]+.+(?:[\s]*)([\s\S]+)[\s]{3}[-]{14}[0-9]+[\r\n]

When I use this regex in NodeJS, however, I get a value of null instead of the content of the mail.

const regexp = new RegExp("[-]{14}[0-9]*\s.+[\s]+.+(?:[\s]*)([\s\S]+)[\s]{3}[-]{14}[0-9]+[\r\n]");
let matches = content.match(regexp);
console.log(matches);

I have added my regex on regex101.com and it mostly works. Interestingly, it reports that it found Group 1 and shows the right content, but it doesn't highlight which lines the group covers (as it does for the Full Match).
More interesting still, when I switch the flavor to PCRE, it works perfectly fine (and even highlights the lines).
Note that the regex101 demo contains an actual sample mail.

EDIT: As per @CertainPerformance's suggestion, I have updated the code to the following. Unfortunately, this returns false instead of true:

const regexp = /[-]{14}[0-9]*\s.+[\s]+.+(?:[\s]*)([\s\S]+)[\s]{3}[-]{14}[0-9]+[\r\n]/;
let matches = regexp.test(content);
console.log(matches); // false

as well as the following, which still returns null:

const regexp = /[-]{14}[0-9]*\s.+[\s]+.+(?:[\s]*)([\s\S]+)[\s]{3}[-]{14}[0-9]+[\r\n]/;
let matches = content.match(regexp);
console.log(matches); // null

EDIT 2: Tested the regex in PHP and it works perfectly fine... Seems like something must be derping out...

EDIT 3: Adding the entire snippet of code in the hope that someone can spot the issue...

const fs = require('fs');

const pattern = /[-]{14}[0-9]+[\s].+[\s]+.+(?:[\s]*)([\s\S]*)[\s]{3}[-]{14}[0-9]+[\r\n]/;
const spamFolder = './datasets/spam/';
fs.readdir(spamFolder, (err, files) => {
  if (err) return console.log('Unable to scan directory: ' + err);

  // Loop over each file
  files.forEach(file => {
    // Read the file
    var contents = fs.readFileSync(spamFolder + file, 'utf8');
    var matches = contents.match(pattern);
    console.log(matches); // null
  });
});
Finlay Roelofs
  • [Don't use `new RegExp` unless you need to construct a regex dynamically](https://stackoverflow.com/questions/17863066/why-do-regex-constructors-need-to-be-double-escaped#answer-55793086) – CertainPerformance Aug 10 '19 at 23:14
  • @CertainPerformance While it is good feedback, it doesn't affect the issue at all. – Finlay Roelofs Aug 10 '19 at 23:16
  • Sure it does, that's exactly the problem you're running into - since you're using the `new RegExp` constructor, you must double-escape the backslashes, but you aren't. Avoid `new RegExp` and the match will work, just like it does on the demo sites you're looking at (assuming that your flags and such match up). – CertainPerformance Aug 10 '19 at 23:18
  • The difference between the PCRE and JS visuals on regex101 is just that, visual only - if you see a group with PCRE enabled, and the syntax works in JS as well, that same group with the same position will be matched when you actually run the regex in JS. For some reason, regex101 doesn't highlight the different matched groups when JS is selected, but that doesn't mean the pattern isn't matching those groups. (consider it to be a regex101 visual bug) – CertainPerformance Aug 10 '19 at 23:22
  • It neither affects nor solves the issue. The demo sites I'm looking at show that something is going wrong. I've tried using the literal and it doesn't change the output at all. Also, this shouldn't matter that much since the output on regex101.com is also not right. (Just take a look at it: it matches but doesn't actually capture like it does with the `PCRE` flavor) – Finlay Roelofs Aug 10 '19 at 23:22
  • I tried running your regex in JS (without `new RegExp`) and it appears to match as desired – CertainPerformance Aug 10 '19 at 23:23
  • I have updated my NodeJS code (using the literals and using the `.test()` method) and it now returns `false` (instead of an expected `true`). Using `content.match(regexp)` still returns `null`. – Finlay Roelofs Aug 10 '19 at 23:27
  • Works fine here https://jsfiddle.net/nukc86a0/ – CertainPerformance Aug 10 '19 at 23:39
  • Hm... this is odd... it does seem to work fine in that fiddle yea... – Finlay Roelofs Aug 10 '19 at 23:51
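
The double-escaping point raised in the comments can be checked in isolation. The following is a minimal sketch, not a diagnosis of the full problem: it uses a shortened form of the pattern and a hypothetical sample string (neither is taken from the real mail) to show that `"\s"` inside an ordinary string literal reaches `new RegExp` as a plain `s`, while the regex literal and the double-escaped string keep the whitespace class.

```javascript
// In a string literal, "\s" is not a recognised escape, so the backslash
// is dropped and the RegExp constructor only ever sees the letter "s".
const fromString = new RegExp("[-]{2}[0-9]*\s");
const fromEscapedString = new RegExp("[-]{2}[0-9]*\\s");
const fromLiteral = /[-]{2}[0-9]*\s/;

// Hypothetical sample line, made up for this illustration.
const sample = "--123 boundary";

// .source shows the pattern each RegExp object actually holds.
console.log(fromString.source);        // [-]{2}[0-9]*s
console.log(fromEscapedString.source); // [-]{2}[0-9]*\s

console.log(fromString.test(sample));        // false
console.log(fromEscapedString.test(sample)); // true
console.log(fromLiteral.test(sample));       // true
```

Logging `regexp.source` next to the failing `match` call is a cheap way to confirm which pattern is really being run. It would not explain why the literal form still returns `null` in the Node script while matching in the fiddle, so the file contents read from disk would still need to be compared with the text pasted into the demo.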

0 Answers