How to match text with or without a trailing token using negative lookahead in a JavaScript regex

Question

Suppose I have a comma-separated string of names, where each name may or may not be followed, after another comma, by a token from a list like

var tokens=['Inc.','Ltd','LLC'];

so the string is like

var companies="Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC";

I want to obtain this array as output

var companiesList = [
    "Apple Inc.",
    "Microsoft Inc.",
    "Buzzfeed",
    "Treasure LLC"
    ];

So I first built a RegExp like this

var regex=new RegExp("([a-zA-Z&/? ]*),\\s+("+token+")", "gi" )

from which I get the matches, and then I search each match with a regex like

var regex=new RegExp("([a-zA-Z&/? ]*),\\s+("+item+")", "i" )

for each of the tokens:

tokens.forEach((item) => {
    var regex = new RegExp("([a-zA-Z&/? ]*),\\s+(" + item + ")", "gi")
    var matches = companies.match(regex) || []
    console.log(item, regex.toString(), matches)
    matches.forEach((m) => {
        var regex = new RegExp("([a-zA-Z&/? ]*),\\s+(" + item + ")", "i")
        var match = m.match(regex)
        if (match && match.length > 2) {
            var n = match[1].trim();
            var c = match[2].trim();
            companiesList.push(n + ' ' + c);
        }
    });
});

In this way I can capture the tokens and concatenate matching groups 1 and 2.

var tokens = ['inc.', 'ltd', 'llc'],
  companies = "Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC",
  companiesList = [];
tokens.forEach((item) => {
  var regex = new RegExp("([a-zA-Z&/? ]*),\\s+(" + item + ")", "gi")
  var matches = companies.match(regex) || []
  console.log( item, regex.toString(), matches )
  matches.forEach((m) => {
    var regex = new RegExp("([a-zA-Z&/? ]*),\\s+(" + item + ")", "i")
    var match = m.match(regex)
    if (match && match.length > 2) {
      var n = match[1].trim();
      var c = match[2].trim();
      companiesList.push(n + ' ' + c);
    }
  });
});

console.log(companiesList)

The problem is that I miss the comma-separated names that have no token after the comma, like Buzzfeed.

The idea is to use a non-capturing group with a negative lookahead (see the related question about non-capturing groups in a regex match):

/([a-zA-Z]*)^(?:(?!ltd).)+$/gi

But this way I get no match at all when the token is present anywhere in the input string:

"Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure LLC".match( /([a-zA-Z]*)^(?:(?!llc).)+$/gi )

whereas I want to match only the names that do not have a token, so I would like to get the opposite of what I had before:

["Buzzfeed"]

So how can I negate or modify the previous code so that it works in both cases and produces the composed array at the end:

var companiesList = [
        "Apple Inc.",
        "Microsoft Inc.",
        "Buzzfeed",
        "Treasure LLC"
        ];

You misunderstand the answer in the popular SO question about matching a string not containing a word. You need `(?!ltd|etc)` lookahead where you may add alternatives after a pipe. — Wiktor Stribiżew, Nov 03 '16 at 16:36
@WiktorStribiżew Hmm, that's possible, but check the code: I have some patterns to respect, like `Name, Inc.`, so I have to match both this pattern and the one without a token. — loretoparisi, Nov 03 '16 at 16:37
To just match Buzzfeed, you need to exclude matching those `LLC`, etc. and also all words that are followed by them. [This](https://jsfiddle.net/wav6gaob/) does not look nice. Maybe adeneo is suggesting a better way out. — Wiktor Stribiżew, Nov 03 '16 at 16:40
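
The point made in these comments can be illustrated with a small sketch. It assumes the lookahead is extended to `(?!inc\.|ltd|llc)` as suggested, and it uses illustrative names (`wholeString`, `isToken`, `standalone`) that do not appear in the question. Because `^` and `$` anchor the whole input, a single token anywhere rejects the entire string; applied per item, the pattern rejects only the tokens themselves, so names that are followed by a token still have to be excluded separately.

```javascript
// Pattern from the question, with several alternatives in the lookahead.
// No "g" flag, so test() keeps no state between calls.
const wholeString = /^(?:(?!inc\.|ltd|llc).)+$/i;

const companies = "Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC";

// Anchored on the whole input: one token anywhere rejects everything.
const onWholeInput = wholeString.test(companies);
console.log(onWholeInput); // false

// Applied per item, only the items that contain a token are rejected.
const items = companies.split(/,\s*/);
const withoutToken = items.filter((item) => wholeString.test(item));
console.log(withoutToken); // ["Apple", "Microsoft", "Buzzfeed", "Treasure"]

// To keep only names with no token after them, also look at the next item.
const isToken = (s) => /^(?:inc\.|ltd|llc)$/i.test(s);
const standalone = items.filter(
  (item, i) =>
    !isToken(item) && !(i + 1 < items.length && isToken(items[i + 1]))
);
console.log(standalone); // ["Buzzfeed"]
```

The last step is plain array logic rather than a single regex, which is one way to avoid the pattern that "does not look nice".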

adeneo · Accepted Answer · 2016-11-03T16:41:22.787

1

Wouldn't it be a lot easier to just reduce it, and check the token list as you go?

var tokens    = ['Inc.','Ltd','LLC'];
var companies = "Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC";

var result    = companies.split(',').reduce( (a,b,i) => {
    return tokens.indexOf(b.trim()) === -1  ? a.push(b.trim()) : a[a.length-1] += b,a;
}, []);

console.log(result);

edited Nov 03 '16 at 16:41

answered Nov 03 '16 at 16:38

adeneo


Hahaha, too much `RegExp` overhead in my mind. I assume your solution should work pretty well in most cases! +1. I will try it in my scenario, but it seems extremely clever. – loretoparisi Nov 03 '16 at 16:41
It's really just a suggestion, but it seemed a lot easier than that regex nightmare, and yes, it should work with anything, as long as the value is in the token list, and it would be easy to make it case-insensitive, trim off whitespace, or anything else you'd need. – adeneo Nov 03 '16 at 16:42
Absolutely, thanks. Since there are odd cases like "Inc." and "inc", etc., at some point it may be better to have them all in, and the regex nightmare is real! – loretoparisi Nov 03 '16 at 17:27
Also, [here it is](https://jsfiddle.net/w6nxuzmq/) a little less golfed, probably easier to work with. – adeneo Nov 03 '16 at 17:43
thanks. It seems it only fails when there is a trailing `","` like `"Apple Inc., Facebook, "`, so it will push a `""` in the array, see [here](https://jsfiddle.net/02me7cm0/) – loretoparisi Nov 04 '16 at 17:22
Sure, if you change the string to a different format, where you have commas in other places as well, it won't work as expected. In this case it should be easy to filter out the empty string though -> https://jsfiddle.net/02me7cm0/1/ – adeneo Nov 04 '16 at 19:27
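
For reference, here is a less golfed sketch of the same reduce idea that folds in the two points raised in these comments: case-insensitive token lookup and dropping the empty string left by a trailing comma. It is not the code from the linked fiddles; the helper name `mergeCompanyTokens` and the use of a `Set` are assumptions made for illustration.

```javascript
// Hypothetical helper: joins each token onto the name that precedes it.
function mergeCompanyTokens(input, tokens) {
  // Lowercased token set, so "inc." and "Inc." are treated alike.
  const tokenSet = new Set(tokens.map((t) => t.toLowerCase()));

  return input
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part !== "") // drops the "" left by a trailing comma
    .reduce((list, part) => {
      if (tokenSet.has(part.toLowerCase()) && list.length > 0) {
        // A token: append it to the previous name.
        list[list.length - 1] += " " + part;
      } else {
        // A name (or a token with nothing before it): start a new entry.
        list.push(part);
      }
      return list;
    }, []);
}

const result = mergeCompanyTokens(
  "Apple, inc., Microsoft, Inc., Buzzfeed, Treasure, LLC, ",
  ["Inc.", "Ltd", "LLC"]
);
console.log(result);
// ["Apple inc.", "Microsoft Inc.", "Buzzfeed", "Treasure LLC"]
```

The token keeps the casing it had in the input; normalizing it to the spelling in the token list would be a separate design choice.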

score 1 · Answer 2 · answered Nov 03 '16 at 16:45

You could use a regex for splitting.

var companies = "Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC";

console.log(companies.split(/,\s(?!Inc\.|Ltd|LLC)/i).map(s => s.replace(', ', ' ')));

answered Nov 03 '16 at 16:45

Nina Scholz


This also works, but how to apply to an array of tokens of arbitrary length? – loretoparisi Nov 03 '16 at 16:51
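
One possible way to handle a token array of arbitrary length is to build the splitting regex from the array. This is a sketch, not part of the original answer: `escapeRegExp`, `alternation` and `splitter` are illustrative names, and the extra `(?:,|$)` after the alternation is an added assumption so that a token only counts when it is a whole item.

```javascript
// Escape regex metacharacters so that "Inc." matches a literal dot.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const tokens = ["Inc.", "Ltd", "LLC"];
const companies = "Apple, Inc., Microsoft, Inc., Buzzfeed, Treasure, LLC";

// Builds /,\s(?!(?:Inc\.|Ltd|LLC)(?:,|$))/i from the token list:
// split on ", " unless the next item is exactly one of the tokens.
const alternation = tokens.map(escapeRegExp).join("|");
const splitter = new RegExp(",\\s(?!(?:" + alternation + ")(?:,|$))", "i");

const companiesList = companies
  .split(splitter)
  .map((s) => s.replace(", ", " ")); // "Apple, Inc." becomes "Apple Inc."

console.log(companiesList);
// ["Apple Inc.", "Microsoft Inc.", "Buzzfeed", "Treasure LLC"]
```

Escaping matters here because the tokens are inserted into the pattern as text; without it, the dot in `Inc.` would match any character.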
