Matching the last best word using regex js

Question

I have a regex for matching the words that belong to the text within the brackets. For example, this regex (https://regex101.com/r/TvweUj/3):

/\b(\w)[-'\w]* (?:[-",/\\*&'\w]* ){1,}\(\1[A-Z]{1,}\)/gi

matches "MIDI The USB Device Class Definition for MIDI Devices transmits Music Instrument Digital Interface (MIDI)" instead of only matching the last 4 words, "Music Instrument Digital Interface".

How do I change my regex to match only the closest words instead of starting from the MIDI The USB Dev*****?

The fourth bird · Answer 1 · 2020-08-03T10:02:26.903

You might use 4 capturing groups, followed by a positive lookahead that uses 4 backreferences to assert the same uppercase chars between the parentheses:

\b([A-Z])\w+ ([A-Z])\w+ ([A-Z])\w+ ([A-Z])\w+(?= \(\1\2\3\4\))


Instead of using \w only, you could use the character classes that you use in the question like [-",/\\*&'\w]*
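
As a quick check, the four-group pattern can be run against the sentence from the question. This is only a minimal sketch; the variable names are made up for illustration.

```javascript
// Sentence taken from the question.
const text = "MIDI The USB Device Class Definition for MIDI Devices transmits Music Instrument Digital Interface (MIDI).";

// Four capture groups hold the initials; the lookahead asserts that the same
// four letters follow in parentheses, without making them part of the match.
const fourWords = /\b([A-Z])\w+ ([A-Z])\w+ ([A-Z])\w+ ([A-Z])\w+(?= \(\1\2\3\4\))/;

const m = text.match(fourWords);
const phrase = m ? m[0] : null;
console.log(phrase); // Music Instrument Digital Interface
```

Earlier runs of four capitalised words, such as "MIDI The USB Device", are rejected because they are not followed by their own initials in parentheses.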

A broader pattern could repeat an uppercase char followed by 1+ word chars \w+ (or use \w* to repeat 0+ word chars) and assert that what follows is only uppercase chars between parentheses.

\b[A-Z]\w+(?: [A-Z]\w+)*(?= \([A-Z]+\))
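
A small sketch of how this broader pattern behaves, assuming the sentence from the question and a second, made-up variation in which capitalised words directly precede the phrase (variable names are illustrative only):

```javascript
const broad = /\b[A-Z]\w+(?: [A-Z]\w+)*(?= \([A-Z]+\))/;

// The lowercase word "transmits" interrupts the run of capitalised words,
// so the match starts at "Music".
const sentence = "MIDI The USB Device Class Definition for MIDI Devices transmits Music Instrument Digital Interface (MIDI).";
const broadMatch = sentence.match(broad);
const broadPhrase = broadMatch ? broadMatch[0] : null;
console.log(broadPhrase); // Music Instrument Digital Interface

// With no lowercase word in between, every capitalised word is included.
const allCaps = "MIDI Devices Transmits Music Instrument Digital Interface (MIDI)";
const allCapsMatch = allCaps.match(broad);
const allCapsPhrase = allCapsMatch ? allCapsMatch[0] : null;
console.log(allCapsPhrase); // MIDI Devices Transmits Music Instrument Digital Interface
```

The pattern does not compare the initials with the chars in the parentheses, so it relies on a non-capitalised word separating the phrase from what comes before it.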


If the number of chars between the parentheses is variable and should equal the number of words before them, you could use 2 capturing groups and compare the number of split words with the number of uppercase chars between the parentheses.

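// Group 1: the capitalised words; group 2: the uppercase chars in the parentheses.
const pattern = /\b([A-Z][a-z]*(?: [A-Z][a-z]*)*) \(([A-Z]+)\)/;
// True when both arrays have the same length and every char equals the
// first char of the word at the same index.
const compare = (chars, words) =>
  chars.length === words.length && chars.every(
    (value, index) => value === words[index].charAt(0)
  );
[
  "transmits Music Instrument Digital Interface (MIDI).",
  "transmits Music Instrument Digital Interface (MADI).",
  "transmits Music Instrument Digital Interface (MID)."
].forEach(s => {
  const m = s.match(pattern);
  // Guard against strings that do not match the pattern at all.
  const ok = m !== null && compare(m[2].split(''), m[1].split(' '));
  console.log((ok ? "Ok -> " : "Not ok -> ") + s);
});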
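
Building on the same idea of combining a regex with a little code, here is one possible sketch that returns only the last words before the parentheses, one word per char inside them. The helper name `expandAcronym` and the looser first group are assumptions made for illustration and are not part of the patterns above.

```javascript
// Hypothetical helper: returns the words that the chars in the parentheses
// stand for, or null when the initials do not line up.
function expandAcronym(text) {
  // Group 1: everything before the parentheses; group 2: the uppercase chars.
  const m = text.match(/(.*?) \(([A-Z]+)\)/);
  if (m === null) return null;

  const acronym = m[2];
  // Keep only as many trailing words as there are chars in the acronym.
  const words = m[1].trim().split(/\s+/).slice(-acronym.length);
  if (words.length !== acronym.length) return null;

  // Every word has to start with the char at the same position.
  const initialsMatch = words.every((word, i) => word.charAt(0) === acronym[i]);
  return initialsMatch ? words.join(' ') : null;
}

console.log(expandAcronym("MIDI Devices Transmits Music Instrument Digital Interface (MIDI)"));
// Music Instrument Digital Interface
console.log(expandAcronym("transmits Music Instrument Digital Interface (MADI)."));
// null
```

Because the words are counted from the end, capitalised words before the phrase do not affect the result. The comparison is case-sensitive and only looks at the first pair of parentheses in the string.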

edited Aug 03 '20 at 10:02

answered Aug 03 '20 at 06:47

The fourth bird


The first one works only if there are 4 characters inside (....), right? – rootkonda Aug 03 '20 at 06:57
@rootkonda That is correct, and the 4 characters have to be the same as those captured in the groups. – The fourth bird Aug 03 '20 at 06:58
OK. I was looking for a way to generalize this for any number of characters inside ( ). Maybe it's not feasible using regex alone, unless we combine it with some code? – rootkonda Aug 03 '20 at 07:03
@rootkonda If you want to match the exact number of chars inside the parentheses, you could for example get the match outside the parentheses, split it on a space, and check that the number of items equals the number of chars inside. Using for example 2 capturing groups https://regex101.com/r/yGIz8w/1 – The fourth bird Aug 03 '20 at 07:07
@Thefourthbird do you mind me asking why there is a `?:` non capturing group? – Sven.hig Aug 03 '20 at 07:22
@Thefourthbird - If the text contains uppercase letters before M, then those will also get matched, won't they? If the given string is "MIDI Devices Transmits Music Instrument Digital Interface (MIDI)", then the match will start from MIDI. – rootkonda Aug 03 '20 at 07:29
@rootkonda It will be matched, but that can be prevented by matching only lowercase chars a-z like `\b([A-Z][a-z]*(?: [A-Z][a-z]*)*) \(([A-Z]+)\)` https://regex101.com/r/3GgdPf/1 – The fourth bird Aug 03 '20 at 07:38
@Sven.hig There is a non capturing group `(?: [A-Z]\w+)*` to repeat what is inside as a whole. – The fourth bird Aug 03 '20 at 07:39
@Thefourthbird thank you for your response, but forgive my ignorance: when I remove `?:` nothing changes, and I thought a non capturing group was there to keep a group from showing up in the match result. Am I wrong? Can you elaborate? – Sven.hig Aug 03 '20 at 08:17
@Sven.hig If you remove the non capturing group like this `\b[A-Z]\w+ [A-Z]\w+(?= \([A-Z]+\))` you will only match [Digital Interface](https://regex101.com/r/FKVcc9/1). If you turn the non capturing group into a capturing group `\b[A-Z]\w+( [A-Z]\w+)*(?= \([A-Z]+\))` you will capture the value of the last iteration which is [Interface](https://regex101.com/r/GjJet6/1). It will work but you don't need the capturing group at all for the full match. – The fourth bird Aug 03 '20 at 08:25
I got it, and thank you very much for your time. I would highly appreciate it if you could recommend any good resources to read more about this. – Sven.hig Aug 03 '20 at 08:48
@Sven.hig No problem, these pages are for example a good read about groups: https://www.rexegg.com/regex-disambiguation.html#noncap and https://www.regular-expressions.info/captureall.html and https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions – The fourth bird Aug 03 '20 at 09:18
