1

I would like to retrieve the shortest match from a long text where strings are repeated throughout the text. However, matches within text that has already been matched aren't being found.

Here's a simplified version of the issue I'm facing:

  • Code: "ababc".match(/a.+c/g)
  • Observed result: ["ababc"]
  • Expected result: ["ababc", "abc"]

Therefore I'm wondering whether there is an easier way to retrieve the substring "abc" than manually writing recursive code to search within matches.

greiner
  • 586
  • 2
  • 11
  • 1
    How would you handle `ababcabc`, do you want all possible permutations or is the usual [overlapping match handling](http://stackoverflow.com/questions/8235929/how-can-i-get-a-regex-to-find-every-match-in-javascript) enough? – Sebastian Proske Feb 19 '17 at 00:43
  • I don't think this problem can be solved with regex alone. The regex engine only gives up matched characters when it has to, to satisfy the rest of the pattern. "This match is shorter" isn't part of that criteria – CrayonViolent Feb 19 '17 at 00:45
  • @SebastianProske The ultimate goal is to find the shortest match so for that it only needs to find `["abc", "abc"]` in your example - no need for all permutations. Modifying the "lastIndex" property, as mentioned in one of the answers to the question you linked to, looks quite good so thanks for the link. – greiner Feb 19 '17 at 01:57

3 Answers3

1

As mentioned in my comment, you can't do what you want with regex alone.

You gave a simplified example, so I'm not sure how far this will take you, but here is my stab at doing what you are looking for. I have a sneaking suspicion your "a" and "c" characters are not the same, so you will need to modify this accordingly (e.g. pass them as arguments to the function).

function getShortestMatch(str) {
  var str = str || ''; 
  var match, 
  index, 
  regex,
  length,
  results = [];
  // iterate along the string one character at a time
  for (index = 0, length = str.length; index < length; index++) {
    // if the current character is 'a' (the beginning part of our substring match)
    if (str[index] === 'a') {
      // create a new regex that first consumes everything up to 
      // the starting character. Then matches for everything from there to 
      // the ending substring char 'c'. It is a lazy match so it will stop 
      // at the first matched ending char 'c'
      regex = new RegExp('^.{' + index + '}(a.+?c)');
      match = str.match(regex);
      // if there is a match, then push to the results array
      if (match && match[1]) {
        results.push(match[1]);
      }
    }
  }
  // sort the results array ascending (shortest first)
  results.sort(function(a,b){
    return a.length - b.length;
  });

  // log all results matched to the console for sake of example
  console.log(results); 

  // return the first (shortest) element
  return results[0];
}

Example

getShortestMatch('ababcabbc');

// output showing all results found (from console.log in the function)
["abc", "abbc", "ababc"]

// return value
"abc"

Note: This function does not attempt to find all possible matches to "everything between an 'a' and a 'c'", since your question was about finding the shortest one. If for some reason you want all possible matches to that, then a greedy .+ regex would be thrown into the mix.

CrayonViolent
  • 32,111
  • 5
  • 56
  • 79
  • Obviously bit irrelevant because your output includes the correct result of `abc`... but just wondering why you think there is no match for `ababcabbc` with your example ? – Robin Mackenzie Feb 19 '17 at 01:47
  • @RobinMackenzie Yes, if you were to make a list of all possible values for everything inbetween an "a" and a "c" for "ababcabbc" then yes, the full string would be on the list. My note was to state that my function will not match for that (because it only does a lazy match for shortest values at each index) – CrayonViolent Feb 19 '17 at 02:21
0

Loop through substrings starting from each successive character (using slice), matching against a regexp which is anchored to the start of the string (^), and uses non-greedy matching (?):

const input = "ababc";
const regexp = /^a.+?c/;

const results = [];
    
for (var i = 0; i < input.length; i++) {
  var match = input.slice(i).match(regexp);
  if (match) results.push(match[0]);
}

console.log("all results are", results);
var shortest = results.sort((a, b) => a.length - b.length)[0];
console.log("shortest result is", shortest);
0

This is the answer I went with due to its effectiveness, simplicity and efficiency:

let seq = "us warship";
let source = "The traditional US adversary has also positioned a spy ship off the coast of Delaware and carried out flights near a US Navy warship, concerning American officials.";

let re = new RegExp(`\\b${seq.replace(/\s/g, "\\b.+?\\b")}\\b`, "gi");
let snippet = null;
let matches;
while (matches = re.exec(source)) {
  let match = matches[0];
  if (!snippet || match.length < snippet.length) {
    snippet = match;
  }
  re.lastIndex -= (match.length - 1);
}
console.log(snippet); // "US Navy warship"

Source: https://stackoverflow.com/a/8236152/1055499

Community
  • 1
  • 1
greiner
  • 586
  • 2
  • 11