Extract wildcard string (* asterisk only) from a set of strings/texts?

Question

Is there an algorithm or code/lib (NodeJS preferably), to extract wildcard string from a set of strings or texts.

For example, following are the set of strings as input (forgive my english):

apple pie is not too so good
apple pie is not so good
apple pie is too good

I shall be able to extract a wildcard string with asterisk * only, like this:

apple pie is * good

There are other special characters in wildcard, but I want to use only * asterisk.

In a way we can say that I need to extract regex from the input strings. Now is there a library or algorithm to achieve this?

Please let me know if more information required.

You could extract whatever you need form your string using regex. Do you want a regex replacement using new characters like your "*" ?? — JRichardsz, May 30 '23 at 21:49
Do you mean that you want one string with wildcards that match all of the input strings? If so how do you decide which matching string is the best - as `*` or `apple *` would also be valid (but less specific), whereas `apple pie is *o*o good` is even more specific. — Hans Olsson, May 31 '23 at 08:20
@savaratkar Your question needs more clarity. It's very difficult to tell what you're looking for. Are you looking for an algorithm or a regex? Is this an exercise you're working on? (if so please provide all details of the exercise otherwise it will be very difficult to help) https://stackoverflow.com/help/how-to-ask — Andria, May 31 '23 at 16:53
@JRichardsz yes pretty much that. Even if I get regex from the input strings, that will do my job. — Savaratkar, Jun 02 '23 at 04:21

Scott Sauyet · Answer 1 · 2023-06-06T13:19:48.220

I'm going to make the assumption that you want to extract a single wildcard that will work for all your strings. If not, I don't know how to get started.

Also, for this first pass, I will correct your English, changing "apple pie is not too so good" to "apple pie is not very good". The reason will be clear below.

To solve this, we can break it down to finding a common prefix to all the strings, and a common suffix to them all, and joining these together with our * wildcard character. It might look like this:

const reverse = (s) => [...s] .reverse () .join ('')

const commonPrefix = ([first, ...rest]) =>
  first .slice (0, [... first] .findIndex ((c, i) => rest .some (s => s .at (i) !== c)))

const commonSuffix = (ss) =>
  reverse (commonPrefix (ss .map (reverse)))

const extractWildcard = (str) => 
  commonPrefix (str) + '*' + commonSuffix (str)


console .log (extractWildcard ([
  'apple pie is not very good', 
  'apple pie is not so good', 
  'apple pie is too good'
])) //=> 'apple pie is * good'

Our commonPrefix function finds the first index at which some string is different from the first one, then slices the string up to that point. commonSuffix uses a string reverse helper function and reverses all the strings, finds their common prefix and reverses that value. (There are many other ways of doing this, some of them much more efficient, but they involve calculating the longest string to start, and are all more complex. If there was a performance problem here, I would look to use one of them instead.)

But there is a problem here, and it's shown by using your original "apple pie is not too so good":

extractWildcard ([
  'apple pie is not too so good', 
  'apple pie is not so good', 
  'apple pie is too good'
]) //=> "apple pie is *o good"

The extra o next to the wildcard is unwanted. To fix that, we might make a more complex version of extractWildcard that deals with spaces:

const extractWildcard2 = (s, pre = commonPrefix (s), suf = commonSuffix (s)) => 
   pre .slice (0, pre .lastIndexOf (' ')) + ' * ' + suf .slice (suf .lastIndexOf (' ') + 1)


extractWildcard2 ([
  'apple pie is not too so good', 
  'apple pie is not so good', 
  'apple pie is too good'
]) //=> "apple pie is * good"

Update

Here is a more efficient version of commonSuffix:

const commonSuffix = (ss, length = Math .max (...ss .map (s => s .length))) =>
  (ss [0] || '') .slice (length - Array .from ({length}, (_, i) => i + 1) .findIndex (
    (i) => console .log ({i}) || ss .some (s => s .at (-i) !== ss [0] .at (-i))
  ))

I would stick with the simpler one unless this proved to be an actual bottleneck in an application, though.

Update 2: Simplicity

(This is not about the actual problem, but a discussion on code simplicity; please feel free to ignore it.)

A comment said:

It is pretty much unreadable code even for me [...]. simpler, more readable algorithms do exist for common prefix.

I would like to argue this point. Rich Hickey has a famous talk, Simple Made Easy, which distinguishes "simple" from "easy". Simple is a fairly objective measure about how few concepts are twined together to provide a solution. Easy is a subjective measure of how familiar the solution seems to a particular person.

By writing declarative code, by avoiding techniques like mutable state, by using expressions rather than statements, and by avoiding temporal effects of variable assignment, I'd argue that this code is fairly simple. It may not be familiar, and therefore may be less easy for some users.

I wrote the code above as it stands; I didn't start somewhere and whittle it down to a minimum. But we can try that. We can start with the most imperative -- and to some, most familiar -- style of code:

const commonPrefix = (strings) => {
  const firstString = strings [0];
  const first = firstString .split ('');
  const rest = strings .slice (1);
  let index = -1;
  let indexFound = false;
  for (let i = 0; i < first .length; i ++) {
    const character = first [i];
    for (let j = 0; j < rest .length; j ++) {
       const testString = rest [j];
       const testCharacter = testString [i]
       if (testCharacter !== character) {
         index = i;
         indexFound = true;
         break;
       }
    }
    if (indexFound) {
      break;
    }
  }
  const characters = first .slice (0, index); 
  const result = characters .join ('');
  return result;
}

This may be more familiar/easy to some. But let's look at the implementation. We have twelve local variables. We use a pair of nested loops, each including an if statement. We perform early escapes in both loops in certain circumstances.

If we do a slight bit of cleanup, and recognize that the outer loop is just doing the job that Array.prototype.findIndex already does for us, then we can write what I would argue is a simpler version:

const commonPrefix = (strings) => {
  const firstString = strings [0]
  const first = firstString .split ('')
  const rest = strings .slice (1)
  const index = first .findIndex ((c, i) => {
    for (let j = 0; j < rest .length; j ++) {
       const testString = rest [j]
       if (testString [i] !== c) {
         return true
       }
    }
    return false
  })
  return first .slice (0, index) .join ('')
}

We've also eliminated some local variables: testCharacter, characters, and result, with no detriment in readability.

Once again, we should be able to notice that the remaining loop is performing the same job as Array.prototype.some. We can clean that up to leave:

const commonPrefix = (strings) => {
  const first = strings [0] .split ('')
  const rest = strings .slice (1)
  const index = first .findIndex ((c, i) => rest .some (s => s .at (i) !== c))
  return first .slice (0, index) .join ('')
}

We're now approaching what I think of as a simple function. But we can notice that index is defined in one line and used only once, in the very next line, so if we choose, we can simply fold it in, and remove the local variable. Similarly, the parameter strings is only used to define first and rest; we can achieve the same with parameter destructuring, and skip these definitions. At this point, we get back to my initial version:

const commonPrefix = ([first, ...rest]) =>
  first .slice (0, [... first] .findIndex ((c, i) => rest .some (s => s .at (i) !== c)))

Note that this version doesn't use first .split (''), but instead [... first]. Either will work; and either seems equally simple. This code uses s .at (i) instead of s [i], since an earlier verion had a parallel commonSuffix where the at was much easier to use. But we could change that; either choice seems fine.

In all, this code is much simpler than the longest version. It's not the length itself which is important. But a common side-effect of simpler code is shorter code.

Update 3: White-space style

A comment found that my white-space styling detracts from readability. Just for comparison, here's the same code in a more traditional formatting:

const reverse = (s) => [...s].reverse().join('')

const commonPrefix = ([first, ...rest]) =>
  first.slice(0, [... first].findIndex((c, i) => rest.some(s => s.at(i) !== c)))

const commonSuffix = (ss) =>
  reverse(commonPrefix(ss.map(reverse)))

const extractWildcard = (str) => 
  commonPrefix(str) + '*' + commonSuffix(str)


console.log(extractWildcard([
  'apple pie is not very good', 
  'apple pie is not so good', 
  'apple pie is too good'
])) //=> 'apple pie is * good'

I find that over-dense, but I know many others prefer it.

It is pretty much unreadable code even for me, let alone the op. simpler, more readable algorithms do exist for common prefix. — Salman A, May 31 '23 at 21:53
@SalmanA: Added a long update discussing code simplicity and alternate versions. Note that my comment on your answer (and I wasn't the downvoter!) was to note that we have very different understanding of the requirements. — Scott Sauyet, Jun 01 '23 at 14:30
You have a peculiar code formatting style (putting whitespace before the `.` member access operator and before the `()` call syntax is something I've never seen a JavaScript programmer do), which makes your code *harder* to read, but it is certainly simple :) — Domino, Jun 05 '23 at 06:39
@Domino: I keep hoping this style will catch on, but so far no takers! I find it a more deliberated style, adding some slight separation to the different parts of the code. I think I'm going to have to move away from it as it does seem to cause some consternation, but I really do prefer it. Ah well... Added an update with more traditional formatting. — Scott Sauyet, Jun 06 '23 at 13:16
@ScottSauyet Haha, I feel you. Some people hate the way I write my code when it's only up to me too, I don't even put head and body tags in my HTML, the horror! — Domino, Jun 06 '23 at 23:33

score 3 · Accepted Answer · answered Jun 03 '23 at 12:36

I'll assume that the words of the input are considered atomic and should not be broken up. Otherwise a solution for your example input could have kept more original characters in the output:

apple pie is *o*o good

I'll then also assume that these words are separated by spaces and not by other characters, and so a first step would be to split the input phrases into arrays with their words as elements.

Structure of the problem

You could see this as a graph problem, where the nodes are the combined state of which words have already been processed from each phrase, or otherwise put: the tuple of current word-indicies in those phrases. The edges are directed and weighted and represent the "cost" of progressing from one state to another. The cost includes a point for every word that is consumed by a wildcard. If there are multiple solutions with the same cost, then prefer the one(s) that have fewer wildcards. So we could account 0.5 cost for every wildcard that is introduced.

For example: if the words in the input are all different, then the solution is "*", and the cost of this solution is the total count of words plus 0.5 for the single wildcard.

Another example is where all input phrases are exactly the same, like 4 times "this is a phrase", and so that phrase will also be the output. The cost of that solution is 0, as no words were omitted by a using wildcard(s).

Best Search algorithm

Then perform Dijkstra's algorithm on this graph to find a solution, i.e. the end state where all word-indices are at the end of their corresponding phrases.

We can eliminate some search paths by focusing on those where no two consecutive wildcards will be needed. So we should aim to increment the indices in such a way that after a wildcard has been used, the words referenced by the updated indices are all the same. If we decide to use a phrase's current word literally (and not as wildcard), we should find the same word in the other strings (from their current index onwards). To cope for when this is not possible, or still leads to a high cost, we should also consider the scenario where all current words are skipped (all indices increment by 1), using a wildcard.

Priority Queue

As JavaScript has no native priority queue, I have reused the heap implementation from this answer.

Implementation

Here is the implementation of the above described algorithm:

// From https://stackoverflow.com/a/66511107/5459839
const MinHeap={siftDown(h,i=0,v=h[i]){if(i<h.length){let k=v[0];while(1){let j=i*2+1;if(j+1<h.length&&h[j][0]>h[j+1][0])j++;if(j>=h.length||k<=h[j][0])break;h[i]=h[j];i=j;}h[i]=v}},heapify(h){for(let i=h.length>>1;i--;)this.siftDown(h,i);return h},pop(h){return this.exchange(h,h.pop())},exchange(h,v){if(!h.length)return v;let w=h[0];this.siftDown(h,0,v);return w},push(h,v){let k=v[0],i=h.length,j;while((j=(i-1)>>1)>=0&&k<h[j][0]){h[i]=h[j];i=j}h[i]=v;return h}};

function commonPatternFor(strings) {
    const makeKey = JSON.stringify;
    // When there is a tie in the number of removed words, the number of wildcards is minimised:
    const WILDCARD_COST = 0.5; 
    
    // Words are considered atomic:
    const phrases = strings.map(s => s.split(" "));
    // The end state to achieve:
    const end = phrases.map(s => s.length);
    // Heap has entries with [cost, current indexes, previous indexes, isLastStepWildCard]  
    const heap = [ [0, phrases.map(() => 0), null, WILDCARD_COST] ];
    // Register the optimal paths as linked list
    const comeFrom = new Map;
    
    
    while (true) { // Best-first Loop
        // Extract best state so far
        let [cost, curr, prev, wildcardCost] = MinHeap.pop(heap);
        const currKey = makeKey(curr);
        if (comeFrom.has(currKey)) continue; // Already visited this state
        comeFrom.set(currKey, prev);

        if (phrases.some((words, i) => curr[i] === words.length)) {
            // Some phrases have been fully accounted for. Count number of remaining words:
            const diff = phrases.reduce((acc, words, i) => acc + words.length - curr[i], 0);
            if (!diff) { // BINGO!
                // unwind solution
                const sol = [];
                while (true) {
                    const prev = comeFrom.get(makeKey(curr));
                    if (!prev) return sol.reverse().flat().join(" ");
                    const sliceArr = phrases[0].slice(prev[0], curr[0]);
                    const slice = sliceArr.join(" ");
                    if (phrases.every((words, i) => words.slice(prev[i], curr[i]).join(" ") === slice)) {
                        sol.push(sliceArr);
                    } else if (sol.at(-1) != "*") {
                        sol.push("*");
                    }
                    curr = prev;
                }
            }
            // As some phrases have no more remaining words, we use a single wildcard for
            //    covering the remaining words in the other phrases
            MinHeap.push(heap, [cost + diff + wildcardCost, end, curr, 0]); 
        } else {
            // See if there are common words in all phrases starting at their current positions
            let k = 0;
            while (true) {
                const word = phrases[0][curr[0]+k];
                if (word === undefined || !phrases.every((words, i) => words[curr[i]+k] === word)) break;
                k++;
            }
            if (k) { // Yes, all phrases have k common words starting at their current positions
                MinHeap.push(heap, [cost, curr.map(i => i + k), curr, WILDCARD_COST]); 
                // Also consider alternative scenarios where we introduce a wildcard
                phrases.forEach((words, i) => {
                    const k = words.indexOf(words[curr[i]], curr[i] + 1);
                    if (k < 0) return;
                    const diff = k - curr[i];
                    const next = [...curr];
                    next[i] = k;
                    MinHeap.push(heap, [cost + diff + wildcardCost, next, curr, 0]); 
                });
            } else { // Not all current words are the same: we must use a wildcard
                phrases.forEach((words, i) => {
                    const word = words[curr[i]]; // Try to find this word in the other phrases
                    let diff = 0;
                    const next = phrases.map((words2, j) => {
                        let k = words2.indexOf(word, curr[j]);
                        diff += k - curr[j];
                        return k;
                    });
                    if (!next.includes(-1) && diff) {
                        MinHeap.push(heap, [cost + diff + wildcardCost, next, curr, 0]); 
                    }
                });
                // Also consider the scenario where a wildcard is used to cover for ALL current words
                MinHeap.push(heap, [cost + phrases.length + wildcardCost, curr.map(i => i + 1), curr, 0]); 
            }
        }
    }
}

const tests = [
    [
        "apple pie is not very good", 
        "apple pie is not so good", 
        "apple pie is too good"
    ], [
        "apple pie is not very good and for me that is fine", 
        "the apple pie is not as good as they told me", 
        "apple pie is too good for me and you"
    ], [
        "this is such a very common old phrase this is a common phrase", 
        "this is a common phrase while this is not common", 
        "this is a common phrase",
    ], [
        "a b c d e f g h",
        "i j k l m n o p",
        "q r s t u v w x",
        "y z a b c d e f",
        "g h i j k l m n",
        "o p q r s t u v",
        "w x y z a b c d",
    ], [
        "a a a b a a b b a a b a",
        "a a a a a b a b a a b b",
        "a a b a a a b a b a a b"
    ]
];

for (const test of tests) {
    const sol = commonPatternFor(test);
    console.log(sol);
}

This algorithm does not have a good worst-case time complexity, so if you feed it with more and longer phrases, with lots of common words, it will become too slow.

Interesting. Three different answers; three different interpretations of the requirements. I'm trying to figure out if a variant of my approach would fit your interpretation. Nothing yet, but I'll keep thinking. — Scott Sauyet, Jun 04 '23 at 00:34
This is exactly what I was looking for. I am sorry, I couldn't accepted your answer on time for you to get the bounty. But I will buy you a ☕ — Savaratkar, Jun 07 '23 at 09:22
Bounty passed along (and increased). This is a very nice answer! — Scott Sauyet, Jun 08 '23 at 14:10
Definitely worth it! I thought I had the correct interpretation of the problem, so I didn't focus too much on your alternate interpretation, which turned out to the the correct one. But I did think about it and didn't see any obvious approach. It's great to see the work of other minds focusing on such things! — Scott Sauyet, Jun 08 '23 at 14:41

score 0 · Answer 3 · answered Jun 05 '23 at 08:02

This is much simpler, and less robust than other suggested solutions here, but if you only want to extract the phrases in the asterisk place, this will do:

var exp = 'apple pie is * good'
var input = `apple pie is not too so good
apple pie is not so good
apple pie is too good`;

function matchAsterisk(exp, input){
    return input
        .replace(/\n/g, '')
        .split(new RegExp(exp.split('*').join('|'), 'g'))
        .filter(i => i);
}

console.log(matchAsterisk(exp, input))

score 0 · Answer 4 · answered Jun 06 '23 at 04:30

In python

import re
def find_values_between(text, start, end):
    pattern = f"{re.escape(start)}(.*?){re.escape(end)}"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches

sentence = """
apple pie is amazingly good
apple pie is not good
apple pie is disgustingly good
extra line
"""

values = find_values_between(sentence, "apple pie is ", " good")
for value in values:
    print(value.strip())

Output:

amazingly
not
disgustingly