Search List of Strings for Matched Words in Sequence

Question

I have a List of strings from an external source that always change.

I want to search through each string, find the matched words in sequence between all strings.

Then remove those group of words from each string, leaving only the Title of the book.

Examples

The book named The Lord of the Rings is a classic.
The book named War and Peace is a classic.
The book named The Three Musketeers is a classic.

The book named will be removed.
is a classic. will be removed.
The book named The sequence is not removed, because War and Peace does not start with The.

The sequences must occur between all strings in order to be removed.

The Lord of the Rings
War and Peace
The Three Musketeers

This is an example list. I want to use this on strings other than book titles.

Such as:

I went to The Home Depot.
I went to Walgreens.
I went to Best Buy.

I went to is removed.

The basketball team Los Angeles Lakers are my favorite.
The basketball team New York Knicks are my favorite.
The basketball team Chicago Bulls are my favorite.

The basketball team is removed.
are my favorite. is removed.

Solution

My idea is to search the string from the beginning, group the matched words until it reaches a word that does not match, to find the prefix.

Then do the same starting from the end of the string backwards, to find the suffix.

And it will reveal the Title in the middle.

But I don't know how to go about doing it.

C#

List<string> sentences = new List<string>() 
{ 
    "The book named The Lord of the Rings is a classic.",
    "The book named War and Peace is a classic.",
    "The book named The Three Musketeers is a classic.",
};

List<string> titles = new List<string>() 


for (int i = 0; i < sentences.Count; i++)
{
    // Add Titles to their own List
    //
    titles.Add(FindTitle(sentence[i]));
}


String FindTitle(string sentence) 
{
    string title = string.Empty;

    // compare all strings in List
    // group common word sequences prefix (The book named)
    // group common word sequences suffix (is a classic.)
    // remove those word sequences from each string in List

    return title;
}

this will help you https://stackoverflow.com/questions/18715688/find-common-substring-between-two-strings — Rudresha Parameshappa, Apr 10 '18 at 23:51
get the strings first and then trim them from start and end. — Rudresha Parameshappa, Apr 10 '18 at 23:51
@CamiloTerevinto There have been multiple attempts over several days and now I'm trying a different solution using grouped words instead of individual characters. https://stackoverflow.com/a/49714706/6806643 — Matt McManis, Apr 10 '18 at 23:55
Build n-grams. Where if n is 3, you are finding for any chain of 1, 2, or 3 words, all occurrences of the next word which followed them. — James, Apr 10 '18 at 23:56
So... you posted a duplicate? You should post a bounty on that question, not ask a new one — Camilo Terevinto, Apr 10 '18 at 23:56
@CamiloTerevinto It's not a duplicate, I'm searching for Matched Words in Sequence. Not Title Extracting any possible way. — Matt McManis, Apr 10 '18 at 23:57
@MattMcManis - So your Input will always have a similar string sets or can be mixed? I mean All "The book named xxxxxxx" strings will come as a bunch.. then you want to use same function for other "I went to xxxx" strings etc? Or the Input will a large set of mixed strings... — Prateek Shrivastava, Apr 11 '18 at 02:47
@PrateekShrivastava Input will always have similar string sets, not mixed. I want to use the same function on different sets of strings. — Matt McManis, Apr 11 '18 at 02:54
@MattMcManis -- That solves a bit of complexity - hang on - I may have a solution for you. Writing & testing it out. — Prateek Shrivastava, Apr 11 '18 at 02:58

Prateek Shrivastava · Accepted Answer · 2018-04-11T05:30:32.257

Here's My Approach. I took the performance route - can still be optimized I guess.

Edited: Used regex.Escape to help solve the Special Characters situation.

Used Stopwatch to time My v/s Rufus L's solution.

Using - Rufus L's Test Sentence Input:

private static List<List<string>> GetTestSentences()
{
    return new List<List<string>>
    {
        new List<string>()
        {
            "The book named The Lord of the Rings is a classic.",
            "The book named War and Peace is a classic.",
            "The book named The Three Musketeers is a classic.",
        },
        new List<string>
        {
            "I went to The Home Depot.",
            "I went to Walgreens.",
            "I went to Best Buy."
        },
        new List<string>
        {
            "The basketball team Los Angeles Lakers are my favorite.",
            "The basketball team New York Knicks are my favorite.",
            "The basketball team Chicago Bulls are my favorite."
        },
        new List<string>()
        {
            "The book named Lord of the Flies is a classic (500 This is a test)",
            "The book named Wuthering Heights is a classic (500 This is a test)",
            "The book named Great Expectations is a classic (500 This is a test)",
            "The book named The Lord of the Rings is a classic (500 This is a test)",
            "The book named War and Peace is a classic (500 This is a test)"
        }
    };
}

From Main Method Do:

foreach (var sentenceList in GetTestSentences())
{
    var prefix = FindMatchingPattern(sentenceList[0], sentenceList[1], true);
    var suffix = FindMatchingPattern(sentenceList[0], sentenceList[1], false);

    if (prefix.Length > 0)
        prefix = Regex.Escape(prefix);
    if (suffix.Length > 0)
        suffix = Regex.Escape(suffix);

    foreach (var item in sentenceList)
    {
        var result = Regex.Replace(item, prefix, string.Empty);
        result = Regex.Replace(result, suffix, string.Empty);
        Console.WriteLine($"{item} --> {result}");
    }
    Console.WriteLine(new string('-', Console.WindowWidth));
}

And Here's the magic method:

private static string FindMatchingPattern(string sample1, string sample2, bool forwardDirection)
{
    string shorter = string.Empty;
    string longer = string.Empty;

    if (sample1.Length <= sample2.Length)
    {
        shorter = sample1;
        longer = sample2;
    }
    else
    {
        shorter = sample2;
        longer = sample1;
    }

    StringBuilder matchingPattern = new StringBuilder();
    StringBuilder wordHolder = new StringBuilder();

    if (forwardDirection)
    {
        for (int idx = 0; idx < shorter.Length; idx++)
        {
            if (shorter[idx] == longer[idx])
                if (shorter[idx] == ' ')
                {
                    matchingPattern.Append(wordHolder + " ");
                    wordHolder.Clear();
                }
                else
                    wordHolder.Append(shorter[idx]);
            else
                break;
        }
    }
    else
    {
        while (true)
        {
            if (shorter.Length > 0 && shorter[shorter.Length - 1] == longer[longer.Length - 1])
            {
                if (shorter[shorter.Length - 1] == ' ')
                {
                    matchingPattern.Insert(0, " " + wordHolder);
                    wordHolder.Clear();
                }
                else
                    wordHolder.Insert(0, shorter[shorter.Length - 1]);

                shorter = shorter.Remove(shorter.Length - 1, 1);
                longer = longer.Remove(longer.Length - 1, 1);
            }
            else
            {
                break;
            }
        }
    }

    return matchingPattern.ToString();
}

So the difference with this is it's much faster? Were there other improvements Rufus's was missing? — Matt McManis, Apr 11 '18 at 04:29
While testing I found something. `Test 01` http://rextester.com/PSU72603. `Test 02` http://rextester.com/PQNF4092 If there are extra number/words in parentheses it will not exclude them. However Rufus's works correctly. — Matt McManis, Apr 11 '18 at 04:39
@MattMcManis -- Thats coz of the Regex characters (as I mentioned) -- Back from lunch now-- here's the update. — Prateek Shrivastava, Apr 11 '18 at 05:11

Rufus L · Answer 2 · 2018-04-12T17:45:12.043

Update I modified the sample data to include different types of tests, and modified the RemoveCommonPrefixAndSuffix to handle these new tests.

I found that just comparing the first two strings for a common prefix and suffix can be a mistake if the first two books (or whatever the topic is) begin and/or end with the same words.

For example:

new List<string>()
{
    "The book named Lord of the Rings 2 is a classic.",
    "The book named Lord of the Flies 2 is a classic.",
    "The book named This is pretty is a classic.",                
    "The book named War and Peace is a classic.",
    "The book named The Three Musketeers is a classic.",                
},

Here, if we only compare the first two sentences, we determine that the common prefix is "The book named Lord of the", which is incorrect. We also determine that the common suffix is "2 is a classic.", which is also incorrect.

Here's a solution that addresses this by ensuring that all the sentences have the same prefix and suffix:

public static List<string> RemoveCommonPrefixAndSuffix(List<string> sentences,
    int minSeqenceLength = 2)
{
    if (sentences == null) return null;

    if (sentences.Count < 2 ||
        sentences.Any(s => s.Count(c => c == ' ') < minSeqenceLength - 1))
    {
        return sentences.ToList();
    }

    if (sentences.All(s => s == sentences[0]))
    {
        return sentences.Select(s => string.Empty).ToList();
    }

    var sentenceWords = sentences.Select(s => s.Split()).ToList();
    var firstSentence = sentenceWords[0];
    var length = sentenceWords.Min(s => s.Length);
    var commonPrefix = new StringBuilder();
    var commonSuffix = new StringBuilder();
    var prefixDone = false;
    var suffixDone = false;

    for (var i = 0; i < length && !(prefixDone && suffixDone); i++)
    {
        if (!prefixDone && sentenceWords.All(s => s[i] == firstSentence[i]))
        {
            commonPrefix.Append(firstSentence[i] + " ");
        }
        else
        {
            prefixDone = true;
        }

        if (!suffixDone && sentenceWords.All(s =>
            s[s.Length - i - 1] == firstSentence[firstSentence.Length - i - 1]))
        {
            commonSuffix.Insert(0, firstSentence[firstSentence.Length - i - 1] + " ");
        }
        else
        {
            suffixDone = true;
        }
    }

    var prefix = commonPrefix.ToString().Count(c => c == ' ') >= minSeqenceLength - 1
        ? commonPrefix.ToString()
        : string.Empty;

    var suffix = commonSuffix.ToString().Count(c => c == ' ') >= minSeqenceLength - 1
        ? commonSuffix.ToString()
        : string.Empty;

    var commonLength = prefix.Length + suffix.Length;

    return sentences
        .Select(s => s.Length > commonLength
            ? s.Substring(prefix.Length, s.Length - prefix.Length - suffix.Length)
            : string.Empty)
        .ToList();
}

Here's the method to get the test data:

private static List<List<string>> GetTestSentences()
{
    return new List<List<string>>
    {
        // Prefix-only test
        new List<string>
        {
            "I went to The Home Depot",
            "I went to Walgreens",
            "I went to Best Buy",
        },
        // Suffix-only test
        new List<string>
        {
            "Game of Thrones is a good TV series",
            "Breaking Bad is a good TV series",
            "The Office is a good TV series",
        },
        // Prefix / Suffix test
        new List<string>
        {
            "The basketball team Los Angeles Lakers are my favorite",
            "The basketball team New York Knicks are my favorite",
            "The basketball team Chicago Bulls are my favorite",
        },
        // No prefix or suffix - all sentences are different
        new List<string>
        {
            "I went to The Home Depot",
            "Game of Thrones is a good TV series",
            "The basketball team Los Angeles Lakers are my favorite",
        },
        // All sentences are the same - no "topic" between prefix and suffix
        new List<string>()
        {
            "These sentences are all the same",
            "These sentences are all the same",
            "These sentences are all the same",
        },
        // Some sentences have no content between prefix and suffix
        new List<string>()
        {
            "This sentence has no topic",
            "This sentence [topic here] has no topic",
            "This sentence has no topic",
            "This sentence [another one] has no topic",
        },
        // First two topics have common beginnings
        new List<string>()
        {
            "The book named Lord of the Rings is a classic",
            "The book named Lord of the Flies is a classic",
            "The book named This is pretty is a classic",
            "The book named War and Peace is a classic",
            "The book named The Three Musketeers is a classic",
        },
        // The first two topics have a common ending
        new List<string>
        {
            "The movie named Matrix 2 is very good",
            "The movie named Avatar 2 is very good",
            "The movie named The Sound of Music is very good",
            "The movie named Terminator 2 is very good",
        }
    };
}

Below is the example usage and output. I also included the results from the selected answer, along with some perf benchmarks for speed comparison:

private static void Main()
{
    var sentenceLists = GetTestSentences();
    var padLength = sentenceLists.Max(t => t.Max(s => s.Length)) + 2;
    Console.WriteLine("\nComparison Results\n------------------\n");

    // Rufus' solution
    var sw = Stopwatch.StartNew();
    foreach (var sentenceList in sentenceLists)
    {
        var trimmedSentences = RemoveCommonPrefixAndSuffix(sentenceList);

        for (var j = 0; j < trimmedSentences.Count; j++)
        {
            Console.WriteLine("{0} {1}", sentenceList[j].PadRight(padLength, '.'),
                trimmedSentences[j]);
        }

        Console.WriteLine();
    }
    sw.Stop();

    Console.WriteLine($"Rufus' solution took {sw.ElapsedMilliseconds} ms\n");
    Console.WriteLine(new string('-', Console.WindowWidth));

    // Prateek's solution
    sw.Restart();
    foreach (var sentenceList in sentenceLists)
    {
        var prefix = FindMatchingPattern(sentenceList[0], sentenceList[1], true);
        var suffix = FindMatchingPattern(sentenceList[0], sentenceList[1], false);

        if (prefix.Length > 0) prefix = Regex.Escape(prefix);
        if (suffix.Length > 0) suffix = Regex.Escape(suffix);

        foreach (var item in sentenceList)
        {
            var result = Regex.Replace(item, prefix, string.Empty);
            result = Regex.Replace(result, suffix, string.Empty);
            Console.WriteLine($"{item.PadRight(padLength, '.')} {result}");
        }

        Console.WriteLine();
    }
    sw.Stop();

    Console.WriteLine($"Prateek's solution took {sw.ElapsedMilliseconds} ms\n");
    Console.WriteLine(new string('-', Console.WindowWidth));

    GetKeyFromUser("\nDone!! Press any key to exit...");
}

Output

I am trying to get it working with this layout in Rexter, can you take a look. http://rextester.com/PFZBZG28715 — Matt McManis, Apr 12 '18 at 00:14
Your sentence list is a different format. It was a `List>` in my answer above, and you've changed it to a `List`. So you'll have to either modify the code that uses it, or just include the method that returns the `List>` result. — Rufus L, Apr 12 '18 at 00:20
I added the method to the code above if you want to use it, and I've updated the code for you here: [Title Extractor 04 Test](http://rextester.com/edit/PFZBZG28715) (I also had to add `using System.Text;` to use the `StringBuilder` class). — Rufus L, Apr 12 '18 at 00:31
I see, I can just load an existing List into the `GetTestSentences()`. This is how my project will have to use it. http://rextester.com/TOHRR68098 — Matt McManis, Apr 12 '18 at 01:15
One more thing, is there something you can add to make it return `string.Empty` if all sentences match and it can't extract a Title? I filter the sentences sometimes and ran into this problem. http://rextester.com/ISV17067 — Matt McManis, Apr 12 '18 at 03:02
I've updated the method and the test data to include different types of tests. See the comments above each list of sentences for the testcase they represent. I also included a new image of the output. Both this answer and the other one handle the "normal" cases fine (prefix only, suffix only, prefix and suffix, no prefix or suffix). But my update now handles the new tests as well. — Rufus L, Apr 12 '18 at 17:42
Thanks, I am using both your solution and Prateek's in my project, each for different purposes. — Matt McManis, Apr 13 '18 at 00:33
I added `if (sentences.All(x => x == sentences.First())) return new List();` to prevent the prefix being returned as the Title if all sentences are the same and one word. Before http://rextester.com/GVK38376 After http://rextester.com/VQFAYC20518 This happened when I applied other regex filters. — Matt McManis, Apr 13 '18 at 04:22
Is that a question? The code is behaving as it's written. Read it and try to understand what each of the preconditions do, and modify them accordingly. Feel free to ask if you don't understand one of them. There is still one of the original conditions that checks to see if the sentences contain `minSeqenceLength` words. Maybe you want to remove that one? — Rufus L, Apr 13 '18 at 15:08

Search List of Strings for Matched Words in Sequence

Examples

Solution

2 Answers2