-2

I have a string for example

string text = "xfoofoobarbar fooxxfoo barxxxfoo";

This string contains 5x foo which is the longest, most appearing repeated sub string with at least 2 characters within this string so it's my desired result.

  • bar appears only 3x so it's not the mostly appearing sub string
  • oo is also 5x within the string but foo is longer - so foo is to prefer
  • XababaY would result into ab which exists 2x (no overlapping, 2x ba is ignored because ab comes first)
  • XaaaaaaaY would result into aa because aa appears 3 times and it has the most repetion.

I would love to show some approaches what I've tried so far but I have honestly no idea where to start. Linq? RegEx?

A hint/approach into the right direction would help me too.

Toshi
  • 2,532
  • 4
  • 17
  • 45
  • Take a look on [this answer](http://stackoverflow.com/a/14670691/2846483). – dymanoid Nov 11 '16 at 12:33
  • This seems much to human to do simply, you are picking and choosing what you get at this point based your preference. If I were a computer and I saw `XababaY` I would return a, because there's three of them, why would I suddenly decide `ab` is better? – Alfie Goodacre Nov 11 '16 at 12:40
  • @AlfieGoodacre I updated my question - at least 2 characters – Toshi Nov 11 '16 at 12:42
  • The idea to start is *suffix algorithms*: *suffix array* (easier to implment), *suffix tree* (a bit faster - `O(N)` v. `O(N*log(N))`) – Dmitry Bychenko Nov 11 '16 at 12:47
  • https://en.wikipedia.org/wiki/Suffix_tree and https://en.wikipedia.org/wiki/Suffix_array – Dmitry Bychenko Nov 11 '16 at 12:54
  • Ive got an implementation of this, but `xfoo` wins, its repeated twice and is longer than `foo`?!? You need to decide if you're ordering by "longest length" or "most occurrences". You cant pick and choose. – Jamiec Nov 11 '16 at 13:07

1 Answers1

2

I would say the first place to start here is to generate a list of all the possible substrings from the input of length 2 to the length of the input:

string text = "xfoofoobarbar fooxxfoo barxxxfoo";
var allSubstrings = Enumerable.Range(2,text.Length)
            .ToDictionary(k => k,v => FindSubStrings(text,v));

...
IEnumerable<string> FindSubStrings(string input, int length)
{
    for(var i=0;i<input.Length-length;i++)
    {
        yield return input.Substring(i,length);
    }
}

Live example: http://rextester.com/ZUR68480

From there it should be as simple as grouping by the substring to get a count, and ordering the result appropriately. But your requirements seem to pick and choose between "longest length" and "most occurrences", you cant have both!

Here is my full implementation, which I should point out chooses xfoo as the winner at present.

public static void Main(string[] args)
{
    string text = "xfoofoobarbar fooxxfoo barxxxfoo";
    var allSubstrings = Enumerable.Range(2,text.Length-2)
        .Select(x => {
                var longestSub = FindSubStrings(text,x).GroupBy(y => y).OrderByDescending(y => y.Count()).FirstOrDefault();
                return new Substrings {
                    Length = x,
                    Count = longestSub.Count(),
                    Value = longestSub.Key
                };
        });

    foreach(var item in allSubstrings)
    {
        Console.WriteLine(item.Length + ":" + item.Count + ":" + item.Value);
    }

    var best = allSubstrings.Where(x => x.Count>1).OrderByDescending(x => x.Length).ThenByDescending(x => x.Count).First();

    Console.WriteLine("Longest, most frequest substring is " + best.Value);
}

public class Substrings
{
    public int Length{get;set;}
    public int Count{get;set;}
    public string Value{get;set;}
}

private static IEnumerable<string> FindSubStrings(string input, int length)
{
    for(var i=0;i<input.Length-length;i++)
    {
        yield return input.Substring(i,length);
    }
}

Live example: http://rextester.com/RJNP55827

Jamiec
  • 133,658
  • 13
  • 134
  • 193