0

I have a curious conundrum, I have a list of data, (for this case) it may look something like:

  • Company XXXXXXX
  • YYYYYY Incorporated
  • Comp ZZ Inc.
  • Com AAA BB
  • StackOverflow
  • Stack Overflow
  • (etc...)

I would like to incorporate a minimized version onto the end of a series of Worksheets in Excel, say something like this(Remember the max size of a tab is 31 characters):

  • Planning Setup Sheet XXXXXXX
  • Planning Setup Sheet YYYYYY
  • Planning Setup Sheet ZZ
  • Planning Setup Sheet AAA BB
  • Planning Setup Sheet StackOver
  • Planning Setup Sheet Stack Ove

Now this is easy to do by hand, but very difficult to accomplish in an automated fashion, how does one recognize what is important in a string vs what can be removed? Clearly various characters and groupings of characters have varying levels of significance, how can one define that distribution and specific a cutoff?

This almost strikes me as something that would be a good candidate for a NLP type algorithm, or maybe generate a very large list and train a neural network to minimize strings, but I'd rather not overkill it, I would hope there'd be an easier way to go about this...

I feel as thought this is a classic problem, but I don't see any mention to it while googling around, typical solutions to similar problem usually revolve around just generating some random string. Maybe I'm searching with the wrong terms...

Does anyone have a easy way to minimize a string without losing is meaningfulness?

abatishchev
  • 98,240
  • 88
  • 296
  • 433
David Rogers
  • 2,601
  • 4
  • 39
  • 84
  • Unless there's a discernible pattern, you're kind of sunk. Can you modify your list of data to conform to some pattern that you can use for restructuring? – itsme86 May 08 '18 at 21:20
  • @itsme86 I would love to but unfortunately no, the data comes from a subsystem I cannot control. Typically the data is already short, so this is not a problem with all names, but it does occur with a few. – David Rogers May 08 '18 at 21:22
  • Are all of the worksheets going to start with `Planning Setup Sheet`? Or will the workbook contain other sheets? – Joseph May 08 '18 at 21:38
  • Most certainly it will have others, and `Planning Setup Sheet` is just an example of a prefix, I might adjust that to be shorter depending on the solution, but no matter what I'd prefix it with I'm still limited to a fairly small character space. I just use that in my example because it easily allows me to communicates the issue in an accurate manner. – David Rogers May 08 '18 at 21:58

2 Answers2

1

One idea would be to have a set of common words that you can rule out, then go through your data, strip out the common words, add your prefix and cut off the rest.

Here's a sample:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        var data = new List<string>
        {
            "Company XXXXXXX",
            "YYYYYY Incorporated",
            "Comp ZZ Inc.",
            "Com AAA BB",
            "StackOverflow",
            "Stack Overflow",
        };

        foreach (var worksheetName in GetWorksheetNames(data))
        {
            Console.WriteLine(worksheetName);
        }
    }

    private static IEnumerable<string> GetWorksheetNames(IEnumerable<string> data)
    {
        const string worksheetPrefix = "Planning Setup Sheet";
        const int maxWorksheetLength = 31;

        var commonWords = new List<string>
        {
            "Com",
            "Comp",
            "Company",
            "Inc",
            "Inc.",
            "Incorporated",
        };

        foreach (var item in data)
        {
            var acceptedWords = new List<string>
            {
                worksheetPrefix
            };

            var words = item.Split(' ');

            acceptedWords.AddRange(words.Where(word => !commonWords.Contains(word, StringComparer.OrdinalIgnoreCase)));

            var initialWorksheetName = string.Join(" ", acceptedWords);
            var finalWorksheetName = initialWorksheetName.Substring(0, Math.Min(initialWorksheetName.Length, maxWorksheetLength));

            yield return finalWorksheetName;
        }
    }
}

I'm sure you can squeeze out some performance with some tweaks here and there.

Also, another solution would be to just use whatever you have up to 31 characters, cut off the rest, and then when you create the workbook have a "table of contents" worksheet with the full name of the actual worksheet and have it hyperlink to the sheet. That might help with the end user experience.

Joseph
  • 5,070
  • 1
  • 25
  • 26
  • :) The hard part comes with guessing what the common words are, I've tried that in the past and have been wrong, the table of contents idea I'm already doing, though I see that more as an ancillary way to address the issue. Even when you cut off the common words, there are many scenario's where the string is too long and needs to be cut down further, the problem is when you cut into that part too, where does cutting lose the meaningfulness, what if cutting creates duplicates? I see it as an easy problem to communicate, but very tricky to solve for most common cases... – David Rogers May 09 '18 at 16:24
  • @DavidRogers as I was writing the answer I had a feeling that you might have given it a shot already, but I figured I'd write it down anyway for others to benefit as well. I like your idea about using some kind of deep learning, but I imagine you would need a lot of help with the input data for the AI. Still, it could be a fun project to try :) – Joseph May 09 '18 at 17:09
0

I have come up with several ways to do this, but I'm not satisfied with any of them... Here would be one example of a way I could go about doing this:

public List<Tuple<string, string>> GetUniqueNameOfCharacterXFromList(List<string> input, int maxStringSize = 8) {
    var output = input.Select(a => new Tuple<string, string>(a, a)).ToList();

    while (output.Select(a => a.Item2.ToCharArray().Length).Max() > maxStringSize) {
        var letter = output
            .Where(a => a.Item2.ToCharArray().Length > maxStringSize)
            .SelectMany(a => a.Item2.ToCharArray().Distinct())
            .GroupBy(a => a)
            .Select(a => new { letter = a.Key, count = a.Count() })
            .OrderByDescending(a => a.letter == ' ')
            .ThenByDescending(a => a.count)
            .First().letter.ToString();

        output = output.Select(a => new Tuple<string, string>(a.Item1, (a.Item2.ToCharArray().Length > maxStringSize ? a.Item2.Replace(letter, "") : a.Item2))).ToList();
    }

    return output;
}
David Rogers
  • 2,601
  • 4
  • 39
  • 84