I have a list with around 600k strings that I've put in a static function like this:

public static HashSet<string> All()
{
    return new HashSet<string>
    {
        "entry 1", 
        "entry 2",
         ...
         "entry 600000"
    };
}

The application takes a very long time to start, and the first use of the HashSet is also very slow. Is there a better way to do this?

Theodor Zoulias
ekenman
  • If your issue is the initial startup time then try loading in the strings from an external source using an HTTP request after loading the application. – Ethicist Dec 16 '20 at 16:54
  • Why do you need a hashset with 600k items? How long is your file with this code? :)) – Pavel Anikhouski Dec 16 '20 at 16:55
  • I'm open to better suggestions. Would it be faster to build the HashSet from a file on disk instead? – ekenman Dec 16 '20 at 16:59
  • The list contains names that I need to remove in order to de-identify information in another field. – ekenman Dec 16 '20 at 17:00
  • You could serialize it to disk, perhaps with a separate program, and then deserialize that to load it. If loading it doesn't seem to work: [Serializing a HashSet](https://stackoverflow.com/q/4192905/1115360). – Andrew Morton Dec 16 '20 at 17:02
  • That function also doesn't cache the value: every call builds a new HashSet. Make sure you aren't doing all that work multiple times. – McAden Dec 16 '20 at 17:13
  • @ekenman this seems like an X-Y problem. Why do you need 600k names to de-identify information from another field? Can you explain why you need that, or what you mean by de-identifying information? – johnny 5 Dec 16 '20 at 17:22
  • Does it make any difference if you start the program without the debugger attached? (Ctrl+F5) – Theodor Zoulias Dec 16 '20 at 18:00
  • Without having much detail to go on, you may be able to approximate hash-like lookup performance by replacing your HashSet with a pre-sorted string[] and a binary search. Loading the array at startup should be much faster than loading the HashSet. – glenebob Dec 16 '20 at 18:05
  • @glenebob A binary search of a string array, while possibly using slightly less memory, will be slower than a HashSet lookup. With a background load of the city data, startup performance is a non-issue. – jjxtra Dec 16 '20 at 18:25
  • A quick test shows that loading a HashSet with 600K strings actually performs pretty well - about a second on my machine. OP will need to investigate further to pinpoint what is actually causing such a long startup time. – glenebob Dec 16 '20 at 19:20
  • @johnny5 What I want to do is change: "My name is Tomas" => "My name is [Masked name]" – ekenman Dec 16 '20 at 20:43
  • Having that much C# code and data inline will bloat the assembly and increase startup time. For large data sets, an external file is the right choice. – jjxtra Dec 16 '20 at 21:27
  • Weird, it seems like you should be constructing the message from user data rather than trying to scrub it, e.g. $"My name is { this.MaskName(user.Firstname)}". That way you don't need to store 600k names (there are ones you're probably missing anyway). If for some reason you do need to scrub names out of messages, you should reevaluate how you're handling it. – johnny 5 Dec 17 '20 at 16:01
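
For context, the masking described in the comments ("My name is Tomas" => "My name is [Masked name]") could look roughly like the sketch below. The class name NameMasker, the method Mask, and the choice to treat every run of letters as one candidate word are assumptions made for illustration; the thread does not show how the asker actually tokenizes the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the thread.
public static class NameMasker
{
    // Replaces every run of letters that is found in `names` with "[Masked name]".
    // Assumption: a "word" is any maximal run of Unicode letters (\p{L}+), so
    // multi-word names and names containing hyphens or apostrophes are not handled.
    public static string Mask(string text, HashSet<string> names)
    {
        return Regex.Replace(
            text,
            @"\p{L}+",
            match => names.Contains(match.Value) ? "[Masked name]" : match.Value);
    }
}
```

Whether the match is case-sensitive depends on the comparer the set was built with, for example new HashSet<string>(entries, StringComparer.OrdinalIgnoreCase). Each word costs one hash lookup, so the size of the name list has little effect on the masking itself.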

2 Answers

This is the kind of thing you can kick off as a background task when your application starts, which removes its effect on startup time.

Save this as your Program.cs file. It should run very quickly and have essentially no impact on startup time.

You will need to create a UTF-8 encoded text file containing all your cities, one city per line. In this example I assume it is in the application root directory and is called data.txt, but you can change the path in the example code.

You can use MainNamespace.MainClass.CityExists(value) anywhere in your application to check if an entry exists.

Obviously this is crude but effective; you could refactor it into a city service class or something similar.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

namespace MainNamespace
{
    public static class MainClass
    {
        private const string cityFilePath = "./data.txt"; // change to your correct path
        
        // volatile so that other threads see the set as soon as the background task assigns it
        private static volatile HashSet<string> cities;
        
        // use this method for city lookup checks
        public static bool CityExists(string value)
        {
            while (cities is null)
            {
                // this is unlikely to trigger, the city data will probably be loaded before your first city query, but just in case...
                System.Threading.Thread.Sleep(20);
            }
            return cities.Contains(value);
        }
        
        public static void Main()
        {
            // fire and forget: the load runs on the thread pool while startup continues
            _ = Task.Run(async () =>
            {
                // no error handling here, you would likely want to try/catch this and do something appropriate if exception is thrown
                // (if the load fails, cities stays null and CityExists will wait forever)
                HashSet<string> hashSet = (await File.ReadAllLinesAsync(cityFilePath)).ToHashSet(StringComparer.OrdinalIgnoreCase);
                cities = hashSet;
            });
            
            // go on executing the rest of your application, show main form, etc.
        }
    }
}
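
One possible variation, as a minimal sketch rather than tested production code: keep the Task that does the loading and wait on it in the lookup, instead of polling with Thread.Sleep. That way a failed load surfaces as an exception on the first lookup instead of an endless wait. The names NameLookup, BeginLoad and Exists are made up for this example, and the file format is assumed to be the same as above (UTF-8 text, one entry per line).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical class and member names, used only for illustration.
public static class NameLookup
{
    // Holds the in-flight (or completed) background load.
    private static Task<HashSet<string>> loadTask;

    // Call once at startup. Returns immediately; the file is read on a thread-pool thread.
    public static void BeginLoad(string path)
    {
        loadTask = Task.Run(() =>
            new HashSet<string>(File.ReadLines(path), StringComparer.OrdinalIgnoreCase));
    }

    // Blocks only if the load has not finished yet.
    // If the load failed, the original exception is rethrown here.
    public static bool Exists(string value)
    {
        if (loadTask is null)
        {
            throw new InvalidOperationException("Call BeginLoad before Exists.");
        }
        return loadTask.GetAwaiter().GetResult().Contains(value);
    }
}
```

Usage would be NameLookup.BeginLoad("./data.txt") at the top of Main and NameLookup.Exists(value) wherever a lookup is needed. Because the set is built once and then reused, this also avoids rebuilding it on every call.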
jjxtra
  • This approach should take even longer than OP's approach to load the HashSet. – glenebob Dec 16 '20 at 18:07
  • Not necessarily, the large amount of hard-coded entries in the code will increase the assembly size significantly, impacting startup time, along with making it impossible to dynamically update the list. Loading from a file can be done in a background task, removing impact to startup time. – jjxtra Dec 16 '20 at 18:11
  • @ekenman let me know if this works for you, should be 0 impact to startup time – jjxtra Dec 16 '20 at 18:23
  • @jjxtra Awesome. Loading takes less than half a second. – ekenman Dec 16 '20 at 20:38
  • Definitely proves that having such a large C# code file was impacting assembly size and startup time :) – jjxtra Dec 16 '20 at 21:26

I have something similar: a list of cities.

I keep all the values in an external CSV file and read it from my program at runtime. In my experience that is much faster.
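
As a rough sketch of what reading such a file could look like: the layout of the CSV is not described here, so the code below assumes the value sits in the first column, that fields are separated by commas, and that no field is quoted or contains a comma. CsvNames and Load are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, used only for illustration.
public static class CsvNames
{
    // Reads the first column of every non-empty line into a case-insensitive set.
    // Assumption: simple comma-separated lines without quoted fields.
    public static HashSet<string> Load(string path)
    {
        IEnumerable<string> values = File.ReadLines(path)
            .Select(line => line.Split(',')[0].Trim())
            .Where(value => value.Length > 0);

        return new HashSet<string>(values, StringComparer.OrdinalIgnoreCase);
    }
}
```

File.ReadLines streams the file line by line, so the whole file does not have to be held in a string array before the set is built. A file with quoted fields would need a real CSV parser instead of Split.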

Maxime V