3

I am writing an application that validates some cities. Part of the validation is checking if the city is already in a list by matching the country code and cityname (or alt cityname).

I am storing my existing cities list as:

public struct City
{
    public int id;
    public string countrycode;
    public string name;
    public string altName;
    public int timezoneId;
}

List<City> cityCache = new List<City>();

I then have a list of location strings that contain country codes and city names etc. I split this string and then check if the city already exists.

string cityString = GetCity(); //get the city string
string countryCode = GetCountry(); //get the country string
city = new City();             //create a new city object
if (!string.IsNullOrEmpty(cityString)) //don't bother checking if no city was specified
{
    //check if city exists in the list in the same country 
    city = cityCache.FirstOrDefault(x => countryCode == x.countrycode && (Like(x.name, cityString ) || Like(x.altName, cityString )));
    //if no city if found, search for a single match accross any country
    if (city.id == default(int) && cityCache.Count(x => Like(x.name, cityString ) || Like(x.altName, cityString )) == 1)
        city = cityCache.FirstOrDefault(x => Like(x.name, cityString ) || Like(x.altName, cityString ));
}

if (city.id == default(int))
{
    //city not matched
}

This is very slow for lots of records, as I am also checking other objects like airports and countries in the same way. Is there any way I can speed this up? Is there a faster collection for this kind of comparison than List<>, and is there a faster comparison function that FirsOrDefault()?

EDIT

I forgot to post my Like() function:

bool Like(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
            return s1 == s2;
        if (s1.ToLower().Trim() == s2.ToLower().Trim())
            return true;

        return Regex.IsMatch(Regex.Escape(s1.ToLower().Trim()), Regex.Escape(s2.ToLower().Trim()) + ".");
    }
Random Developer
  • 175
  • 5
  • 17
  • I believe that your biggest performance issue is with the `Like` operator, which is expensive. Can't you simply use a equality comparer? – Andre Calil Jul 24 '12 at 13:06
  • can you show us how you are calling this compare method too – HELP_ME Jul 24 '12 at 13:09
  • I would recommend against doing this in memory for a couple of reasons. First because you're already seeing obvious performance issues with this mechanism, but second because you're holding a lot of information in memory strictly for the purpose of searching it. This is properly suited on the database server and the round-trip expense is very insignificant. – Mike Perrenoud Jul 24 '12 at 13:15
  • Is there a cannonical representation for a City? As M Afifi says, a hashset will be much faster. However, you need to be careful with any slight variations of names that should correspond to the same city. – Rob Jul 24 '12 at 13:18
  • @Mike - I don't see any mention of a database in the question or comments. Why do you assume there's a database attached? – JDB Jul 24 '12 at 13:40
  • I am populating the cityCache from a database table. I just assumed it was quicker to store all the data in memory than keep running queires for each validation check. Is this incorrect? – Random Developer Jul 24 '12 at 13:45
  • @RandomDeveloper You want to do this at the database level - it is suited for what you're trying to do and that's what it does best. – Mike Perrenoud Jul 24 '12 at 13:48
  • The round trip to the database and back represents a constant time. The search algorithm represents a variable time based on the number of elements. For a few elements, the added time of the round trip is going to tip the scales in favor of local search. For many items, the scales go the other way. Which is faster will depend on the number of items you actually have, the amount of time it takes to connect to the database (local web server or web service in Africa?) and the effeciency of your database. – JDB Jul 24 '12 at 17:39
  • @Cyborgx37 I think it's pretty clear by the discussions that there is a lot of data to search through and to make matters worse the Like statement needs to be used (which I understand), so the database query is going to be the most optimal pattern. Finally, the data is already pulled from a table so querying that table is pretty natural for the technology and the field can be indexed properly to provide greater performance. – Mike Perrenoud Jul 24 '12 at 22:45

2 Answers2

1

I would use a HashSet for the CityString and CountryCode. Something like

var validCountryCode = new HashSet<string>(StringComparison.OrdinalIgnoreCase);
if (validCountryCode.Contains(city.CountryCode))
{
}

etc...

Personally I would do all the validation in the constructor to ensure only valid City objects exist.

Other things to watch out for performance

  1. Use HashSet if you're looking it up in a valid list.
  2. Use IEqualityComparer where appropriate, reuse the object to avoid the construction/GC costs.
  3. Use a Dictionary for anything you need to lookup (e.g. timeZoneId)

Edit 1

You're cityCache could be something like,

var cityCache = new Dictionary<string, Dictionary<string, int>>();
var countryCode = "";
var cityCode = "";
var id = x;

public static IsCityValid(City c)
{
     return
         cityCache.ContainsKey(c.CountryCode) &&
         cityCache[c.CountryCode].ContainsKey(c.CityCode) &&
         cityCache[c.CountryCode][c.CityCode] == c.Id;  
}

Edit 2

Didn't think I have to explain this, but based on the comments, maybe.

FirstOrDefault() is an O(n) operation. Essentially everytime you are trying to find a find something in a list, you can either be lucky and it is the first in the list, or unlucky and it is the last, average of list.Count / 2. A dictionary on the other hand will be an O(1) lookup. Using the IEqualtiyComparer it will generate a HashCode() and lookup what bucket it sits in. If there are loads of collisions only then will it use the Equals to find what you're after in the list of things in the same bucket. Even with a poor quality HashCode() (short of returning the same HashCode always) because Dictionary / HashSet use prime number buckets you will split your list up reducing the number of Equalities you need to complete.

So a list of 10 objects means you're on average running LIKE 5 times. A Dictionary of the same 10 objects as below (depending on the quality of the HashCode), could be as little as one HashCode() call followed by one Equals() call.

M Afifi
  • 4,645
  • 2
  • 28
  • 48
  • Note that the code uses the Like operator - this code won't work unless that changes. – Mike Perrenoud Jul 24 '12 at 13:25
  • Easily fixed by having the correct IEqualityComparer passed into the Dictionary. ContainsKey will then match what you're after if implemented correctly. – M Afifi Jul 24 '12 at 13:27
  • And so if you re-implement the Like operation there you're not fixing the performance issue - yeah? This is obviously a very large data set and so doing a Like requires a scan where as there are other options if it's a direct equal - like a binary search and more. If the Like is required - let the database engine do it because it's suited for it. – Mike Perrenoud Jul 24 '12 at 13:32
  • Err... no you're wrong. He's paying the penalty in many ways, worth adding as an edit. – M Afifi Jul 24 '12 at 13:35
  • I decided to use your approach, and use dictionaries. I settled with using Dictionary> so that I can easily search for exact matches on the outer string key (country code) and the inner string key (cityname). If those checks fail I can fall back to my linq search but this can be optimised by only searching the dictionary of the correct country. I decided not to use database queries because as I progress through the checks, I add cities to the cache if they are new. I then add them to the table with a transaction query which I need to be able to rollback. – Random Developer Jul 25 '12 at 08:38
  • @RandomDeveloper just make sure you use the correct IEqualityComparer for strings. For e.g. if you care about culture of case sensitivity or not. – M Afifi Jul 25 '12 at 08:52
0

This sounds like a good candidate for a binary tree.

For binary tree implementations in .NET, see: Objects that represent trees

EDIT:
If you want to search a collection quickly, and that collection is particularly large, then your best option is to sort it and implement a search algorithm based on that sorting.

Binary trees are a good option when you want to search quickly and insert items relatively infrequently. To keep your searches quick, though, you'll need to use a balancing binary tree.

For this to work properly, though, you'll also need a standard key to use for your cities. A numeric key would be best, but strings can work fine too. If you concatenated your city with other information (such as the state and country) you will get a nice unique key. You could also change the case to all upper- or lower-case to get a case-insensitive key.

If you don't have a key, then you can't sort your data. If you can't sort your data, then there's not going to many "quick" options.

EDIT 2:
I notice that your Like function edits your strings a lot. Editing a string is an extremely expensive operation. You would be much better off performing the ToLower() and Trim() functions once, preferably when you are first loading your data. This will probably speed up your function considerably.

Community
  • 1
  • 1
JDB
  • 25,172
  • 5
  • 72
  • 123
  • Why would you use a binary tree to search what's currently an unordered list? – Dan Puzey Jul 24 '12 at 13:40
  • @DanPuzey - The poster says he is building his own list. If he uses a Binary Tree, it would not be unordered. (Or, the user could easily build a binary tree from the unordered list.) – JDB Jul 24 '12 at 13:41
  • @DanPuzey - Would also like to add that you CAN'T "search" with a binary tree. Your question doesn't make sense. (You CAN search a binary tree, but that would mean you'd need to build the tree first.) – JDB Jul 24 '12 at 14:09
  • That is to say, the binary tree is not itself a search algorithm, but a way of organizing the data so that it can be searched. – JDB Jul 24 '12 at 17:34
  • Yes, I know full well what a binary tree is. But if you have a load of values and you want to know whether a certain "test" value is among them (as appears to be the case in the question), why not use a pre-existing .NET data structure such as a [`HashSet`](http://msdn.microsoft.com/en-us/library/bb359438.aspx), which is comparable in performance (when finding a member) and doesn't require any custom implementation? – Dan Puzey Jul 24 '12 at 19:46
  • I'm not sure what implementation HashSet is using under the covers. Probably a binary tree or some such structure. I doubt it's using a Hash - the memory requirement would be too large for a long string. In any case, I'm not advocating a custom implementation - the link above provides complete, fully-functional binary tree implementations. Quick, easy, done... and fast. Still a legit answer - disagree with the downvote. – JDB Jul 24 '12 at 22:04