3

I am trying to build an efficient algorithm that can process thousands of rows of data that contains zip codes of customers. I would then want to cross check those zip codes against a grouping of around 1000 zip codes, but I have about 100 columns of 1000 zip codes. A lot of these zip codes are consecutive numbers, but there is also a lot of random zip codes thrown in there. So what I would like to do is group consecutive zip codes together that I can then just check to see if the zip code falls within that range instead of checking it against every single zip code.

Sample data -

90001
90002
90003
90004
90005
90006
90007
90008
90009
90010
90012
90022
90031
90032
90033
90034
90041

This should be grouped as follows:

{ 90001-90010, 90012, 90022, 90031-90034, 90041 }

Here's my idea for the algorithm:

public struct gRange {
   public int start, end;

   public gRange(int a, int b) {
      start = a;
      if(b != null) end = b;
      else end = a;
   }
}

function groupZips(string[] zips){
    List<gRange> zipList = new List<gRange>();
    int currZip, prevZip, startRange, endRange;
    startRange = 0;

    bool inRange = false;

    for(int i = 1; i < zips.length; i++) {
        currZip = Convert.ToInt32(zips[i]);
        prevZip = Convert.ToInt32(zips[i-1]);

        if(currZip - prevZip == 1 && inRange == false) {
            inRange = true;
            startRange = prevZip;
            continue;
        }
        else if(currZip - prevZip == 1 && inRange == true) continue;
        else if(currZip - prevZip != 1 && inRange == true) {
            inRange = false;
            endRange = prevZip;
            zipList.add(new gRange(startRange, endRange));
            continue;
        }
        else if(currZip - prevZip != 1 && inRange == false) {
            zipList.add(new gRange(prevZip, prevZip));
        }
        //not sure how to handle the last case when i == zips.length-1
    }
}

So as of now, I am unsure of how to handle the last case, but looking at this algorithm, it doesn't strike me as efficient. Is there a better/easier way to be sorting a group of numbers like this?

Adjit
  • 10,134
  • 12
  • 53
  • 98
  • I would just put a check after the loop to handle closing off the last range. – juharr Feb 01 '16 at 16:31
  • 1
    I'd suggest, rather than creating a list and fiddling with ranges, you create a HashSet with your zip codes and simply use the Contains method to check whether a given zip exists in the set. HashSet.Contains is very efficient. – sschimmel Feb 01 '16 at 16:46
  • 1
    Are these zip codes always in the increasing order? If yes, the solution is fairly easy - you just walk them and figure out the ranges. If not, use hashtables or binary arrays. – Огњен Шобајић Feb 01 '16 at 19:03
  • As @ОгњенШобајић mentioned, if the codes are already sorted, then the solution is `O(n)`. – vgru Feb 01 '16 at 20:00

3 Answers3

2

Here is a O(n) solution even if your zip codes are not guaranteed to be in order.

If you need the output groupings to be sorted, you can't do any better than O(n*log(n)) because somewhere you'll have to sort something, but if grouping the zip codes is your only concern and sorting the groups isn't required then I'd use an algorithm like this. It makes good use of a HashSet, a Dictionary, and a DoublyLinkedList. To my knowledge this algorithm is O(n), because I believe that a HashSet.Add() and HashSet.Contains() are performed in constant time.

Here is a working dotnetfiddle

// I'm assuming zipcodes are ints... convert if desired
// jumbled up your sample data to show that the code would still work
var zipcodes = new List<int>
{
    90012,
    90033,
    90009,
    90001,
    90005,
    90004,
    90041,
    90008,
    90007,
    90031,
    90010,
    90002,
    90003,
    90034,
    90032,
    90006,
    90022,
};

// facilitate constant-time lookups of whether zipcodes are in your set
var zipHashSet = new HashSet<int>();

// lookup zipcode -> linked list node to remove item in constant time from the linked list
var nodeDictionary = new Dictionary<int, DoublyLinkedListNode<int>>();

// linked list for iterating and grouping your zip codes in linear time
var zipLinkedList = new DoublyLinkedList<int>();

// initialize our datastructures from the initial list
foreach (int zipcode in zipcodes)
{
    zipLinkedList.Add(zipcode);
    zipHashSet.Add(zipcode);
    nodeDictionary[zipcode] = zipLinkedList.Last;
}

// object to store the groupings (ex: "90001-90010", "90022")
var groupings = new HashSet<string>();

// iterate through the linked list, but skip nodes if we group it with a zip code
// that we found on a previous iteration of the loop
var node = zipLinkedList.First;
while (node != null)
{
    var bottomZipCode = node.Element;
    var topZipCode = bottomZipCode;

    // find the lowest zip code in this group
    while (zipHashSet.Contains(bottomZipCode - 1))
    {
        var nodeToDel = nodeDictionary[bottomZipCode - 1];

        // delete node from linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }
        // see if previous zip code is in our group, too
        bottomZipCode--;
    }
    // get string version zip code bottom of the range
    var bottom = bottomZipCode.ToString();

    // find the highest zip code in this group
    while (zipHashSet.Contains(topZipCode + 1))
    {
        var nodeToDel = nodeDictionary[topZipCode + 1];

        // delete node from linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }

        // see if next zip code is in our group, too
        topZipCode++;
    }

    // get string version zip code top of the range
    var top = topZipCode.ToString();

    // add grouping in correct format
    if (top == bottom)
    {
        groupings.Add(bottom);
    }
    else
    {
        groupings.Add(bottom + "-" + top);
    }

    // onward!
    node = node.Next;
}


// print results
foreach (var grouping in groupings)
{
    Console.WriteLine(grouping);
}

** a small refactoring of the common linked list node deletion logic is in order

If Sorting is Required

A O(n*log(n)) algorithm is much simpler, because once you sort your input list the groups can be formed in one iteration of the list with no additional data structures.

Frank Bryce
  • 8,076
  • 4
  • 38
  • 56
1

I believe you are overthinking this one. Just using Linq against an IEnumerable can search 80,000+ records in less than 1/10 of a second.

I used the free CSV zip code list from here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv

using System;
using System.IO;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Linq;
using System.Text;

namespace ZipCodeSearchTest
{
    struct zipCodeEntry
    {
        public string ZipCode { get; set; }
        public string City { get; set; }
    }
    class Program
    {
        static void Main(string[] args)
        {
            List<zipCodeEntry> zipCodes = new List<zipCodeEntry>();

            string dataFileName = "free-zipcode-database.csv";
            using (FileStream fs = new FileStream(dataFileName, FileMode.Open, FileAccess.Read))
            using (StreamReader sr = new StreamReader(fs))
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine();
                    string[] lineVals = line.Split(',');
                    zipCodes.Add(new zipCodeEntry { ZipCode = lineVals[1].Trim(' ', '\"'), City = lineVals[3].Trim(' ', '\"') });
                }

            bool terminate = false;
            while (!terminate)
            {
                Console.WriteLine("Enter zip code:");
                var userEntry = Console.ReadLine();
                if (userEntry.ToLower() == "x" || userEntry.ToString() == "q")
                    terminate = true;
                else
                {
                    DateTime dtStart = DateTime.Now;
                    foreach (var arrayVal in zipCodes.Where(z => z.ZipCode == userEntry.PadLeft(5, '0')))
                        Console.WriteLine(string.Format("ZipCode: {0}", arrayVal.ZipCode).PadRight(20, ' ') + string.Format("City: {0}", arrayVal.City));
                    DateTime dtStop = DateTime.Now;
                    Console.WriteLine();
                    Console.WriteLine("Lookup time: {0}", dtStop.Subtract(dtStart).ToString());
                    Console.WriteLine("\n\n");
                }
            }
        }
    }
}
Mike U
  • 709
  • 6
  • 11
0

In this particular case, it is quite possible that a hash will be faster. However, the range-based solution will use a lot less memory, so it would be appropriate if your lists were very large (and I'm not convinced that there are enough possible zipcodes for any list of zipcodes to be large enough.)

Anyway, here's a simpler logic for making the range list and finding if a target is in a range:

  1. Make ranges a simple list of integers (or even zipcodes), and push the first element of zip as its first element.

  2. For each element of zip except the last one, if that element plus one is not the same as the next element, add both that element plus one and the next element to ranges.

  3. Push one more than the last element of zip at the end of `ranges.

Now, to find out if a zipcode is in ranges, do a binary search into ranges for the smallest element which is greater than the target zipcode. [Note 1] If the index of that element is odd, then the target is in one of the ranges, otherwise it isn't.


Notes:

AIUI, the BinarySearch method on a C# list returns the index of the element found or the complement of the index of the first larger element. To get the result needed by the suggested algorithm, you could use something like index >= 0 ? index + 1 : ~index, but it might be simpler to just search for the zipcode one less than the target and then use the complement of the low-order bit of the result.

Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341