1

I appreciate similar questions have been asked below, but none of the ones I saw went above tens of thousands of lines. These, while 'large' are relatively trivial even if code is quite slow. In my case, I'm dealing with millions to hundreds of millions of lines so even small optimizations could be quite useful.

The answers I've been relying on thus far are here and here. The upgrades I've made are not amazing...

For a broad overview, I have a reference file of approximately 2,000,000 lines (format easting-northing-value) from where I look for the closest value to ProcessedLineData (I cannot guarantee that an exact match exists in my list so I need to compute throughout the list for every instance to find the closest possible point that works).

The first option I had, which was incredibly slow, was:

AsciiFile.Sample _closestSample = Modifications.Value_and_Ref_Calc.ReferenceFile.Dataset[0].Data
                            .OrderBy(t => Public.CalculateDistance(t.Easting, t.Northing, ProcessedLineData.Easting, ProcessedLineData.Northing))
                            .First();

After this, I thought that the second example's Aggregate would work better so I went for this:

AsciiFile.Sample _closestSample = Modifications.Value_and_Ref_Calc.ReferenceFile.Dataset[0].Data
                            .Aggregate((x, y) => 
                              Public.CalculateDistance(x.Easting, x.Northing, ProcessedLineData.Easting, ProcessedLineData.Northing) 
                              < 
                              Public.CalculateDistance(y.Easting, y.Northing, ProcessedLineData.Easting, ProcessedLineData.Northing) 
                              ? x : y);

This is still quite slow (pretty much the same as the previous one). I then thought of using the ToLookup() that's also described in one of the examples but I didn't see what specific benefits it would bring.

In case it is relevant, Public.CalculateDistance is:

public static double CalculateDistance(double _x1, double _y1, double _x2, double _y2)
        {
            return Math.Sqrt(Math.Pow(_x1 - _x2, 2) + Math.Pow(_y1 - _y2, 2));
        }

In these examples, Modifications.Value_and_Ref_Calc.ReferenceFile.Dataset[0].Data is my list of the 2,000,000 reference points.

This would be less important if I didn't have hundreds of millions of ProcessedLineData points to which I am finding the closest values (well, numerous files from small ones with tens of thousands of points up to big ones with tens of millions of points)...

Final purpose: I am finding the closest value so that I would be able to use the elevation associated with the specific Modifications.Value_and_Ref_Calc.ReferenceFile.Dataset[0].Data and use it to modify the value in my ProcessedLineData. The slowest part of this entire sequence is definitely this search for my closest values.

Are there any obvious methods for optimising this code that I am missing?

gktscrk
  • 173
  • 2
  • 12
  • What about a simple loop and comparing the values yourself instead of relying on Linq? It introduces overhead that could cause performance loss – Biesi Oct 16 '19 at 11:03
  • @BiesiGrr: So I could break the loop once I come into an "acceptable" range from my points? – gktscrk Oct 16 '19 at 11:05
  • Not necessarily, it's just less overhead. Also, you're calculating distances multiple times for the same sample – Biesi Oct 16 '19 at 11:49
  • @BiesiGrr: Do you mean that it is multiple times because of the `Aggregate` or in general because of how the LINQ operation works? – gktscrk Oct 16 '19 at 12:03
  • Because of the `Func` you pass to `Aggregate`. `x` is the currently closest sample and its distance is calculated again for each comparison – Biesi Oct 16 '19 at 12:08
  • What is the distribution of lookups versus changes in the `Data` list? Why are you doing `Sqrt` when finding the minimum distance? – NetMage Oct 16 '19 at 22:49
  • Can you assume anything about the distribution of `Northing` and `Easting`? – NetMage Oct 16 '19 at 23:33
  • Also, can you say anything about the precision you have for `Northing`/`Easting` (e.g. number of decimal places)? – NetMage Oct 17 '19 at 00:35
  • @NetMage: Two/three decimal places. The one thing that can be assumed is that both files cover more or less the same area (even if a point is far away from one point in the other file, it should be close to some other point in that second file). – gktscrk Oct 17 '19 at 12:50
  • @NetMage: Square root... It's the formula for distance between two points. – gktscrk Oct 17 '19 at 12:52
  • 1
    Since the square root is only taken on positive integers in the distance function, and square root is monotonic for that range, you don't need the square root when attempting to find the minimum distance between a set of points, so it is just slowing you down. – NetMage Oct 17 '19 at 17:27
  • 1
    I updated my answer with what is probably the best solution for large amounts of lookups. – NetMage Oct 19 '19 at 00:46

2 Answers2

2

Without knowing anything more about the range and precision of the values, or assuming too much about the distribution of lookups versus reference point list changes, some simple optimizations to a for loop yields about a 30x speedup for 100 lookups over the faster OrderBy/First code:

Using pld for your ProcessedLineData and data for Modifications.Value_and_Ref_Calc.ReferenceFile.Dataset[0].Data you get:

var _closestSample = data[0];
var dist = (_closestSample.Easting - pld.Easting) * (_closestSample.Easting - pld.Easting) + (_closestSample.Northing - pld.Northing) * (_closestSample.Northing - pld.Northing);
for (int j2 = 1; j2 < data.Count; ++j2) {
    var y = data[j2];
    var ydist = (y.Easting - pld.Easting) * (y.Easting - pld.Easting) + (y.Northing - pld.Northing) * (y.Northing - pld.Northing);
    if (ydist < dist) {
        dist = ydist;
        _closestSample = y;
    }
}

With my timing over a 2,000,000 entry data list and 100 lookups, the OrderBy/First takes 2.22 seconds and the for takes 0.06 seconds for a 32x speedup.

So, I was sure there was a better way than brute force, and after some research, discovered Morton Codes and Hilbert Curves. A little work generated a SpatialIndex class using the Hilbert Curve and a SpatialIndexMorton class using the Morton Index. I also adjusted the Hilbert Index to only index bits 16-32, which provided the optimal lookups per second. For my data, the Morton Curve is somewhat faster now.

Using the same random data tests, I get that the for method can do 147 lookups per second, the Hilbert Index 5634 lookups per second and the Morton Index 7370 lookups per second over 10,000 lookups with 2,000,000 reference points. Note that the spatial indexes have about 3 seconds of setup time, so for very few lookups it is faster to do brute force with for - I get a break even time around 468 lookups.

In order to make this (somewhat) general purpose, I started with a (C# 8.0) interface for Earth coordinates that provides some helper methods:

public interface ICoordinate {
    double Longitude { get; set; }
    double Latitude { get; set; }

    public ulong MortonCode() {
        float f = (float)Latitude;
        uint ui;
        unsafe { // perform unsafe cast (preserving raw binary)
            float* fRef = &f;
            ui = *((uint*)fRef);
        }
        ulong ixl = ui;

        f = (float)Longitude;
        unsafe { // perform unsafe cast (preserving raw binary)
            float* fRef = &f;
            ui = *((uint*)fRef);
        }
        ulong iyl = ui;

        ixl = (ixl | (ixl << 16)) & 0x0000ffff0000ffffL;
        iyl = (iyl | (iyl << 16)) & 0x0000ffff0000ffffL;

        ixl = (ixl | (ixl << 8)) & 0x00ff00ff00ff00ffL;
        iyl = (iyl | (iyl << 8)) & 0x00ff00ff00ff00ffL;

        ixl = (ixl | (ixl << 4)) & 0x0f0f0f0f0f0f0f0fL;
        iyl = (iyl | (iyl << 4)) & 0x0f0f0f0f0f0f0f0fL;

        ixl = (ixl | (ixl << 2)) & 0x3333333333333333L;
        iyl = (iyl | (iyl << 2)) & 0x3333333333333333L;

        ixl = (ixl | (ixl << 1)) & 0x5555555555555555L;
        iyl = (iyl | (iyl << 1)) & 0x5555555555555555L;

        return ixl | (iyl << 1);
    }

    const int StartBitMinus1 = 31;
    const int EndBit = 16;

    //convert (x,y) to 31-bit Hilbert Index
    public ulong HilbertIndex() {
        float f = (float)Latitude;
        uint x;
        unsafe { // perform unsafe cast (preserving raw binary)
            float* fRef = &f;
            x = *((uint*)fRef);
        }

        f = (float)Longitude;
        uint y;
        unsafe { // perform unsafe cast (preserving raw binary)
            float* fRef = &f;
            y = *((uint*)fRef);
        }

        ulong hi = 0;
        for (int bitpos = StartBitMinus1; bitpos >= EndBit; --bitpos) {
            // extract s'th bit from x & y
            var rx = (x >> bitpos) & 1;
            var ry = (y >> bitpos) & 1;
            hi <<= 2;
            hi += (rx << 1) + (rx ^ ry);

            //rotate/flip a quadrant appropriately
            if (ry == 0) {
                if (rx == 1) {
                    x = ~x;
                    y = ~y;
                }

                //Swap x and y
                uint t = x;
                x = y;
                y = t;
            }
        }
        return hi;
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public double DistanceTo(ICoordinate b) =>
        Math.Sqrt((Longitude - b.Longitude) * (Longitude - b.Longitude) + (Latitude - b.Latitude) * (Latitude - b.Latitude));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public double Distance2To(ICoordinate b) => (Longitude - b.Longitude) * (Longitude - b.Longitude) + (Latitude - b.Latitude) * (Latitude - b.Latitude);

    public ICoordinate MakeNew(double plat, double plong);
}

public static class ICoordinateExt {
    public static ICoordinate Minus(this ICoordinate a, ICoordinate b) =>
        a.MakeNew(a.Latitude - b.Latitude, a.Longitude - b.Longitude);
    public static ICoordinate Plus(this ICoordinate a, ICoordinate b) =>
        a.MakeNew(a.Latitude + b.Latitude, a.Longitude + b.Longitude);
}

Then, a real class to implement the interface (which would be replaced with your real class):

public class PointOfInterest : ICoordinate {
    public double Longitude { get; set; }
    public double Latitude { get; set; }

    public PointOfInterest(double plat, double plong) {
        Latitude = plat;
        Longitude = plong;
    }

    public ICoordinate MakeNew(double plat, double plong) => new PointOfInterest(plat, plong);
}

And a class to convert a IEnumerable<ICoordinate> into a spatially indexed collection of ICoordinate using the Hilbert Curve:

public class SpatialIndex {
    SortedList<ulong, List<ICoordinate>> orderedData;
    List<ulong> orderedIndexes;

    public SpatialIndex(IEnumerable<ICoordinate> data) {
        orderedData = data.GroupBy(d => d.HilbertIndex()).ToSortedList(g => g.Key, g => g.ToList());
        orderedIndexes = orderedData.Keys.ToList();
    }

    public ICoordinate FindNearest(ICoordinate aPoint) {
        var hi = aPoint.HilbertIndex();
        var nearestIndex = orderedIndexes.FindNearestIndex(hi);
        var nearestGuess = orderedData.Values[nearestIndex][0];
        var guessDist = (nearestGuess.Longitude - aPoint.Longitude) * (nearestGuess.Longitude - aPoint.Longitude) + (nearestGuess.Latitude - aPoint.Latitude) * (nearestGuess.Latitude - aPoint.Latitude);
        if (nearestIndex > 0) {
            var tryGuess = orderedData.Values[nearestIndex-1][0];
            var tryDist = (tryGuess.Longitude - aPoint.Longitude) * (tryGuess.Longitude - aPoint.Longitude) + (tryGuess.Latitude - aPoint.Latitude) * (tryGuess.Latitude - aPoint.Latitude);
            if (tryDist < guessDist) {
                nearestGuess = tryGuess;
                guessDist = tryDist;
            }
        }

        var offsetPOI = new PointOfInterest(guessDist, guessDist);
        var minhi = (aPoint.Minus(offsetPOI)).HilbertIndex();
        var minhii = orderedIndexes.FindNearestIndex(minhi);
        if (minhii > 0)
            --minhii;
        var maxhi = (aPoint.Plus(offsetPOI)).HilbertIndex();
        var maxhii = orderedIndexes.FindNearestIndex(maxhi);
        for (int j2 = minhii; j2 < maxhii; ++j2) {
            var tryList = orderedData.Values[j2];
            for (int j3 = 0; j3 < tryList.Count; ++j3) {
                var y = tryList[j3];
                var ydist = (y.Longitude - aPoint.Longitude) * (y.Longitude - aPoint.Longitude) + (y.Latitude - aPoint.Latitude) * (y.Latitude - aPoint.Latitude);
                if (ydist < guessDist) {
                    nearestGuess = y;
                    guessDist = ydist;
                }
            }
        }

        return nearestGuess;
    }
}

And a similar class using the Morton Curve:

public class SpatialIndexMorton {
    SortedList<ulong, List<ICoordinate>> orderedData;
    List<ulong> orderedIndexes;

    public SpatialIndexMorton(IEnumerable<ICoordinate> data) {
        orderedData = data.GroupBy(d => d.MortonCode()).ToSortedList(g => g.Key, g => g.ToList());
        orderedIndexes = orderedData.Keys.ToList();
    }

    public ICoordinate FindNearest(ICoordinate aPoint) {
        var mc = aPoint.MortonCode();
        var nearestIndex = orderedIndexes.FindNearestIndex(mc);
        var nearestGuess = orderedData.Values[nearestIndex][0];
        var guessDist = (nearestGuess.Longitude - aPoint.Longitude) * (nearestGuess.Longitude - aPoint.Longitude) + (nearestGuess.Latitude - aPoint.Latitude) * (nearestGuess.Latitude - aPoint.Latitude);
        if (nearestIndex > 0) {
            var tryGuess = orderedData.Values[nearestIndex-1][0];
            var tryDist = (tryGuess.Longitude - aPoint.Longitude) * (tryGuess.Longitude - aPoint.Longitude) + (tryGuess.Latitude - aPoint.Latitude) * (tryGuess.Latitude - aPoint.Latitude);
            if (tryDist < guessDist) {
                nearestGuess = tryGuess;
                guessDist = tryDist;
            }
        }

        var offsetPOI = new PointOfInterest(guessDist, guessDist);
        var minmc = (aPoint.Minus(offsetPOI)).MortonCode();
        var minmci = orderedIndexes.FindNearestIndex(minmc);
        if (minmci > 0)
            --minmci;
        var maxmc = (aPoint.Plus(offsetPOI)).MortonCode();
        var maxmci = orderedIndexes.FindNearestIndex(maxmc);
        for (int j2 = minmci; j2 < maxmci; ++j2) {
            var tryList = orderedData.Values[j2];
            for (int j3 = 0; j3 < tryList.Count; ++j3) {
                var y = tryList[j3];
                var ydist = (y.Longitude - aPoint.Longitude) * (y.Longitude - aPoint.Longitude) + (y.Latitude - aPoint.Latitude) * (y.Latitude - aPoint.Latitude);
                if (ydist < guessDist) {
                    nearestGuess = y;
                    guessDist = ydist;
                }
            }
        }

        return nearestGuess;
    }
}

And some helper extension methods:

public static class ListExt {
    public static int FindNearestIndex<T>(this List<T> l, T possibleKey) {
        var keyIndex = l.BinarySearch(possibleKey);
        if (keyIndex < 0) {
            keyIndex = ~keyIndex;
            if (keyIndex == l.Count)
                keyIndex = l.Count - 1;
        }
        return keyIndex;
    }
}

public static class IEnumerableExt {
    public static SortedList<TKey, TValue> ToSortedList<T, TKey, TValue>(this IEnumerable<T> src, Func<T, TKey> keySelector, Func<T, TValue> valueSelector) =>
        new SortedList<TKey, TValue>(src.ToDictionary(keySelector, valueSelector));    
}

Finally, some sample code using this, with your references in data, and your lookup values in plds:

var hilbertIndex = new SpatialIndex(data);
var ans = new (ICoordinate, ICoordinate)[lookups];
for (int j1 = 0; j1 < lookups; ++j1) {
    ICoordinate pld = plds[j1];
    ans[j1] = (pld, hilbertIndex.FindNearest(pld));
}

UPDATE: I modified the find nearest algorithm to take the closest of the point above and below the target point on the index instead of just trying the one above. This provides another decent speedup.

NetMage
  • 26,163
  • 3
  • 34
  • 55
0

You could try reducing the range of each search by using a grid to divide the plane to equal squares. Each element would be stored into the bucket of the appropriate square. Then you could use this grid to perform searches, by starting from the square that contains the searched point, and spiraling outwards until you find one or more populated squares in the perimeter of the spiral. This is a naive algorithm, but can perform surprisingly well under specific conditions:

  1. The distribution of the elements in the plane must be random (or close to random). If most elements are crowded in a few populated areas, the performance will suffer.

  2. The searched points will be in the general populated area of the map. Searching far-far away, or inside vast unpopulated areas, could take forever.

Configuring correctly the size of the grid's squares is important. Too large or too small squares can equally hurt the performance. Too small squares will also increase the memory consumption, because each populated square requires its own bucket. Under perfect conditions the algorithm can perform more than x1,000 faster compared to a simple loop.

public class SpatialDictionary<T> : IEnumerable<T>
{
    private readonly Dictionary<(int, int), List<T>> _dictionary;
    private readonly double _squareSize;
    private readonly Func<T, (double, double)> _locationSelector;
    private int _count;

    public int Count => _count;

    public SpatialDictionary(
        double squareSize, Func<T, (double, double)> locationSelector)
    {
        if (squareSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(squareSize));
        _squareSize = squareSize;
        _locationSelector = locationSelector
            ?? throw new ArgumentNullException(nameof(locationSelector));
        _dictionary = new Dictionary<(int, int), List<T>>();
    }

    public void Add(T item)
    {
        var (itemX, itemY) = _locationSelector(item);
        int keyX = checked((int)(itemX / _squareSize));
        int keyY = checked((int)(itemY / _squareSize));
        if (!_dictionary.TryGetValue((keyX, keyY), out var bucket))
        {
            bucket = new List<T>(1);
            _dictionary.Add((keyX, keyY), bucket);
        }
        bucket.Add(item);
        _count++;
    }

    public T FindClosest(double x, double y)
    {
        if (_count == 0) throw new InvalidOperationException();
        int keyX = checked((int)(x / _squareSize));
        int keyY = checked((int)(y / _squareSize));
        double minDistance = Double.PositiveInfinity;
        T minItem = default;
        int radius = 0;
        while (true)
        {
            checked { radius++; }
            foreach (var square in GetSquares(keyX, keyY, radius))
            {
                if (!_dictionary.TryGetValue(square, out var bucket)) continue;
                foreach (var item in bucket)
                {
                    var (itemX, itemY) = _locationSelector(item);
                    var distX = x - itemX;
                    var distY = y - itemY;
                    var distance = Math.Abs(distX * distX + distY * distY);
                    if (distance < minDistance)
                    {
                        minDistance = distance;
                        minItem = item;
                    }
                }
            }
            if (minDistance != Double.PositiveInfinity) return minItem;
        }
    }

    private IEnumerable<(int, int)> GetSquares(int x, int y, int radius)
    {
        if (radius == 1) yield return (x, y);
        for (int i = -radius; i < radius; i++)
        {
            yield return checked((x + i, y + radius));
            yield return checked((x - i, y - radius));
            yield return checked((x + radius, y - i));
            yield return checked((x - radius, y + i));
        }
    }

    public IEnumerator<T> GetEnumerator()
        => _dictionary.Values.SelectMany(b => b).GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Usage example:

var spatialData = new SpatialDictionary<DataRow>(squareSize: 10.0,
    dr => (dr.Field<double>("Easting"), dr.Field<double>("Northing")));

foreach (DataRow dataRow in dataTable.Rows)
{
    spatialData.Add(dataRow);
}

DataRow result = spatialData.FindClosest(100.0, 100.0);
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104