
Scenario: I want to write my own autocomplete API for addresses, just like the one Google offers (very basic: street, house number, city, postcode, country). It is intended for private use and training purposes only. I want to cover about 1 million addresses for a start.

Technology used: .NET Framework (not Core), C#, Visual Studio, OsmSharp, Microsoft SQL Server, Web API 2 (although I will probably switch to ASP.NET Core in the future).

Approach:

  • Set up the project (Web API 2, or a console project for demo purposes)
  • Download the relevant file from OpenStreetMap using DownloadClient() (https://download.geofabrik.de/)
  • Read the file with OsmSharp and filter out the relevant data.
  • Convert the filtered data to a DataTable.
  • Feed the DataTable to SqlBulkCopy to import the data into the database.

Problem: Step 4 is taking way too long. For a file like "Regierungsbezirk Köln" in osm.pbf format, which is about 160 MB (the uncompressed .osm file is about 2.8 GB), we're talking about 4-5 hours. I want to optimize this. The bulk copy of the DataTable into the database (about 1 million rows), on the other hand, takes just about 5 seconds. (Woah. Amazing.)
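
For reference, the bulk-copy step that is already fast looks roughly like the sketch below. This is not the exact project code; the connection string handling and the target table name "dbo.Addresses" are placeholders.

// Rough sketch only; requires System.Data and System.Data.SqlClient.
public static void bulkCopy_toDatabase(DataTable addresses, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            // Target table name is an assumption for this sketch.
            bulkCopy.DestinationTableName = "dbo.Addresses";
            bulkCopy.BatchSize = 10000;

            // Map the DataTable columns to identically named destination columns.
            foreach (DataColumn column in addresses.Columns)
            {
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            }

            bulkCopy.WriteToServer(addresses);
        }
    }
}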

Minimal Reproduction: https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction

What I tried:

  • Use a stored procedure in SQL Server. This comes with a whole different set of problems, and I didn't quite manage to get it working (mainly because the uncompressed .osm file is over 2 GB and SQL Server doesn't like that).

  • Come up with a different approach to filter and convert the data from the file to a DataTable (or CSV).

  • Use the Overpass API, although I read somewhere that the Overpass API is not intended for data sets above 10,000 entries.

  • Ask the Jedi Grandmasters on Stack Overflow for help. (Currently in process ... :D)

Code Extract:

// Reads an .osm.pbf file with OsmSharp, filters it down to ways that carry address
// tags and converts the result into a DataTable for the bulk copy.
public static DataTable getDataTable_fromOSMFile(string FileDownloadPath)
{
    Console.WriteLine("Finished Downloading. Reading File into Stream...");

    using (var fileStream = new FileInfo(FileDownloadPath).OpenRead())
    {
        PBFOsmStreamSource source = new PBFOsmStreamSource(fileStream);

        if (source.Any() == false)
        {
            return new DataTable();
        }

        Console.WriteLine("Finished Reading File into Stream. Filtering and Formatting RawData to Addresses...");
        Console.WriteLine();

        // Keep only ways that have an "addr:street" tag, project them to Address
        // objects and drop duplicates before building the DataTable.
        DataTable dataTable = convertAdressList_toDataTable(
                    source.Where(x => x.Type == OsmGeoType.Way && x.Tags.Count > 0 && x.Tags.ContainsKey("addr:street"))
                    .Select(Address.fromOSMGeo)
                    .Distinct(new AddressComparer())
                );

        return dataTable;
    }
}
// Copies the filtered addresses into a DataTable row by row so it can be fed to SqlBulkCopy.
private static DataTable convertAdressList_toDataTable(IEnumerable<Address> addresses)
{
    DataTable dataTable = new DataTable();

    if (addresses.Any() == false)
    {
        return dataTable;
    }

    dataTable.Columns.Add("Id");
    dataTable.Columns.Add("Street");
    dataTable.Columns.Add("Housenumber");
    dataTable.Columns.Add("City");
    dataTable.Columns.Add("Postcode");
    dataTable.Columns.Add("Country");

    Int32 counter = 0;

    Console.WriteLine("Finished Filtering and Formatting. Writing Addresses From Stream to a DataTable Class for the Database-SQLBulkCopy-Process ");

    foreach (Address address in addresses)
    {
        // The running counter doubles as a surrogate Id column.
        dataTable.Rows.Add(counter + 1, address.Street, address.Housenumber, address.City, address.Postcode, address.Country);
        counter++;

        // Log progress every 10,000 rows.
        if (counter % 10000 == 0)
        {
            Console.WriteLine("Wrote " + counter + " Rows From Stream to DataTable.");
        }
    }

    return dataTable;
}
  • Did you already try to use a profiler in order to determine your bottleneck(s)? – scai Jan 10 '20 at 11:18
  • You'd need to show us your code for how you generate your `DataTable`, as that is where your bottleneck is. See this answer: https://stackoverflow.com/a/564373/3181 – Stuart Grassie Jan 10 '20 at 11:41
  • All of the relevant code is in the project on GitHub. I don't know if it's appreciated if I just copy and paste dozens of lines of code into this question. As for the bottleneck: I used simple logs which include DateTimes to determine which operation is taking the longest. – Cr3pit0 Jan 10 '20 at 13:37
  • @StuartGrassie FastMember sounds interesting, although I'm not sure if that is what I need. I will check it out when I have the time. – Cr3pit0 Jan 10 '20 at 13:51
  • As @StuartGrassie says, your `DataTable` populating code is **hideously** inefficient as you're populating row-by-row. FastMember or similar is your best bet. – Ian Kemp Jan 10 '20 at 21:35

1 Answer


Okay, I think I got it. I'm down to about 12 minutes for a file size of about 600 MB and about 3.1 million rows of data after filtering.

The first thing I tried was to replace the logic that populates my DataTable with FastMember. That worked, but didn't give the performance increase I was hoping for (I cancelled the process after 3 hours...). After more research I stumbled upon an old project called "osm2mssql" (https://archive.codeplex.com/?p=osm2mssql). I used a small part of its code, which directly reads the data from the osm.pbf file, and modified it for my use case (→ which is to extract address data from ways). I did actually use FastMember to write an IEnumerable<Address> to the DataTable, but I don't need OsmSharp and whatever extra dependencies it has anymore. So thank you very much for the suggestion of FastMember. I will certainly keep that library in mind in future projects.
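
For those interested in the FastMember part, a rough sketch of populating the DataTable from an IEnumerable<Address> is shown below. The property names are the ones from the code extract above; the surrogate Id column from my original method would have to be added separately, and my real code differs in the details.

// Rough sketch only; requires the FastMember NuGet package (using FastMember;).
// ObjectReader exposes the IEnumerable<Address> as an IDataReader, so the whole
// sequence can be loaded in one call instead of adding rows one by one.
private static DataTable convertAdressList_toDataTable_withFastMember(IEnumerable<Address> addresses)
{
    var dataTable = new DataTable();

    using (var reader = ObjectReader.Create(addresses,
        "Street", "Housenumber", "City", "Postcode", "Country"))
    {
        dataTable.Load(reader);
    }

    return dataTable;
}

The same ObjectReader can also be handed directly to SqlBulkCopy.WriteToServer, which skips the intermediate DataTable entirely.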

For those who are interested, I updated my GitHub project accordingly (https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction) (although I didn't thoroughly test it, because I moved on from the test project to the real deal, which is a Web API).

I'm quite sure it can be optimized further, but I don't think I care at the moment. 12 minutes for a method which might be called once a month to update the whole database is fine, I guess. Now I can move on to optimizing my queries for the autocomplete.
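
As a teaser for that next step, a plain prefix search is probably where I will start. The sketch below is only an assumption about the final schema (table "dbo.Addresses", column "Street"), not the actual query of the Web API.

// Rough sketch of an autocomplete lookup: return the first 10 distinct streets
// that start with the typed prefix. Table and column names are assumptions.
public static List<string> suggestStreets(string prefix, string connectionString)
{
    var suggestions = new List<string>();

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT DISTINCT TOP 10 Street FROM dbo.Addresses WHERE Street LIKE @prefix + '%' ORDER BY Street",
        connection))
    {
        command.Parameters.AddWithValue("@prefix", prefix);
        connection.Open();

        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                suggestions.Add(reader.GetString(0));
            }
        }
    }

    return suggestions;
}

A nonclustered index on the Street column should let this prefix LIKE do an index seek instead of a full scan.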

So thank you very much to whoever wrote "osm2mssql".
