The CSV file

I have a CSV file that's about 1.3 GB in size:

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         10/4/2021   1:23 PM     1397998768 XBTUSD.csv

This is the complete list of trading data for Bitcoin on the Kraken exchange.

Here's what the data in the CSV looks like:

> Get-Content .\XBTUSD.csv | Select-Object -First 10
1381095255,122.00000,0.10000000
1381179030,123.61000,0.10000000
1381201115,123.91000,1.00000000
1381201115,123.90000,0.99160000
1381210004,124.19000,1.00000000
1381210004,124.18000,1.00000000
1381311039,124.01687,1.00000000
1381311093,124.01687,1.00000000
1381311094,123.84000,0.82300000
1381431835,125.85000,1.00000000

More information about the file is available here:

https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales

The file can be downloaded from here:

https://drive.google.com/drive/folders/1jI3mZvrPbInNAEaIOoMbWvFfgRDZ44TT

See the file XBT.zip. Inside that archive is XBTUSD.csv.

Baseline test - importing directly into sqlite

If I create the following table in sqlite:

CREATE TABLE CsvTrades (
    "TimeStamp" TEXT NOT NULL,
    "Price"     TEXT NOT NULL,
    "Volume"    TEXT NOT NULL
);

and run the following to import the CSV (as well as time how long it takes):

$a = Get-Date

sqlite3.exe .\kraken-trades.db -cmd '.mode csv' '.import C:/Users/dharm/XBTUSD.csv CsvTrades'

$b = Get-Date

($b - $a).TotalMinutes

I get the following:

1.56595191666667

1.5 minutes. Not bad!

Using EF Core

In the code below, I'm using the CsvHelper package:

https://joshclose.github.io/CsvHelper/getting-started/

Here's a class for the CSV file rows:

public class CsvRow
{
    [CsvHelper.Configuration.Attributes.Index(0)]
    public long TimeStamp { get; set; }

    [CsvHelper.Configuration.Attributes.Index(1)]
    public decimal Price { get; set; }

    [CsvHelper.Configuration.Attributes.Index(2)]
    public decimal Quantity { get; set; }
}

Here's a class for the Trade entity:

[Index(nameof(TimeStamp))]
public class Trade
{
    public int Id { get; set; }
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }
}

The DbContext is straightforward:

public class AppContext : DbContext
{
    public DbSet<Trade> Trades { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        var folder = Environment.SpecialFolder.LocalApplicationData;

        var path = Environment.GetFolderPath(folder);

        var db_path = $"{path}{System.IO.Path.DirectorySeparatorChar}kraken-trades.db";
                
        optionsBuilder.UseSqlite($"Data Source={db_path}");
    }
}

And finally, the function that performs the import:

void initialize_from_csv()
{
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };

    using (var reader = new StreamReader(@"C:\Users\dharm\XBTUSD.csv"))
    using (var csv = new CsvReader(reader, config))
    {
        var records = csv.GetRecords<CsvRow>().Select(row => new Trade()
        {
            Price = row.Price,
            Quantity = row.Quantity,
            TimeStamp = DateTimeOffset.FromUnixTimeSeconds(row.TimeStamp).UtcDateTime
        });

        using (var db = new AppContext())
        {
            Console.WriteLine(DateTime.Now);
                        
            while (true)
            {
                //var items = records.Take(10_000).ToList();

                var items = records.Take(100_000).ToList();

                if (items.Any() == false) break;

                Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

                db.AddRange(items);
                db.SaveChanges();
            }

            Console.WriteLine(DateTime.Now);
        }
    }
}

The question

When I let this run, it does indeed proceed to add the items to the database. However, it's pretty slow; I haven't timed it to completion but I could see it taking over an hour.

Is there a good way to make this faster while still using EF Core?

Notes

The code referenced above is available here in a single file:

https://github.com/dharmatech/kraken-trades-database/blob/003-minimal/KrakenTradesDatabase/Program.cs

It's a .NET 6 project. Let me know if you have any issues building and running it.

Timing

I added some code to time the batch adds. It looks like it's taking around 7 seconds per 100,000 records.

Starting batch at 2013-10-06. Batch took 00:00:08.7689932.
Starting batch at 2015-12-08. Batch took 00:00:06.7453421.
Starting batch at 2016-04-19. Batch took 00:00:06.7833506.
Starting batch at 2016-06-25. Batch took 00:00:06.7083806.
Starting batch at 2016-08-22. Batch took 00:00:06.7826717.
Starting batch at 2016-11-20. Batch took 00:00:06.4212123.
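
The timing is just a Stopwatch around each batch; roughly this (a minimal sketch of the idea, not the exact code from the repository):

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();

    db.AddRange(items);
    db.SaveChanges();

    stopwatch.Stop();

    Console.WriteLine("Starting batch at {0:yyyy-MM-dd}. Batch took {1}.", items[0].TimeStamp, stopwatch.Elapsed);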

wc says there are 41,695,261 lines:

$ wc -l XBTUSD.csv
41695261 XBTUSD.csv

So at that rate (41,695,261 rows / 100,000 per batch ≈ 417 batches at roughly 7 seconds each), it would take around 48 minutes.

Why EF Core?

Some folks have asked, why use EF Core for this? Why not just a direct import?

The example above was intentionally simplified to focus on the speed of the import.

I have more elaborate versions where there are relationships with other entities. In that case:

  • Setting up the other tables and foreign key properties is more straightforward with EF Core.

  • I can more easily switch between database backends (SQL Server, PostgreSQL, sqlite).

See for example this branch, where multiple symbols are imported. There's a relationship between the Trade and Symbol entities there; there could be other relationships as well.

https://github.com/dharmatech/kraken-trades-database/blob/006/KrakenTradesDatabase/Program.cs
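
To give a rough idea of what that looks like (a hypothetical sketch, not the exact classes from that branch), the entities are along these lines, and EF Core takes care of creating the tables and the foreign key:

    public class Symbol
    {
        public int Id { get; set; }
        public string Name { get; set; }

        public List<Trade> Trades { get; set; }
    }

    [Index(nameof(TimeStamp))]
    public class Trade
    {
        public int Id { get; set; }

        // Foreign key to Symbol; EF Core sets up the relationship by convention.
        public int SymbolId { get; set; }
        public Symbol Symbol { get; set; }

        public decimal Price { get; set; }
        public decimal Quantity { get; set; }
        public DateTime TimeStamp { get; set; }
    }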

dharmatech

  • Entity Framework (or any ORM for that matter) sacrifices performance for convenience. Almost everything you do through an ORM will be slower than if you were to pass the query to the server as a string. Another factor is that in your first example, the database server itself is reading the file and importing the results directly. EF has to send the text over the network, which will be slower. The final factor (that I can think of) is that you're executing `db.SaveChanges()` multiple times. Every time you do this, it executes a new query on the server. This is also slower than doing it at once. – Jesse Dec 03 '21 at 00:15
  • Bulk operations is not what EF Core is good at. At best you could create a transaction and reuse a prepared query from raw sql. – Jeremy Lakeman Dec 03 '21 at 00:16
  • You could reset the change tracker after each save changes (https://learn.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.changetracking.changetracker.clear?view=efcore-5.0). And turn off `.AutoDetectChangesEnabled`. That should reduce the Big O complexity a bit. But bypassing the context is going to be quicker. – Jeremy Lakeman Dec 03 '21 at 00:18
  • You can use SQLBulk instead of Entity Framework, or a hybrid like I use in this answer: https://stackoverflow.com/a/69574353/888472 – Leandro Bardelli Dec 03 '21 at 00:38
  • BTW, the answer to your question is: NO. And DO NOT DO THIS WITH EF – Leandro Bardelli Dec 03 '21 at 00:39
  • I get about 15000 rows a second streaming a csv over signalr with the bulk extensions: https://github.com/borisdj/EFCore.BulkExtensions – Caius Jard Dec 03 '21 at 00:40
  • @CaiusJard Looks like I'm getting around 14,000 rows a second with the above code. However, I'm not streaming the data over signalr as you mention. – dharmatech Dec 03 '21 at 00:44
  • @CaiusJard Thank you for the reference to that library! – dharmatech Dec 03 '21 at 00:45
  • I should have mentioned that that's into an Azure SQL db on the lowest paid-for performance tier, in another country from the source machine.. but critically, I only get about 1000 to 2000 rows a second if I don't use the bulk extensions; it helped a lot.. I hope your local streaming will be faster! – Caius Jard Dec 03 '21 at 00:47
  • @CaiusJard, omgosh omgosh omgosh... Wow... it went from 48 minutes to 5.7 minutes.... That's totally reasonable! Thank you Caius! – dharmatech Dec 03 '21 at 07:49
  • @CaiusJard, I also donated to the project: https://github.com/borisdj/EFCore.BulkExtensions/issues/675 – dharmatech Dec 03 '21 at 07:55
  • @LeandroBardelli, see the answer posted below. Using EFCore.BulkExtensions, the import is now practical via EF Core. – dharmatech Dec 03 '21 at 07:57
  • @JeremyLakeman see the answer posted below. Using EFCore.BulkExtensions, the import is now practical via EF Core. – dharmatech Dec 03 '21 at 07:57
  • @dharmatech you're welcome, and many thanks for the offer - it means every bit as much as actually receiving lunch would! You can make a small donation to a worthy charitable cause of your choice, in its stead if you like – Caius Jard Dec 03 '21 at 10:51
  • Change Context to QueryTrackingBehaviour.NoTracking() – Aaron Glover Dec 04 '21 at 07:07

2 Answers

EFCore.BulkExtensions

Using the following:

https://github.com/borisdj/EFCore.BulkExtensions

and then changing this line:

db.AddRange(items);

to:

db.BulkInsert(items);

makes the import go from 48 minutes to 5.7 minutes.
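
In context, the batching loop from the question stays the same and only the insert step changes; roughly (a sketch, see the linked branch below for the exact code):

    using EFCore.BulkExtensions;

    // ...

    while (true)
    {
        var items = records.Take(100_000).ToList();

        if (items.Any() == false) break;

        Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

        // BulkInsert writes the batch directly, bypassing the change tracker,
        // so db.SaveChanges() is no longer needed for these rows.
        db.BulkInsert(items);
    }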

This version of the project is available here:

https://github.com/dharmatech/kraken-trades-database/blob/004-bulk-extensions/KrakenTradesDatabase/Program.cs

Thanks

Thanks to Caius Jard who suggested EFCore.BulkExtensions in a comment above.

dharmatech

  • Nice answer, but what about the memory consumption of the objects created? This has already been studied; there are a lot of questions and comments about BulkExtensions here on S.O., and it is still better to use minimal ADO.NET work and/or a hybrid approach – Leandro Bardelli Dec 03 '21 at 12:40

Here is a complete C# (10.0) program that inserts the CSV data faster than the sqlite3 tool. It uses my Sylvan.Data.Csv library, which is the fastest CSV parser for .NET.

On my machine sqlite3 will insert the data in 1:07.6, and my code inserts in 1:02.9.

While this doesn't meet your requirement of "still using EFCore", I think the difference in performance speaks for itself.

Packages:

<PackageReference Include="Sylvan.Data.Csv" Version="1.1.9" />
<PackageReference Include="System.Data.SQLite" Version="1.0.115.5" />

Code:

using System.Collections.ObjectModel;
using System.Data.Common;
using System.Data.SQLite;
using System.Diagnostics;
using Sylvan.Data.Csv;

var sw = Stopwatch.StartNew();

var conn = new SQLiteConnection("Data Source=test.db");
conn.Open();

var data = CsvDataReader.Create("xbtusd.csv", new CsvDataReaderOptions { HasHeaders = false });


// create the target table
{
    using var cmd = conn.CreateCommand();
    var tbl = "create table CsvTrades (TimeStamp TEXT NOT NULL, Price TEXT NOT NULL, Volume TEXT NOT NULL)";
    cmd.CommandText = tbl;
    cmd.ExecuteNonQuery();
}

// get the schema for the target table.
ReadOnlyCollection<DbColumn> ss;
{
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "select * from CsvTrades limit 0;";
    var r = cmd.ExecuteReader();
    ss = r.GetColumnSchema();
}

// create the parameterized insert command
var cmdW = new StringWriter();
cmdW.Write("insert into CsvTrades values(");
int i = 0;
foreach (var c in ss)
{
    if (i > 0)
        cmdW.Write(",");
    cmdW.Write("$p" + i++);
}

cmdW.Write(");");
var cmdt = cmdW.ToString();

// insert CSV data.
using (var tx = conn.BeginTransaction())
{
    var cmd = conn.CreateCommand();
    cmd.CommandText = cmdt;
    for (i = 0; i < data.FieldCount; i++)
    {
        var p = cmd.CreateParameter();
        p.ParameterName = "$p" + i;
        cmd.Parameters.Add(p);
    }
    cmd.Prepare();
    while (data.Read())
    {
        for (i = 0; i < data.FieldCount; i++)
        {
            cmd.Parameters[i].Value = data.GetValue(i);
        }
        cmd.ExecuteNonQuery();
    }

    tx.Commit();
}

sw.Stop();
Console.WriteLine($"Inserted {data.RowNumber} records in {sw.Elapsed}");

UPDATE: I realized that my code isn't doing everything it needs to: it wasn't converting the Unix seconds to a DateTime. Modifying the insert loop as follows:

    while (data.Read())
    {
        cmd.Parameters[0].Value = DateTime.UnixEpoch.AddSeconds(data.GetInt64(0));
        cmd.Parameters[1].Value = data.GetString(1);
        cmd.Parameters[2].Value = data.GetString(2);
        cmd.ExecuteNonQuery();
    }

This slows things down to 1:17.5, which is somewhat slower than sqlite3, but then the sqlite3 import isn't doing the data conversion, so you'll end up with the original integer (long) value.

MarkPflug

  • Hey Mark! Thanks for the suggestion. I added a note to the question explaining why I'd prefer to use EF Core here. See the section at the end labeled *Why EF Core?*. – dharmatech Dec 03 '21 at 19:31
  • The import program could get significantly more complicated with the extra relationships mentioned there. – dharmatech Dec 03 '21 at 19:41