The CSV file
I have a CSV file that's about 1.3 GB in size:
```
Mode                LastWriteTime        Length Name
----                -------------        ------ ----
-a----         10/4/2021  1:23 PM    1397998768 XBTUSD.csv
```
This is the complete list of trading data for Bitcoin on the Kraken exchange.
Here's what the data in the CSV looks like:
```powershell
> Get-Content .\XBTUSD.csv | Select-Object -First 10
1381095255,122.00000,0.10000000
1381179030,123.61000,0.10000000
1381201115,123.91000,1.00000000
1381201115,123.90000,0.99160000
1381210004,124.19000,1.00000000
1381210004,124.18000,1.00000000
1381311039,124.01687,1.00000000
1381311093,124.01687,1.00000000
1381311094,123.84000,0.82300000
1381431835,125.85000,1.00000000
```
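The first column is a Unix timestamp in seconds (the import code below converts it the same way); the second and third columns are the price and volume. A quick sanity check on the first sample row:

```csharp
using System;

class Program
{
    static void Main()
    {
        // First sample row: 1381095255,122.00000,0.10000000
        // Interpreting the first field as Unix seconds:
        var timeStamp = DateTimeOffset.FromUnixTimeSeconds(1381095255).UtcDateTime;

        Console.WriteLine(timeStamp.ToString("yyyy-MM-dd HH:mm:ss")); // 2013-10-06 21:34:15
    }
}
```

So the file starts in October 2013, which matches the first batch date in the timing output further down.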
More information about the file is available here:
The file can be downloaded from here:
https://drive.google.com/drive/folders/1jI3mZvrPbInNAEaIOoMbWvFfgRDZ44TT
See the file XBT.zip. Inside that archive is XBTUSD.csv.
Baseline test - importing directly into sqlite
If I create the following table in sqlite:
```sql
CREATE TABLE CsvTrades (
    "TimeStamp" TEXT NOT NULL,
    "Price"     TEXT NOT NULL,
    "Volume"    TEXT NOT NULL
);
```
and run the following to import the CSV (as well as time how long it takes):
```powershell
$a = Get-Date
sqlite3.exe .\kraken-trades.db -cmd '.mode csv' '.import C:/Users/dharm/XBTUSD.csv CsvTrades'
$b = Get-Date
($b - $a).TotalMinutes
```
I get the following:
```
1.56595191666667
```
About a minute and a half. Not bad!
Using EF Core
In the code below, I'm using the CsvHelper package: https://joshclose.github.io/CsvHelper/getting-started/
Here's a class for the CSV file rows:
```csharp
public class CsvRow
{
    [CsvHelper.Configuration.Attributes.Index(0)]
    public long TimeStamp { get; set; }

    [CsvHelper.Configuration.Attributes.Index(1)]
    public decimal Price { get; set; }

    [CsvHelper.Configuration.Attributes.Index(2)]
    public decimal Quantity { get; set; }
}
```
Here's a class for the Trade entity:
```csharp
[Index(nameof(TimeStamp))]
public class Trade
{
    public int Id { get; set; }
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }
}
```
The DbContext is straightforward:
```csharp
public class AppContext : DbContext
{
    public DbSet<Trade> Trades { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        var folder = Environment.SpecialFolder.LocalApplicationData;
        var path = Environment.GetFolderPath(folder);
        var db_path = $"{path}{System.IO.Path.DirectorySeparatorChar}kraken-trades.db";

        optionsBuilder.UseSqlite($"Data Source={db_path}");
    }
}
```
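One assumption in the code above: the kraken-trades.db schema has to exist before the import runs. If it doesn't (and migrations aren't set up), a minimal way to create it from the model is EnsureCreated:

```csharp
// Create the database file and the Trades table from the model if they
// don't already exist. (Migrations are the usual choice for schemas that
// evolve; EnsureCreated is fine for a one-off import database like this.)
using (var db = new AppContext())
{
    db.Database.EnsureCreated();
}
```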
And finally, the function that performs the import:
```csharp
void initialize_from_csv()
{
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };

    using (var reader = new StreamReader(@"C:\Users\dharm\XBTUSD.csv"))
    using (var csv = new CsvReader(reader, config))
    {
        var records = csv.GetRecords<CsvRow>().Select(row => new Trade()
        {
            Price = row.Price,
            Quantity = row.Quantity,
            TimeStamp = DateTimeOffset.FromUnixTimeSeconds(row.TimeStamp).UtcDateTime
        });

        using (var db = new AppContext())
        {
            Console.WriteLine(DateTime.Now);

            while (true)
            {
                // Each Take(...) resumes CsvHelper's underlying stream rather
                // than restarting it, so successive iterations pull successive
                // 100k-row chunks.
                //var items = records.Take(10_000).ToList();
                var items = records.Take(100_000).ToList();

                if (items.Any() == false) break;

                Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

                db.AddRange(items);
                db.SaveChanges();
            }

            Console.WriteLine(DateTime.Now);
        }
    }
}
```
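A side note on why that batching loop terminates at all: CsvHelper's GetRecords yields records from the live reader, so each re-enumeration of records resumes the underlying file instead of restarting it. A minimal stand-in (no CsvHelper involved) demonstrating the same resume-on-re-enumeration behavior:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Shared enumerator standing in for CsvHelper's internal reader state.
    static IEnumerator<int> source = Enumerable.Range(0, 10).GetEnumerator();

    // Like GetRecords: a generator that yields from live shared state, so a
    // second enumeration picks up where the first one stopped.
    static IEnumerable<int> Records()
    {
        while (source.MoveNext())
            yield return source.Current;
    }

    static void Main()
    {
        var first  = Records().Take(4).ToList();
        var second = Records().Take(4).ToList();

        Console.WriteLine(string.Join(",", first));   // prints "0,1,2,3"
        Console.WriteLine(string.Join(",", second));  // prints "4,5,6,7" - resumed, not restarted
    }
}
```

With a plain in-memory sequence instead (say Enumerable.Range directly), each Take would start over from the beginning and the import loop above would never see an empty batch.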
The question
When I let this run, it does indeed proceed to add the items to the database. However, it's pretty slow; I haven't timed it to completion but I could see it taking over an hour.
Is there a good way to make this faster while still using EF Core?
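For reference, the directions I've seen suggested (not benchmarked here) mostly target change-tracking overhead: turn off AutoDetectChangesEnabled during the bulk insert, and clear the tracker between batches so the context doesn't accumulate millions of tracked entities. A sketch of how the batch loop might change (ChangeTracker.Clear() requires EF Core 5.0 or later):

```csharp
using (var db = new AppContext())
{
    // We're only inserting, so skip per-entity change scanning.
    db.ChangeTracker.AutoDetectChangesEnabled = false;

    while (true)
    {
        var items = records.Take(100_000).ToList();
        if (items.Any() == false) break;

        db.AddRange(items);
        db.SaveChanges();

        // Detach the batch so the tracked-entity count stays bounded.
        // (EF Core 5.0+; on older versions, dispose and recreate the
        // context instead.)
        db.ChangeTracker.Clear();
    }
}
```

Deferring the TimeStamp index until after the load is another commonly mentioned option; I haven't measured any of these here.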
Notes
The code referenced above is available here in a single file:
It's a .NET 6 project. Let me know if you have any issues building and running it.
Timing
I added some code to time the batch adds. It looks like it's taking around 7 seconds per 100,000 records.
```
Starting batch at 2013-10-06. Batch took 00:00:08.7689932.
Starting batch at 2015-12-08. Batch took 00:00:06.7453421.
Starting batch at 2016-04-19. Batch took 00:00:06.7833506.
Starting batch at 2016-06-25. Batch took 00:00:06.7083806.
Starting batch at 2016-08-22. Batch took 00:00:06.7826717.
Starting batch at 2016-11-20. Batch took 00:00:06.4212123.
```
wc says there are 41,695,261 lines:

```
$ wc -l XBTUSD.csv
41695261 XBTUSD.csv
```
So at that rate (41,695,261 lines / 100,000 per batch ≈ 417 batches, at ~7 seconds each ≈ 2,900 seconds), it would take around 48 minutes.
Why EF Core?
Some folks have asked, why use EF Core for this? Why not just a direct import?
The example above was intentionally simplified to focus on the speed of the import.
I have more elaborate versions where there are relationships with other entities. In that case:

- Setting up the other tables and foreign key properties is more straightforward with EF Core.
- I can more easily switch between database backends (SQL Server, PostgreSQL, sqlite).
See for example this branch where multiple symbols are imported. There's a relationship between the Trade and Symbol entities there. There could be other relationships as well.

https://github.com/dharmatech/kraken-trades-database/blob/006/KrakenTradesDatabase/Program.cs