
We currently have a tricky problem and need ideas for the most efficient way to solve it.

We periodically iterate through large CSV files (~50,000 to 2 million rows), and for each row, we need to check a database table for matching columns.

For example, each CSV row could have details about an event (artist, venue, date/time, etc.). For each row, we check our database (PostgreSQL) for the rows that best match the artist, venue and date/time, and then perform operations if a match is found.

Currently, the entire process is highly CPU-, memory- and time-intensive because it pulls rows one by one, so we perform the matching in batches, but we are still looking for an efficient way to perform the comparison, both memory-wise and time-wise.

Thanks.

Oduah Tobi
  • It would definitely help if you provided the schema and examples of the relations that need to be taken into account. But generally, batching is a good way to go. – Ivan Yurov Feb 27 '19 at 14:35

1 Answer

  1. Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. How to import CSV file data into a PostgreSQL table?)
  2. Perform matching and operations in-database, i.e. in SQL
  3. If necessary, truncate the temporary table afterwards

This would move most of the load onto the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation, etc.).
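
As a rough illustration of this approach, here is a minimal sketch of the steps in PostgreSQL, assuming a hypothetical events table with artist, venue and starts_at columns and a placeholder CSV path (the real schema and matching rules aren't shown in the question):

    -- Step 1: stage the CSV in a temporary table
    -- (use \copy from psql instead of COPY if the file lives on the client machine)
    CREATE TEMP TABLE csv_events (
        artist    text,
        venue     text,
        starts_at timestamptz
    );

    COPY csv_events (artist, venue, starts_at)
    FROM '/path/to/events.csv' WITH (FORMAT csv, HEADER true);

    -- Step 2: match all staged rows against the existing events table
    -- in a single set-based query instead of row-by-row lookups
    SELECT c.artist, c.venue, c.starts_at, e.id AS matched_event_id
    FROM csv_events c
    JOIN events e
      ON e.artist    = c.artist
     AND e.venue     = c.venue
     AND e.starts_at = c.starts_at;

    -- Step 3: clear the staging table if the session is reused
    TRUNCATE csv_events;

The join above uses exact equality; for "match the most" semantics you could instead join on a time window around starts_at, or compare artist/venue names with the similarity() function from the pg_trgm extension.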

claasz
  • Thanks for suggesting! We had some constraints initially that basically took importing into a temporary table out of the question, but we've worked around that now – Oduah Tobi Feb 27 '19 at 18:37