
I have multiple CSV files with the following content:

Duration (ms),Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member Type
840866,8/31/2016 23:59,9/1/2016 0:13,31117,15th & Euclid St  NW,31228,8th & H St NW,W20409,Registered

And I have about 10 million rows of this data.

I need to normalise this data and split it into tables. I suppose there will be these tables: stations, bikes, rides. In OLAP terms, rides are facts, and stations and bikes are dimensions. I'm very new to data analysis, so I may be using incorrect terms, but this is the approach I'm trying to follow.

So the question is: how do I write this data into the database as optimally as possible? The approach I can imagine is the following (sketched in code after the list):

1) Read a line from the CSV.
2) Create a record for the station and get the foreign key for it (with a direct SQL query).
3) Create a record for the bike and get an FK for it.
4) Create a record for the datetime (I'm not sure whether it would be useful for further data analysis).
5) Create a record for the ride with an FK for each of its 'dimensions'.
6) Repeat.
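In code, I imagine something like this, assuming psycopg2 (the connection string, table layout, and the get_or_create helper are just illustrative):

import csv
import psycopg2

conn = psycopg2.connect("dbname=bikeshare")  # illustrative connection string
cur = conn.cursor()

def get_or_create_station(number, name):
    # One SELECT plus a possible INSERT per station reference --
    # this is what adds up to ~40 million queries over 10 million rows.
    cur.execute("SELECT id FROM stations WHERE number = %s", (number,))
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute(
        "INSERT INTO stations (number, name) VALUES (%s, %s) RETURNING id",
        (number, name),
    )
    return cur.fetchone()[0]

with open("rides.csv", newline="") as f:
    for rec in csv.DictReader(f):
        start_id = get_or_create_station(rec["Start station number"], rec["Start station"])
        end_id = get_or_create_station(rec["End station number"], rec["End station"])
        # ... same idea for the bike, then insert the ride row with the FKs ...

conn.commit()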

But with 10 million rows, this approach will make ~40 million queries to the database, which looks terrible and far from optimal.

Are there more optimal approaches/algorithms/technologies to do this? I'm going to use Python and PostgreSQL for it, if that matters.

  • 1) Load the whole dataset into a temporary table using the `copy` command ([docs for PG](https://www.postgresql.org/docs/current/static/sql-copy.html) and [docs for psycopg](http://initd.org/psycopg/docs/cursor.html#cursor.copy_from)); 2) Split the data using SQL (see the sketch after these comments). – Abelisto Oct 28 '16 at 16:44
  • Depending on your version of PostgreSQL, you can use file_fdw and select directly against the CSV files (no pre-loading required). – gsiems Oct 28 '16 at 16:49
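A minimal sketch of that two-step idea, assuming psycopg2's copy_expert and an illustrative schema (the staging/stations table and column names are made up, and ON CONFLICT assumes a unique constraint on stations.number):

import psycopg2

conn = psycopg2.connect("dbname=bikeshare")  # illustrative connection string
cur = conn.cursor()

# 1) Bulk-load the raw CSV into a staging table with a single COPY per file.
cur.execute("""
    CREATE TEMP TABLE staging (
        duration_ms bigint, start_date text, end_date text,
        start_station_number text, start_station text,
        end_station_number text, end_station text,
        bike_number text, member_type text
    )
""")
with open("rides.csv") as f:
    cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# 2) Split the staged rows into dimension tables with set-based SQL.
cur.execute("""
    INSERT INTO stations (number, name)
    SELECT start_station_number, start_station FROM staging
    UNION
    SELECT end_station_number, end_station FROM staging
    ON CONFLICT (number) DO NOTHING
""")
conn.commit()

Bikes and the rides fact table can be carved out of the staging table the same way, so the whole load becomes a handful of statements instead of millions of per-row queries.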

2 Answers


You can probably economize on queries by memoizing the function that creates the unique records, for example:

from functools import lru_cache

@lru_cache(maxsize=128)
def save_station(s):
    """Create the station record (if needed) and return its primary key."""
    # Results are cached on `s`, so repeated stations skip the database call.
    station = create_or_get_station_record(s)
    return station.id

If the input is sorted by station, then subsequent calls to save_station will not query the database once the record is created. Even if it's not perfectly sorted, this could help.

You can also batch the saving of rides: accumulate records and then call an executemany-style function (the exact call will depend on the libraries you are using).
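For example, with psycopg2 the batching could look roughly like this (the rides columns are assumptions, not a real schema):

from psycopg2.extras import execute_values

BATCH_SIZE = 10_000
batch = []

def flush(cur, batch):
    # One round trip inserts the whole batch instead of one query per ride.
    execute_values(
        cur,
        "INSERT INTO rides (duration_ms, start_station_id, end_station_id, bike_id) VALUES %s",
        batch,
    )
    batch.clear()

# Inside the CSV loop:
#   batch.append((duration, start_id, end_id, bike_id))
#   if len(batch) >= BATCH_SIZE:
#       flush(cur, batch)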

You could pre-process the data to create separate CSV files, then load each file.

ChrisP

According to the PostgreSQL documentation, the `copy` command is the optimal approach for populating a table with a large number of rows. On the other hand, the pandas library is one of the best tools for processing CSV files.

So the steps below can be an acceptable solution (sketched in code after the list):

1) Load the CSV files into data frames with pandas read_csv()
2) Process the data frames into the desired form
3) Store the processed data frames in temporary CSV files
4) Create the desired tables using SQL
5) Load the data from the temporary CSV files into the tables using the `copy` SQL command
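A rough sketch of that pipeline for the stations dimension, assuming psycopg2 and illustrative file/table names:

import pandas as pd
import psycopg2

df = pd.read_csv("rides.csv")

# Collect every station seen as a start or end point and de-duplicate.
stations = pd.concat([
    df[["Start station number", "Start station"]]
      .rename(columns={"Start station number": "number", "Start station": "name"}),
    df[["End station number", "End station"]]
      .rename(columns={"End station number": "number", "End station": "name"}),
]).drop_duplicates(subset="number")

# Write the processed frame to a temporary CSV, then bulk-load it with COPY.
stations.to_csv("stations.csv", index=False, header=False)

conn = psycopg2.connect("dbname=bikeshare")  # illustrative connection string
cur = conn.cursor()
with open("stations.csv") as f:
    cur.copy_expert("COPY stations (number, name) FROM STDIN WITH (FORMAT csv)", f)
conn.commit()

Bikes and the rides fact table follow the same pattern; for rides you would first merge the generated station and bike keys back into the data frame before writing its temporary CSV, and for 10 million rows the chunksize parameter of read_csv can keep memory usage in check.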
amin