I have multiple CSV files with the following content:
Duration (ms),Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member Type
840866,8/31/2016 23:59,9/1/2016 0:13,31117,15th & Euclid St NW,31228,8th & H St NW,W20409,Registered
In total there are about 10 million rows of this data.
I need to normalise this data and split it into tables. I expect there will be tables for stations, bikes and rides. In OLAP terms, rides are facts, while stations and bikes are dimensions. I'm very new to data analysis, so I may be using the terms incorrectly, but that is the approach I'm trying to follow.
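To make it concrete, this is roughly the schema I have in mind (the table and column names, types and connection parameters are just placeholders I made up, nothing is settled yet):

```python
import psycopg2

# Placeholder connection parameters
conn = psycopg2.connect(dbname="bikeshare", user="me")
cur = conn.cursor()

# Dimension tables: stations and bikes; fact table: rides
cur.execute("""
    CREATE TABLE IF NOT EXISTS stations (
        id     serial PRIMARY KEY,
        number integer UNIQUE NOT NULL,  -- e.g. 31117
        name   text NOT NULL             -- e.g. '15th & Euclid St NW'
    );
    CREATE TABLE IF NOT EXISTS bikes (
        id     serial PRIMARY KEY,
        number text UNIQUE NOT NULL      -- e.g. 'W20409'
    );
    CREATE TABLE IF NOT EXISTS rides (
        id               bigserial PRIMARY KEY,
        duration_ms      integer NOT NULL,
        start_date       timestamp NOT NULL,
        end_date         timestamp NOT NULL,
        start_station_id integer REFERENCES stations (id),
        end_station_id   integer REFERENCES stations (id),
        bike_id          integer REFERENCES bikes (id),
        member_type      text
    );
""")
conn.commit()
```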
So the question is: how do I write this data into the database as efficiently as possible? The approach I can imagine is the following (a rough code sketch follows the list):
1) Read a line from the CSV.
2) Create a record for the station and get a foreign key for it (with a direct SQL query).
3) Create a record for the bike and get an FK for it.
4) Create a record for the datetime (I'm not sure whether this would be useful for further analysis).
5) Create a record for the ride with an FK for each of its 'dimensions'.
6) Repeat for every row.
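In code I imagine the loop would look roughly like this (only a sketch: the file name, the schema from above and the psycopg2 usage are just how I picture it, and I left out the datetime dimension from step 4):

```python
import csv
from datetime import datetime

import psycopg2

conn = psycopg2.connect(dbname="bikeshare", user="me")  # placeholder connection params
cur = conn.cursor()

def get_or_create_station(number, name):
    # One round trip per station per row -- this is the part that worries me.
    cur.execute(
        """INSERT INTO stations (number, name) VALUES (%s, %s)
           ON CONFLICT (number) DO UPDATE SET name = EXCLUDED.name
           RETURNING id""",
        (number, name),
    )
    return cur.fetchone()[0]

def get_or_create_bike(number):
    cur.execute(
        """INSERT INTO bikes (number) VALUES (%s)
           ON CONFLICT (number) DO UPDATE SET number = EXCLUDED.number
           RETURNING id""",
        (number,),
    )
    return cur.fetchone()[0]

def parse_ts(value):
    # Dates in the CSV look like '8/31/2016 23:59'
    return datetime.strptime(value, "%m/%d/%Y %H:%M")

with open("rides.csv", newline="") as f:
    for row in csv.DictReader(f):
        start_id = get_or_create_station(int(row["Start station number"]), row["Start station"])
        end_id = get_or_create_station(int(row["End station number"]), row["End station"])
        bike_id = get_or_create_bike(row["Bike number"])
        cur.execute(
            """INSERT INTO rides (duration_ms, start_date, end_date,
                                  start_station_id, end_station_id, bike_id, member_type)
               VALUES (%s, %s, %s, %s, %s, %s, %s)""",
            (int(row["Duration (ms)"]), parse_ts(row["Start date"]), parse_ts(row["End date"]),
             start_id, end_id, bike_id, row["Member Type"]),
        )

conn.commit()
```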
But with 10 million rows this approach will issue roughly 40 million queries to the database, which looks terrible and far from optimal.
Are there more optimal approaches/algorithms/technologies to do this? I'm going to use Python and PostgreSQL for it, if that matters.