
I have over a million snapshot files that I need to merge into a single file/DB for analysis.

My attempt is in the code below: for each URL in a list, I read a small CSV, take a few columns, parse the date field from text to a date, and write the result to a SQLite database.

While this code works well enough on a small subset of files, it is too slow to iterate over a million CSVs.

I'm not sure how to increase performance, or even whether Python is the right tool for the job. Any help improving this code, or other suggestions, will be much appreciated.

import pandas as pd
from sqlalchemy import create_engine
import datetime
import requests
import csv
import io

# SQLite database that all snapshots are appended to
csv_database2 = create_engine('sqlite:///csv_database_test.db')

# indices of the columns to keep from each snapshot CSV
col_num = [0, 8, 9, 12, 27, 31]

with open('url.csv', 'r') as line_list:
    reader = csv.DictReader(line_list)

    for line in reader:
        # download one snapshot CSV and load only the selected columns
        data = requests.get(line['URL'])
        df = pd.read_csv(io.StringIO(data.text), usecols=col_num, infer_datetime_format=True)
        df.columns.values[0] = 'DateTime'
        # parse the text timestamp, e.g. "Sun Feb 25 19:40:00 2018"
        df['ParseDateTime'] = [datetime.datetime.strptime(t, "%a %b %d %H:%M:%S %Y") for t in df.DateTime]
        df.to_sql('LineList', csv_database2, if_exists='append')
Chiyo018
  • A million files is a million files.. – user2864740 Feb 25 '18 at 19:40
  • Can the 'files' be "accessed locally"? Network access is *relatively slow*, and can benefit from parallelization. Can the 'files' be "pre-merged" (ie. concatenated by other tools) or "batched" (ie. by the remote server)? Can the network 'file' fetch be *separated* from the processing? -- as should be obvious, I suspect the #1 performance bottleneck is `requests.get`. Verify and/or disprove this. – user2864740 Feb 25 '18 at 19:41
  • probably want to avoid using pandas as that's more overhead plus holding everything in memory – eagle Feb 25 '18 at 19:42
  • Consider [importing csv](https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles) directly into SQLite using its CLI. Set up a PowerShell/Bash script to run through a directory. – Parfait Feb 25 '18 at 20:26
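
Following up on Parfait's CLI suggestion, here is a rough Python sketch of that approach. It assumes the snapshot files have already been downloaded into a local snapshots/ directory (a hypothetical path) and that the sqlite3 command-line shell is on PATH; the database file and table name are taken from the question's code.

import glob
import subprocess

DB = 'csv_database_test.db'

# "snapshots/" is a hypothetical directory of already-downloaded CSVs
for path in glob.glob('snapshots/*.csv'):
    # .mode csv / .import are sqlite3 shell dot-commands; each file is
    # appended to the LineList table without going through pandas.
    # Note: when the table already exists, .import also loads the header
    # row as data, so headers may need to be stripped beforehand.
    commands = ".mode csv\n.import {} LineList\n".format(path)
    subprocess.run(['sqlite3', DB], input=commands, text=True, check=True)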

1 Answer


IMHO Python is well suited for this task, and with a few simple modifications you can achieve the performance you need.

AFAICS there could be two bottlenecks that affect performance:

Downloading the URLs

You download a single file at a time; if downloading one file takes 0.2 seconds, downloading 1M files will take more than 2 days! I suggest you parallelize the downloads. Example code using concurrent.futures:

from concurrent.futures import ThreadPoolExecutor
import csv
import datetime
import io

import pandas as pd
import requests
from sqlalchemy import create_engine

csv_database2 = create_engine('sqlite:///csv_database_test.db')
col_num = [0, 8, 9, 12, 27, 31]


def insert_url(line):
    """download a single csv url and insert it to SQLite"""
    data = requests.get(line['URL'])
    df = pd.read_csv(io.StringIO(data.text), usecols=col_num,
                     infer_datetime_format=True)
    df.columns.values[0] = 'DateTime'
    df['ParseDateTime'] = [
        datetime.datetime.strptime(t, "%a %b %d %H:%M:%S %Y") for t in
        df.DateTime]
    df.to_sql('LineList', csv_database2, if_exists='append')


# read the URL list once, then hand the rows to the thread pool
with open('url.csv', 'r') as line_list:
    lines = list(csv.DictReader(line_list))

# download (and insert) up to 128 files concurrently
with ThreadPoolExecutor(max_workers=128) as pool:
    pool.map(insert_url, lines)

Inserting into SQL

Take a look at how to optimize the SQL insertions in this SO answer.
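
One concrete way to cut the insert overhead (a sketch only, not taken from the linked answer) is to buffer several downloaded DataFrames and write them to SQLite in one to_sql call instead of one call per file. The batch size of 100 is an arbitrary assumption; the engine and table name are from the code above.

from sqlalchemy import create_engine
import pandas as pd

csv_database2 = create_engine('sqlite:///csv_database_test.db')

BATCH_SIZE = 100   # arbitrary batch size, tune for your data
buffer = []


def queue_frame(df):
    """Collect one parsed snapshot; flush to SQLite when the buffer is full."""
    buffer.append(df)
    if len(buffer) >= BATCH_SIZE:
        flush()


def flush():
    """Append all buffered snapshots to SQLite in a single to_sql call."""
    if buffer:
        batch = pd.concat(buffer, ignore_index=True)
        batch.to_sql('LineList', csv_database2, if_exists='append',
                     index=False, chunksize=1000)
        buffer.clear()

insert_url would then call queue_frame(df) instead of df.to_sql(...), with one final flush() at the end. If this is combined with the thread pool above, the shared buffer would also need a lock; the sketch ignores that for brevity.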

Further guidance

  • I would start with the parallel requests, as that seems to be the larger bottleneck
  • run a profiler to get a better idea of where your code spends most of its time (see the sketch below)
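
A minimal profiling sketch with the standard-library cProfile, assuming the insert_url function and lines list from the code above; profiling a small sample is enough to show whether the time goes to the network, the parsing, or the SQL inserts.

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

for line in lines[:100]:      # profile a small sample, not all 1M files
    insert_url(line)

profiler.disable()
# sort by cumulative time so the slowest stages appear first
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)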
ShmulikA
  • Thank you! That was very helpful; there is definitely an improvement in requesting the URLs. I will look into improving the write-to-SQL code a bit further. – Chiyo018 Feb 26 '18 at 22:21