
So my use-case is this: I need to parse 800,000 files in an S3 bucket, extract the data I need, and load it into a SQLite database.

I tried looping over them and writing each record into the DB individually, but this was taking far too long (over 36 hours in and I wasn't even 40% done).
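
For reference, the first approach was essentially the pattern below. This is a simplified sketch: the S3 download/parse step is replaced with a stub, and the table layout is made up.

import sqlite3

def extract_data(key):
    # Stand-in for the real S3 download + parsing logic.
    return ("a", "b", "c")

file_keys = [f"file_{i}" for i in range(800_000)]  # stand-in for the bucket listing

conn = sqlite3.connect("example.db")
c = conn.cursor()
c.execute("create table if not exists table_abc (col1 text, col2 text, col3 text)")

for key in file_keys:
    row = extract_data(key)
    c.execute("insert into table_abc values (?, ?, ?)", row)
    conn.commit()  # one commit (i.e. one transaction) per file

conn.close()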

So my next approach is to loop over the 800k files, extract the data I need, and write the resulting INSERT statements to a .txt file. That way I can simply run the giant .txt file as SQL to load the necessary rows.

This is the premise:

# First, I create a .txt file that houses the queries I want.

query = f"insert into table_abc values ({abc})"  # abc = the values extracted from one file

with open("sql_queries.txt", "a") as file_object:
    file_object.write(query + "\n")
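
For example, if one of the extracted values contains a stray character (say `abc` ends up as `foo^bar`), the file gets the line `insert into table_abc values (foo^bar)`, which is not valid SQL. The exact value here is made up, but that's the shape of the failures I'm seeing.
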
# Then, I execute the queries within this .txt file to load them into my DB.

import sqlite3

conn = sqlite3.connect("example.db")
c = conn.cursor()

with open("sql_queries.txt") as sql_file:
    sql_as_string = sql_file.read()
    c.executescript(sql_as_string)

conn.close()

The problem, of course, is that there are always characters that end up not being escaped or properly sanitized. Is there any way to write parameterized queries into my text file instead, so that when the file is run it doesn't hit errors like `OperationalError: unrecognized token: "^"`?

I know why this is happening: my .txt file houses SQL INSERT queries whose values aren't properly escaped. But I'm unsure how to build a .txt file where the queries are properly sanitized, and there doesn't seem to be much info on this outside of using parameterized queries directly in `execute()` calls.
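
For what it's worth, I understand how a parameterized query works when executed directly; roughly the sketch below (table layout made up again). What I can't figure out is how to get this effect through a .txt file of pre-built statements:

import sqlite3

conn = sqlite3.connect("example.db")
c = conn.cursor()
c.execute("create table if not exists table_abc (col1 text)")

# sqlite3 quotes the value itself, so characters like ^ or ' can't
# break the statement the way they do in my f-string version.
row = ("value with ^ and 'quotes'",)
c.execute("insert into table_abc values (?)", row)

conn.commit()
conn.close()

One idea would be to write only the raw values to the file (e.g. one tab-separated line per row) and feed the parsed lines to `executemany()`, but I'm not sure whether that's the right pattern here.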

  • I'd concentrate on fixing whatever you're doing wrong with your first approach (Inserting each row as a separate transaction, maybe, instead of batching them together? Not using `executemany()`? etc.) – Shawn Dec 28 '20 at 10:31
  • See https://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite (C, but parts might be applicable to python's sqlite3 bindings) – Shawn Dec 28 '20 at 10:33
  • Is this the same table for all queries? Have you tried using pandas? You can create one dataframe with all your data and save it to your DB with pd.to_sql. No need to execute so many SQL transactions – IoaTzimas Dec 28 '20 at 11:16
  • Yep, I tried the pandas approach, but my program ran out of memory! Hence why I figured I need another method. – Yeahprettymuch Dec 28 '20 at 11:40

0 Answers