I have an SQLite database table of ~529M rows. I chose SQLite because there won't be many writes to the db (mostly reads) and I wanted the simplicity of having it in a single file. Unfortunately, I made a mistake when generating the database: now I have to fill in some NULL values in two columns via an inner join to another table.
Table formats:

>>> cdr
ego_id  alter_id  date      tower_id  city         state
123     456       20200101  98766     Los Angeles  California
789     143       20200105  09232     NULL         NULL
789     143       20200105  42106     NULL         NULL

>>> towermap
tower_id  city         state
98766     Los Angeles  California
09232     Rochester    New York
What I want is to have the NULL values in cdr checked and replaced with the city, state values from towermap corresponding to each tower_id. The result should be:
>>> cdr
ego_id  alter_id  date      tower_id  city         state
123     456       20200101  98766     Los Angeles  California
789     143       20200105  09232     Rochester    New York
789     143       20200105  42106     NULL         NULL
Here's the raw SQL I executed through SQLAlchemy (I'm using Python). It seems SQLite can't do UPDATE with an INNER JOIN, so I wrote the following after reading the answers to a related question.
q = """
UPDATE cdr
SET city = (SELECT city FROM towermap WHERE tower_id = cdr.tower_id),
state=(SELECT state FROM towermap WHERE tower_id = cdr.tower_id)
WHERE (city IS NULL OR state IS NULL)
"""
engine = sqlalchemy.create_engine('sqlite:///my_file_loc.db')
conn = engine.connect()
conn.execute(q)
conn.close()
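For what it's worth, I've read that SQLite 3.33.0 and later support UPDATE ... FROM, which would express the join directly; a sketch, assuming a new enough build (I haven't verified mine):

-- Requires SQLite 3.33.0+ (UPDATE ... FROM support).
-- Only rows with a matching tower_id are touched, so rows whose
-- tower_id isn't in towermap simply keep their NULLs.
UPDATE cdr
SET city  = towermap.city,
    state = towermap.state
FROM towermap
WHERE towermap.tower_id = cdr.tower_id
  AND (cdr.city IS NULL OR cdr.state IS NULL);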
I expect ~35M rows to have NULL values that may or may not be replaced (depending on whether their tower_id exists in towermap). It's been 18 hours already and the db-journal is still growing. Memory doesn't seem to be a problem.
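I've also wondered whether committing in batches over rowid ranges would at least keep the journal bounded; a rough sketch using the stdlib sqlite3 module (the batch size is arbitrary, and it assumes cdr is an ordinary rowid table):

import sqlite3

conn = sqlite3.connect('my_file_loc.db')
BATCH = 1_000_000  # arbitrary batch size
(max_rowid,) = conn.execute("SELECT max(rowid) FROM cdr").fetchone()
for lo in range(1, max_rowid + 1, BATCH):
    conn.execute("""
        UPDATE cdr
        SET city  = (SELECT city  FROM towermap WHERE towermap.tower_id = cdr.tower_id),
            state = (SELECT state FROM towermap WHERE towermap.tower_id = cdr.tower_id)
        WHERE rowid BETWEEN ? AND ?
          AND (city IS NULL OR state IS NULL)
    """, (lo, lo + BATCH - 1))
    conn.commit()  # commit per batch so the rollback journal stays small
conn.close()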
Can my code be further optimized for speed? I could probably add something like
PRAGMA synchronous = OFF;
but I would rather avoid that if possible.
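Beyond that, the only structural idea I've had is to make sure the correlated subqueries aren't scanning all of towermap once per cdr row. If tower_id isn't already towermap's primary key, I assume an index would help (the index name here is my own invention):

-- Hypothetical index so each subquery does an index lookup
-- instead of a full scan of towermap.
CREATE INDEX IF NOT EXISTS idx_towermap_tower_id ON towermap(tower_id);
ANALYZE;  -- refresh the query planner's statistics

If tower_id is already towermap's primary key, lookups should be fast without it.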