0

I have an sqlite database table of ~529M rows. I chose sqlite because there won't be many writes to the db (just mainly reads) and I wanted the simplicity of having it in a single file. Unfortunately, I made a mistake in generating the database: now I have to change some NULL values in two columns via an inner join to another table.

Table formats:

>>> cdr
ego_id   alter_id   date        tower_id   city         state
123      456        20200101    98766      Los Angeles  California
789      143        20200105    09232      NULL         NULL
789      143        20200105    42106      NULL         NULL


>>> towermap
tower_id     city        state
98766        Los Angeles California
09232        Rochester   New York

what I want is to have the NULL values checked in cdr and replace them with the city,state values in towermap corresponding to tower_id. The result should be:

>>> cdr
ego_id   alter_id   date        tower_id   city         state
123      456        20200101    98766      Los Angeles  California
789      143        20200105    09232      Rochester    New York
789      143        20200105    42106      NULL         NULL

Here's my raw SQL that I executed using sqlalchemy (I'm using Python). It seems sqlite can't do UPDATE with INNER JOINs, so I made the following code after reading the answers in a related question.

q = """
UPDATE cdr                                                                                                                      
SET city = (SELECT city FROM towermap WHERE tower_id = cdr.tower_id),
      state=(SELECT state FROM towermap WHERE tower_id = cdr.tower_id)
WHERE (city IS NULL OR state IS NULL)
"""

engine = sqlalchemy.create_engine('sqlite:///my_file_loc.db')
conn = engine.connect()
conn.execute(q)
conn.close()

I expect ~35M rows to have NULL values that may or may not be replaced (depending on whether their tower_id exists in towermap). It's been 18 hours already and the db-journal is still getting bigger. Memory doesn't seem to be a problem.

Can my code be further optimized for speed? I could probably add something like PRAGMA synchronize = OFF; but I would avoid that if possible.

irene
  • 2,085
  • 1
  • 22
  • 36

1 Answers1

1

You can update both columns with a single subquery using row value notation, reducing the amount of work that has to be done on each matching row by half:

UPDATE cdr                                                                                                                      
SET (city, state) = (SELECT city, state FROM towermap WHERE tower_id = cdr.tower_id)
WHERE (city IS NULL OR state IS NULL)

towermap.tower_id should have an index if it doesn't already (Like it would if it was the primary key. Full table definitions are always better than just a list of column names).

Shawn
  • 47,241
  • 3
  • 26
  • 60
  • I'm getting a syntax error with this, which is weird because I checked the link above and it is what is recommended. It seems the error is somewhere in the `SET (city,state)` part. – irene Feb 07 '20 at 12:40
  • `UPDATE cdr SET (city, state) = (SELECT city, state FROM towermap WHERE tower_id = cdr.tower_id) WHERE (city IS NULL OR state IS NULL)` Traceback (most recent call last): File "update_null_loc_db.py", line 125, in main(*sys.argv) File "update_null_loc_db.py", line 117, in main c.execute(update_query) sqlite3.OperationalError: near "(": syntax error – irene Feb 07 '20 at 12:43
  • 1
    @irene check your SQLite version. Row Values are supported since version 3.15.0 – forpas Feb 07 '20 at 14:17
  • I see, thank you. I've updated my SQLite version and I'm now building the indices. I'll keep you posted once I've finished running the code with the row values! – irene Feb 09 '20 at 10:18
  • Finally ran it with the indices, and it took around 20 minutes. Thanks a lot! – irene Feb 09 '20 at 16:14