so I run my mysql scripts through a python wrapper, the important parts are here though. My goal through this was to see which was faster, to make two separate select DISTINCT() calls and find the ones the ones in list a that are not in list b. I also did this using subqueries.
I found that the subqueries were much faster (as hoped) however, the results it yielded were wrong.
starttime1 = time.time()
LoLs.cursor.execute("SELECT DISTINCT(Id) FROM participants")
drop = cursor.fetchall()
drop = [x[0] for x in drop]
LoLs.cursor.execute("SELECT DISTINCT(Id) FROM history")
exist = cursor.fetchall()
exist = [x[0] for x in exist]
drop1 = list(set(drop)-set(exist))
endtime1 = time.time()
print len(drop1)
print endtime1 - starttime1
this results in:
545594
205.518273115
as expected.
However:
starttime2 = time.time()
cursor.execute("SELECT DISTINCT(Id) FROM participants where Id NOT IN (SELECT DISTINCT(Id) FROM history)")
drop2 = cursor.fetchall()
drop2 = [x[0] for x in drop2]
drop2 = list(set(drop2))
endtime2 = time.time()
print len(drop2)
print endtime2 - starttime2
results in:
427333
185.753734112
It's much faster, awesome, but it results in only 427333 entries instead of 545594.
as a note i did a set difference for the two end lists print len(list(set(drop1)-set(drop2)))
results in: 118261
so there are 118261 unique entries in the first list that are not in the second. I even took some of these and checked if they were in the history table manually, and they were, so the subquery should've caught them.
NOTE March 6th: changed:
cursor.execute("SELECT DISTINCT(Id) FROM participants where summonerId NOT IN (SELECT DISTINCT(Id) FROM history)")
to
cursor.execute("SELECT DISTINCT(Id) FROM participants where Id NOT IN (SELECT DISTINCT(Id) FROM history)")
to accurately reflect the real problem.