I am trying to run the following query on a PostgreSQL table of about 2.66 billion rows. I am using Python with psycopg2 (but I can use something else if it would make my life easier), and the goal is to eventually convert the results into a CSV file:
import csv
from psycopg2 import sql

year = 2013

# cur is a cursor on an already open psycopg2 connection
stmt = sql.SQL(f"""SELECT * FROM skill TABLESAMPLE SYSTEM (1) WHERE skillclusterfamily='Science and Research' AND DATE_PART('year',jobdate)={year}""")
cur.execute(stmt)
res = cur.fetchall()
print(res)
Keep in mind that a 1% sample is about 26.6 million rows on its own, and this single query takes about 2 hours on my system, which is simply too long. I don't have access to an NVIDIA GPU; all I want is to query this data and put it in a CSV file so I can use it elsewhere. My ultimate goal is to loop over the years 2007-2021 and store the results in one CSV file, using randomly sampled data from the table, since 26.6 million rows per year should be enough (hence the 1% sample). Is there any way to make this kind of query faster? I cannot install extensions.
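For context, the rough shape of the loop I have in mind is something like this (just a sketch; cur is a cursor on an already open psycopg2 connection, and the output filename is a placeholder):

import csv
from psycopg2 import sql

# Sketch only: cur is assumed to be a cursor on an open psycopg2 connection,
# and the output filename is just a placeholder.
with open("science_research_sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for year in range(2007, 2022):
        stmt = sql.SQL(
            "SELECT * FROM skill TABLESAMPLE SYSTEM (1) "
            "WHERE skillclusterfamily = 'Science and Research' "
            "AND DATE_PART('year', jobdate) = {year}"
        ).format(year=sql.Literal(year))
        cur.execute(stmt)
        # fetch in chunks so an entire year's sample never sits in memory at once
        while True:
            rows = cur.fetchmany(100000)
            if not rows:
                break
            writer.writerows(rows)

I am using fetchmany there so the full sample never has to fit in memory, but the per-year query itself is the bottleneck, which is why I am asking about the query.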
I tried both SYSTEM and BERNOULLI sampling; neither was any faster, to be honest.