
I am trying to run the following query on a PostgreSQL table of about 2.66 billion rows. I am using Python with psycopg2 (but I can use something else if it would make my life easier) and eventually want to convert the results into a CSV file:

```python
import csv
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=mydb")  # connection details are placeholders
cur = conn.cursor()

year = 2013
stmt = sql.SQL(f"""SELECT * FROM skill TABLESAMPLE SYSTEM (1)
                   WHERE skillclusterfamily='Science and Research'
                   AND DATE_PART('year', jobdate)={year}""")
cur.execute(stmt)
res = cur.fetchall()
print(res)
```

Keep in mind that a 1% sample alone is about 26.6 million rows, and this single query takes about 2 hours on my system, which is simply too long. I don't have access to an NVIDIA GPU; all I want is to query this data and put it into a CSV file so I can use it elsewhere. My ultimate goal is to write one CSV file covering the years 2007 to 2021 using a for loop, filled with randomly sampled data from the table; 26.6 million rows per year should be enough, which is why I am only sampling 1%. Is there any way to make this kind of query faster? I cannot install extensions.
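
Roughly, the full loop I have in mind looks like this (an untested sketch; the connection string and output file name are placeholders):

```python
import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()

with open("skills_sample_2007_2021.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # One sampled query per year, all appended to the same CSV file.
    for year in range(2007, 2022):
        cur.execute(f"""SELECT * FROM skill TABLESAMPLE SYSTEM (1)
                        WHERE skillclusterfamily='Science and Research'
                        AND DATE_PART('year', jobdate)={year}""")
        writer.writerows(cur.fetchall())
```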

I tried both SYSTEM and BERNOULLI sampling; neither was any faster, to be honest.

Cr3jus
  • Does this answer your question? [How to use server side cursors with psycopg2](https://stackoverflow.com/questions/41444890/how-to-use-server-side-cursors-with-psycopg2) – Mike Organek Jul 12 '23 at 02:55
  • You could use COPY for creating the CSV; that is much faster (a sketch follows these comments). If the query itself is slow, we need the results of EXPLAIN (ANALYZE, VERBOSE, BUFFERS) for this statement and the DDL for the tables and indexes involved (in plain text, as an update to your question). – Frank Heikens Jul 12 '23 at 05:44
  • General `psycopg2` notes: 1) Do not use `f`-strings; use [Parameter Passing](https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries). 2) `sql.SQL` (the correct name; there is no `sql.SQLI`) is not needed for this case, as you are dynamically changing a parameter, not the query statement. – Adrian Klaver Jul 12 '23 at 15:25
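
Following up on the COPY suggestion above, a minimal sketch with `psycopg2`'s `copy_expert` (the connection string and file name are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()

query = """SELECT * FROM skill TABLESAMPLE SYSTEM (1)
           WHERE skillclusterfamily = 'Science and Research'
           AND DATE_PART('year', jobdate) = 2013"""

with open("skills_2013.csv", "w", newline="") as f:
    # COPY ... TO STDOUT has the server emit CSV directly, skipping
    # per-row Python tuple construction entirely.
    cur.copy_expert(f"COPY ({query}) TO STDOUT WITH (FORMAT csv, HEADER)", f)
```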

2 Answers


Using TABLESAMPLE is what makes the query slow. Instead, create an index that supports the WHERE condition and use LIMIT to restrict the number of result rows.
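
A sketch of what that could look like (assuming `jobdate` is a `date` or `timestamp` column; the `DATE_PART` filter is rewritten as a range so the index can be used, and the index name and LIMIT value are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()

# One-time setup; building the index on a 2.66 billion row table
# will itself take a while, but only once.
cur.execute("""CREATE INDEX IF NOT EXISTS skill_family_jobdate_idx
               ON skill (skillclusterfamily, jobdate)""")
conn.commit()

cur.execute("""SELECT * FROM skill
               WHERE skillclusterfamily = %s
                 AND jobdate >= %s AND jobdate < %s
               LIMIT 1000000""",  # placeholder row cap
            ("Science and Research", "2013-01-01", "2014-01-01"))
rows = cur.fetchall()
```

Note that LIMIT returns whichever qualifying rows the scan finds first, not a statistically random sample.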

Laurenz Albe

I think, to make your query run faster: first, don't use `SELECT *`; select only the columns you need. Second, count how many rows are in the table, then divide the data into ranges and use multiple threads to read it in parallel. Third, create an index and partitioning on the columns in your query's condition, in your case skillclusterfamily and the year of jobdate. Example: with 2 billion rows in a table, I would use 5 threads to read the data; thread 1 reads rows 0 to 400,000,000, thread 2 reads rows 400,000,001 to 800,000,000, and so on (a sketch follows).
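
A rough sketch of that range-splitting idea (assuming the table has a roughly contiguous integer primary key `id`, which the question does not mention; the connection string and file names are placeholders):

```python
import csv
import threading
import psycopg2

TOTAL_ROWS = 2_000_000_000   # from a prior SELECT count(*)
N_THREADS = 5
CHUNK = TOTAL_ROWS // N_THREADS

def read_range(thread_no, lo, hi):
    # One connection per thread: a single psycopg2 connection must not
    # run concurrent queries from several threads.
    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    # A named (server-side) cursor streams rows in batches instead of
    # loading the whole 400 million row range into memory at once.
    cur = conn.cursor(name=f"range_{thread_no}")
    cur.execute("""SELECT * FROM skill
                   WHERE id >= %s AND id < %s
                     AND skillclusterfamily = %s""",
                (lo, hi, "Science and Research"))
    with open(f"part_{thread_no}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for row in cur:
            writer.writerow(row)
    conn.close()

threads = [threading.Thread(target=read_range,
                            args=(i, i * CHUNK + 1, (i + 1) * CHUNK + 1))
           for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```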

user2905416