
I have a data-analysis script that I am putting together. The script connects to Teradata, runs a SELECT * against a table, and loads the result into a pandas DataFrame.

import teradata
import pandas as pd

# create the UdaExec context used to open the session below (appName/version are placeholders)
udaExec = teradata.UdaExec(appName="MyApp", version="1.0", logConsole=False)

with udaExec.connect(method="xxx", dsn="xxx", username="xxx", password="xxx") as session:

    query = "SELECT * FROM TableA"

    # read in all records from the table
    df = pd.read_sql(query, session)

    # misc pandas tests below...

This works great for tables with 100k records or fewer, but the problem is that many tables have far more records than that (millions and millions), and the script just seems to run indefinitely.

Is there some intermediate step I can take? I've been researching and have seen suggestions about first copying the DB table to a .csv or .txt file and then loading the pandas DataFrame from that (instead of loading from the table itself), but I can't make sense of it.

Any advice would be appreciated! Thanks.

JD2775
  • Do you need to access all the data at the same time? Could you read the database in batches (using criteria or a number of records)? Do you need the full data set to run your computations? – Dric512 Apr 04 '18 at 19:09
  • @Dric512 yes, the idea is to do analysis on the table as a whole. For example: overall count, count of null values in each column, min/max of columns with dtype int64, etc. All of this has been coded out and works fine with small tables so far. Maybe this isn't realistic with big datasets? Kind of new to this – JD2775 Apr 04 '18 at 19:12
  • What you could try is converting the data to a more efficient format, for example HDF5 (https://pandas.pydata.org/pandas-docs/stable/io.html). But even a more efficient format will still have limits, so you may need to work on partial data. Is pandas really the right solution here? Could you do some preprocessing in SQL directly? – Dric512 Apr 04 '18 at 19:17
  • Ah... that's not a bad idea. Maybe I can do some of this pre-processing in SQL, load the results to a temp table, and read from the temp table instead. I'd just need to figure out how to automate that within this script, but it seems doable. Thanks for the suggestion – JD2775 Apr 04 '18 at 19:20
  • It sounds like you are asking pandas to do a lot of things that are pretty straightforward to do with SQL in the database itself. Maybe you could just use pandas to retrieve lots of small result sets, e.g., `df = pd.read_sql('select count(*) from TableA', session)`. Alternatively, if you must retrieve a large table in its entirety, it will probably be faster to use a command (specific to your database software) to do a bulk copy to a local CSV file, then use pandas to read in the CSV. Let me know if you want example code. Pandas is incredibly quick at reading CSV files but slow for SQL. – Matthias Fripp Apr 04 '18 at 20:14
  • Thanks @MatthiasFripp. If you have any example code you'd be willing to share, that would be great. You are right that most of this I could really just do with SQL; I want to use Python/pandas because we are exploring it at my work recently, but maybe I can do some combination of the three instead, as you mentioned. In fact I really don't need to read in the entire dataset, if I can use variables for each result like the example you have above, then store the results of those variables in a new dataframe. Something like that (see the sketch after these comments)... – JD2775 Apr 04 '18 at 20:27
  • @JD2775: I just added an answer with my example code (for postgresql). – Matthias Fripp Apr 06 '18 at 01:55
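
Following up on those comments, here is a minimal sketch of the "push the aggregation into SQL" idea, assuming the same teradata connection setup as in the question. The column names (col1, col2), the app name, and the connection parameters are illustrative placeholders:

import teradata
import pandas as pd

# Placeholder application name/version for the UdaExec context
udaExec = teradata.UdaExec(appName="TableProfile", version="1.0", logConsole=False)

with udaExec.connect(method="xxx", dsn="xxx", username="xxx", password="xxx") as session:
    # Each query returns only a handful of values, so millions of rows never
    # have to cross the network or fit into memory.
    total_rows = pd.read_sql("SELECT COUNT(*) FROM TableA", session).iloc[0, 0]
    col1_nulls = pd.read_sql("SELECT COUNT(*) FROM TableA WHERE col1 IS NULL", session).iloc[0, 0]
    col2_stats = pd.read_sql("SELECT MIN(col2), MAX(col2) FROM TableA", session)

    # Collect the scalar results into a small summary dataframe
    summary = pd.DataFrame({
        "total_rows": [total_rows],
        "col1_nulls": [col1_nulls],
        "col2_min": [col2_stats.iloc[0, 0]],
        "col2_max": [col2_stats.iloc[0, 1]],
    })
    print(summary)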

2 Answers


In a comment I promised to provide some code that can quickly read a table from a server into a local CSV file and then read that CSV file into a pandas dataframe. Note that this code is written for PostgreSQL, but you could probably adapt it pretty easily for other databases.

Here is the code:

from io import StringIO  # on Python 2, use cStringIO.StringIO instead
import psycopg2
import psycopg2.sql as sql
import pandas as pd

database = 'my_db'
pg_host = 'my_postgres_server'
table = 'my_table'
# note: you should also create a ~/.pgpass file with the credentials needed to access
# this server, e.g., a line like "*:*:*:username:password" (if you only access one server)

con = psycopg2.connect(database=database, host=pg_host)
cur = con.cursor()    

# Copy data from the database to a dataframe, using psycopg2 .copy_expert() function.
csv = StringIO()  # or tempfile.SpooledTemporaryFile()
# The next line is the right way to insert a table name into a query, but it requires 
# psycopg2 >= 2.7. See here for more details: https://stackoverflow.com/q/13793399/3830997
copy_query = sql.SQL("COPY {} TO STDOUT WITH CSV HEADER").format(sql.Identifier(table))
cur.copy_expert(copy_query, csv)
csv.seek(0)  # move back to start of csv data
df = pd.read_csv(csv)

Here also is some code that writes large dataframes to the database via the CSV route:

csv = StringIO()
df.to_csv(csv, index=False, header=False)
csv.seek(0)
try:
    cur.copy_from(csv, table, sep=',', null='\\N', size=8192, columns=list(df.columns))
    con.commit()
except:
    con.rollback()
    raise

I tested this code over my 10 Mbps office network (don't ask!) with a 70,000 row table (5.3 MB as a CSV).

When reading a table from the database, I found that the code above was about 1/3 faster than pandas.read_sql() (5.5s vs. 8s). I'm not sure that would justify the extra complexity in most cases. This is probably about as fast as you can get -- PostgreSQL's COPY TO ... command is very fast, and so is pandas' read_csv().

When writing a dataframe to the database, I found that using a CSV file (the code above) was about 50x faster than using pandas' df.to_sql() (5.8s vs 288s). This is mainly because Pandas doesn't use multi-row inserts. This seems to have been a subject of active discussion for several years -- see https://github.com/pandas-dev/pandas/issues/8953 .

A couple of notes about chunksize: this may not do what most users expect. If you set chunksize in pandas.read_sql(), the query still runs as one command, but the results are returned to your program in batches; this is done with an iterator that yields each chunk in turn. If you use chunksize in pandas.to_sql(), it causes the inserts to be done in batches, reducing memory requirements. However, at least on my system, each batch is still broken down into individual insert statements for each row, and those take a long time to run.
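
As a small illustration of that read_sql behavior (a sketch only, reusing the con connection and my_table table from the code above):

import pandas as pd

# With chunksize set, read_sql returns an iterator of dataframes. The query is
# executed once; rows are then fetched from the open cursor in batches of 100000.
rows_seen = 0
for chunk in pd.read_sql("SELECT * FROM my_table", con, chunksize=100000):
    rows_seen += len(chunk)  # work on each chunk, then let it be garbage-collected
print(rows_seen)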

UPDATE: Pandas added a method argument to .to_sql() soon after I wrote this, which may do exactly what you want. Check the documentation for the 'multi' and callable options.
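
For reference, here is a sketch of that newer option (not part of the timings above; the engine URL, table name, and dataframe are placeholders, and it requires SQLAlchemy and pandas >= 0.24):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@my_postgres_server/my_db")  # placeholder connection URL
df = pd.DataFrame({"a": range(1000), "b": range(1000)})  # any dataframe you want to write

# method="multi" packs many rows into each INSERT statement;
# chunksize caps how many rows go into each statement.
df.to_sql("my_table", engine, if_exists="append", index=False, method="multi", chunksize=1000)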

Matthias Fripp

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html

It appears that the .read_sql() method has a chunksize argument. Have you tried something like df = pd.read_sql(query, session, chunksize=100000)? (I'm using a chunksize of 100k since you said 100k records weren't a problem.)

T. Kelly