Using odo to load CSV -> Postgres on AWS

Question

I'm trying to do something fairly simple, but either odo is broken or I don't understand how datashapes work in the context of this package.

The CSV file:

email,dob
tony@gmail.com,1982-07-13
blah@haha.com,1997-01-01
...

The code:

from odo import odo
import pandas as pd

df = pd.read_csv("...")
connection_str = "postgresql+psycopg2:// ... "

t = odo('path/to/data.csv', connection_str, dshape='var * {email: string, dob: datetime}')

The error:

AssertionError: datashape must be Record type, got 0 * {email: string, dob: datetime}

It's the same error if I try to go directly from a DataFrame -> Postgres as well:

t = odo(df, connection_str, dshape='var * {email: string, dob: datetime}')

A few other things that don't fix the problem: 1) removing the header line from the CSV file, 2) changing var to the actual number of rows in the DataFrame.

What am I doing wrong here?

Have you tried DataFrame.to_sql? It seems like you're just trying to save a CSV into a Postgres table. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html — wkzhu, Sep 18 '17 at 21:00
yes, it's just really slow. `odo` is supposed to use postgres's copy internals to do it much, much more quickly: http://odo.pydata.org/en/latest/perf.html — lollercoaster, Sep 18 '17 at 21:09
I'm not familiar with `odo` but you can do fast loading yourself https://stackoverflow.com/questions/41875817/write-fast-pandas-dataframe-to-postgres/ — Michael, Sep 18 '17 at 22:35
No, ideally you want to copy from a file to Postgres directly. That way Postgres and the OS do all the real work, which is much faster. I'm loading hundreds of GB. I included the in-memory DataFrame -> Postgres example above only to demonstrate that the odo library wasn't working as intended. — lollercoaster, Sep 18 '17 at 22:38
do you need pandas in the first place - csv straight to postgres should be easy https://stackoverflow.com/questions/2987433/how-to-import-csv-file-data-into-a-postgresql-table — wkzhu, Sep 19 '17 at 15:15
Yes I need to select particular columns. Also need to save disk space. — lollercoaster, Sep 21 '17 at 22:21

score 1 · Accepted Answer · answered Oct 03 '17 at 23:36

Does connection_str include a table name? Adding one fixed it for me when I ran into a similar issue with a SQLite database.

Should be something like:

connection_str = "postgresql+psycopg2://your_database_name::data"
t = odo(df, connection_str, dshape='var * {email: string, dob: datetime}')

where 'data', the part after '::' in connection_str, is the name of your new table.
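A minimal sketch of building such a target string, in case it helps. The helper name odo_table_uri and the user, password, host, and port values are hypothetical placeholders, not part of odo or of the question. My understanding, which I have not verified against the odo source, is that a URI without the '::table' suffix is treated as a whole database rather than a single table, which would explain why odo asks for a Record datashape.

```python
def odo_table_uri(base_uri, table):
    """Return base_uri with a '::table' suffix, adding it only if absent.

    Hypothetical helper for illustration; odo itself only needs the final string.
    """
    # Look only at the segment after the last '/', i.e. the database name part.
    last_segment = base_uri.rsplit("/", 1)[-1]
    if "::" in last_segment:
        return base_uri  # a table name is already present
    return "{}::{}".format(base_uri, table)


# Placeholder credentials and host; substitute your own.
connection_str = odo_table_uri(
    "postgresql+psycopg2://user:password@host:5432/your_database_name",
    "data",
)
print(connection_str)

# Assuming odo is installed and the database is reachable, the call from
# the question would then become:
# from odo import odo
# t = odo('path/to/data.csv', connection_str,
#         dshape='var * {email: string, dob: datetime}')
```

The helper leaves a string that already names a table unchanged, so it is safe to apply to either form.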
