
I'm trying to do something fairly simple, but either odo is broken or I don't understand how datashapes work in the context of this package.

The CSV file:

email,dob
tony@gmail.com,1982-07-13
blah@haha.com,1997-01-01
...

The code:

from odo import odo
import pandas as pd

df = pd.read_csv("...")
connection_str = "postgresql+psycopg2:// ... "

t = odo('path/to/data.csv', connection_str, dshape='var * {email: string, dob: datetime}')

The error:

AssertionError: datashape must be Record type, got 0 * {email: string, dob: datetime}

It's the same error if I try to go directly from a DataFrame -> Postgres as well:

t = odo(df, connection_str, dshape='var * {email: string, dob: datetime}')

A few other things that don't fix the problem: 1) removing the header line from the CSV file, 2) changing var to the actual number of rows in the DataFrame.

What am I doing wrong here?

lollercoaster
  • Have you tried `DataFrame.to_sql`? It seems like you're just trying to save a CSV into a Postgres table. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html – wkzhu Sep 18 '17 at 21:00
  • Yes, but it's really slow. `odo` is supposed to use Postgres's COPY internals to do it much more quickly: http://odo.pydata.org/en/latest/perf.html – lollercoaster Sep 18 '17 at 21:09
  • I'm not familiar with `odo`, but you can do fast loading yourself: https://stackoverflow.com/questions/41875817/write-fast-pandas-dataframe-to-postgres/ – Michael Sep 18 '17 at 22:35
  • No, ideally you want to copy from a file to Postgres directly. That way Postgres and the OS do all the real work, which is much faster. I'm loading hundreds of GB. I included the DataFrame example above only to demonstrate that the odo library wasn't working as intended. – lollercoaster Sep 18 '17 at 22:38
  • Do you need pandas in the first place? Loading a CSV straight into Postgres should be easy: https://stackoverflow.com/questions/2987433/how-to-import-csv-file-data-into-a-postgresql-table – wkzhu Sep 19 '17 at 15:15
  • Yes, I need to select particular columns. I also need to save disk space. – lollercoaster Sep 21 '17 at 22:21
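
The column-selection requirement from the comments can be met without loading the file into pandas. Below is a minimal sketch using only the standard library, assuming the CSV has a header row. `select_columns` is a hypothetical helper, not part of odo or pandas. Its output could then be streamed to Postgres with `COPY ... FROM STDIN`; that step is not shown here.

```python
import csv
import io


def select_columns(src, dst, columns):
    """Copy only `columns` from the CSV stream `src` to the CSV stream `dst`.

    Hypothetical helper for illustration. Rows are streamed one at a time,
    so the whole file never has to fit in memory.
    """
    reader = csv.DictReader(src)  # assumes the first row is a header
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in CSV header: {missing}")
    # extrasaction="ignore" silently drops every column not listed in `columns`.
    writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return dst


# Usage with in-memory streams; real code would pass open file objects.
source = io.StringIO("email,dob,extra\ntony@gmail.com,1982-07-13,x\n")
result = select_columns(source, io.StringIO(), ["email", "dob"])
print(result.getvalue())
```

Filtering before the load keeps the unwanted columns out of the database, which also addresses the disk-space concern.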

1 Answer


Does connection_str have a table name? That fixed it for me when I ran into a similar issue but with a sqlite database.

Should be something like:

connection_str = "postgresql+psycopg2://your_database_name::data"
t = odo(df, connection_str, dshape='var * {email: string, dob: datetime}')

where 'data', the part after '::' in 'connection_str', is the name of your new table.
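
To make the URI shape concrete: the connection string above has the form `<database-uri>::<table>`, and per this answer the AssertionError goes away once the `::<table>` part is present. The sketch below appends the table suffix before the string is handed to odo. `with_table` is a hypothetical helper and the example URI is a placeholder; neither comes from odo itself.

```python
def with_table(database_uri, table):
    """Return an odo-style URI that names a table: '<database_uri>::<table>'.

    Hypothetical helper for illustration only. Assumes the URI ends with a
    database name or file path (the segment after the last '/').
    """
    if not table:
        raise ValueError("a table name is required")
    # Look for '::' only in the last path segment, so '::' inside the host
    # part (for example an IPv6 address) is not mistaken for a table suffix.
    last_segment = database_uri.rsplit("/", 1)[-1]
    if "::" in last_segment:
        return database_uri  # the URI already names a table; leave it alone
    return f"{database_uri}::{table}"


# Placeholder URI; substitute real credentials, host and database name.
connection_str = with_table(
    "postgresql+psycopg2://user:password@localhost/mydb", "data"
)
print(connection_str)
```

The resulting string is what would be passed as the second argument to `odo(...)` in place of the bare database URI.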

See also:

python odo sql AssertionError: datashape must be Record type, got 0 * {...}

https://github.com/blaze/odo/issues/580

mbyim