When converting one of my CSV files to a pd.DataFrame using the Python odo module, I receive a TypeError:

    TypeError: Cannot cast array from dtype('float64') to dtype('int64') 
               according to the rule 'safe'
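
For context, the message refers to NumPy's casting rules. The sketch below uses made-up values rather than the data from the table, and only illustrates why a float64 to int64 conversion is refused under the 'safe' rule; it does not reproduce the odo call itself.

```python
import numpy as np

# Made-up values, not taken from the original table
floats = np.array([1.0, 2.5])

# float64 -> int64 may lose the fractional part, so 'safe' does not allow it
safe_ok = np.can_cast(np.float64, np.int64, casting="safe")
unsafe_ok = np.can_cast(np.float64, np.int64, casting="unsafe")

# int64 -> float64 is treated as safe by NumPy
reverse_ok = np.can_cast(np.int64, np.float64, casting="safe")

try:
    floats.astype(np.int64, casting="safe")
    raised = False
except TypeError as exc:
    raised = True
    print(exc)  # same kind of message as in the traceback above

print(safe_ok, unsafe_ok, reverse_ok)
```

In other words, something is requesting an int64 column while the parsed values are float64.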

The same code works for other CSV files:

# csv table file name
csvNm = 'table.csv'

# convert mysql table to csv
odo_csv = odo(tstConn.connect_string + '::' + tbl , csvNm)

# convert csv to pandas 
odo_df = odo(odo_csv , pd.DataFrame)

Here is what I tried so far to no avail:

import pandas as pd
from odo import odo, resource, discover, convert

odo_csv=odo(tstConn.connect_string + '::' + tbl , csvNm)
csv=resource(csvNm)
ds=discover(csv)

# Convert csv to pandas
odo_df = odo(odo_csv , pd.DataFrame, dshape=ds) 

and this:

odo_df = odo(odo_csv , pd.DataFrame, casting='unsafe')

Update 1: It looks like I neglected the most obvious hint in the traceback:

pandas\parser.pyx in pandas.parser.TextReader._convert_tokens (pandas\parser.c:11816)()

This led me to a Stack Overflow question about encoding issues on Windows. But neither this:

odo_df = odo(odo_csv , pd.DataFrame, encoding=odo_csv.encoding)

nor this works:

odo_df = odo(odo_csv , pd.DataFrame, encoding='cp1252') 

The following approach, taken from pandas-reading-csv-files (same link as above), is inelegant for my use case:

# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)

I'm not sure what to try next; any help would be appreciated.

Itay Livni

1 Answer

The solution that works is:

import pandas as pd
from odo import odo, resource, discover, convert

# convert mysql to csv
odo_csv=odo(raw_dbConn.connect_string + '::' + tblName , csvNm, header=True)

# Get odo resource aka sqlalchemy.Table instance
resc=resource(raw_dbConn.connect_string + '::' + tblName )

# Discover the resc
ds=discover(resc)

# Convert csv to dataframe    
odo_df = odo(odo_csv , pd.DataFrame, dshape=ds ,encoding=odo_csv.encoding)
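
For readers without odo, a rough pandas-only analogue of supplying an explicit schema is the `dtype` argument of `pd.read_csv`. The sketch below is an illustration under assumptions: the column names and data are hypothetical, and whether missing values were what made this particular CSV fail is not known. It only shows how a schema that disagrees with the parsed values is rejected, while one that fits them parses cleanly.

```python
import io

import pandas as pd

# Hypothetical data: the second row has no value in the "qty" column
csv_text = "id,qty\n1,10\n2,\n3,30\n"

# Left to itself, pandas infers float64 for "qty" so the gap can become NaN
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Forcing a plain int64 schema onto that column is rejected
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"qty": "int64"})
    forced_ok = True
except (ValueError, TypeError) as exc:
    forced_ok = False
    print(type(exc).__name__, exc)

# A schema that allows missing values parses without complaint
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int64", "qty": "float64"},
)
print(typed)
```

Taking the schema from the database table, as the odo code above does with `discover`, serves the same purpose of not relying on types guessed from the CSV.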
Itay Livni
  • Funny thing is that `with open('/tmp/test.csv', 'r', encoding='cp1252') as f: df = pd.read_csv(f)` is about a minute faster for a df with a shape of (114,690,970, 5) – Itay Livni Jul 09 '16 at 06:22