0

I have scraped some baseball data from the web using Python and have saved it in a Pandas dataframe. One of the columns contains the numeric player identifying code, but it often null. Because Pandas integer columns cannot contain NaNs, this column has been converted to float and saved as such when I export to the data to csv.

This causes problems when I try and run a bulk insert into the MS SQL Server database as the relevant column is defined to be of type 'INT' (and also contains a foreign key constraint to another table). The following error message is raised:

Bulk load data conversion error (type mismatch or invalid character for the specified codepage)

I tried the suggestion provided here, but it seems changing the format of the float does not solve the problem and SQL Server still throws the same error.

I am aware that packages such as pyodbc provide an alternate way of transferring data to a database, but my dataframe is quite large (approx 6M rows, 70 cols) and I have found the process too slow, at least compared to saving to csv and importing.

How can I best solve this problem? Ultimately, I'm not wedded to saving to csv and using bulk insert in SQL Server, but I do want a method that is similarly fast.

user3725021
  • 566
  • 3
  • 14
  • 32
  • have you considered bulk inserting data directly from python via `pyodbc` - https://stackoverflow.com/a/29649340/4098013 ? – Vivek Kalyanarangan Jan 29 '18 at 05:14
  • I hadn't considered it... But will it solve the problem? I need to insert integers from a column that has missing values in pandas... Given pandas can't store this field as in integer, surely when I insert using pyodbc, I will run into the same problem? – user3725021 Jan 29 '18 at 06:24
  • 1
    You can get cheeky by imputing missing values with something like `-9999999` (a value you will never see in the data) and then try an update on that – Vivek Kalyanarangan Jan 29 '18 at 06:45

1 Answers1

1
pd.fillna(-1)

This will fill all blank values to a desired value (such as a negative value). See the documentation for more parameters.

Rookie
  • 305
  • 2
  • 11
  • This works (and solves my problem conveniently). However the fact that missing values are not allowed in integer columns of a pandas dataframe feels like flaw not a feature. Surprised this hasn't been changed. – user3725021 Jan 30 '18 at 03:58