
I have already referred to this post, this post, and this post, so please don't mark this as a duplicate.

I have raw data in a pandas dataframe, in a column called temp_id, as shown below.

The column is of dtype float64 because of the NAs, and it looks like this in a Jupyter notebook:

temp_id
55608.0
55609.0
NaN        
55610.0
NaN        
55611.0

In the CSV file, the same column looks like this:

temp_id
55608
55609
        #empty row indicating NA        
55610
        #empty row indicating NA
55611

I am trying to copy this data into a PostgreSQL table with the table definition below. Please note that temp_id is not a primary key and can have empty rows.

CREATE TABLE temp(
  temp_id integer
);

When I try to copy the data, I get the following error:

ERROR:  invalid input syntax for integer: "55608.0"
CONTEXT:  COPY temp, line 2, column temp_id: "55608.0"

How can I avoid this and insert the data into an integer column in the PostgreSQL table? Below are the miscellaneous options I set in pgAdmin during the CSV import:

[screenshot of pgAdmin CSV import options]

The Great
  • Have you tried casting the column to integer via `.astype(int)` before inserting? The error seems to indicate there's an issue with the decimal point in the floats. – Lukas Thaler Jul 15 '21 at 11:11
  • If I try to cast it in `pandas`, it throws an error because the column has `na` values, so I cannot convert to `int`. – The Great Jul 15 '21 at 11:13
  • DataFrame columns can mix types, but the dtype will be read as "object". Replace None values with the string "NULL", which SQL databases use in place of NaN. Then filter for your numerical values and cast them to int using `astype`. – Florian Fasmeyer Jul 15 '21 at 11:15
  • @LukasThaler - I get this error `ValueError: Cannot convert non-finite values (NA or inf) to integer` – The Great Jul 15 '21 at 11:15
  • @FlorianFasmeyer - but in the CSV display, it all looks like integers already. How come the error message shows a float value? – The Great Jul 15 '21 at 11:16
  • Are you loading the data directly from the csv using Postgres' COPY functionality or from the dataframe through Python? If it's the latter, the dtype and formatting from the df is fed to the database, and that contains the decimal point – Lukas Thaler Jul 15 '21 at 11:19
  • Because either you or another process may be using numpy or pandas. `NaN` is a value in float, but int values do not have `NaN`. Said another way: trying to build a column with `1, 2, 3.1`, the computer will assume the entire column is a float. Since `NaN` only exists in floats, the whole column becomes one. – Florian Fasmeyer Jul 15 '21 at 11:20

2 Answers


The column you are trying to insert contains NaN (or None). One thing worth knowing is that the floating-point type has a special NaN value, but integer types do not. So when reading the CSV file, pandas assumes the whole column is made of floating-point numbers.

a = [1, 2, 3.01] # Will be float when read by Pandas.
b = [1, 2, None] # Will be float when read by Pandas.
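
A minimal, runnable check of that behaviour (assuming a recent pandas version; the dtypes shown in the comments are what pandas reports):

import pandas as pd

# An all-integer column stays integer...
print(pd.Series([1, 2, 3]).dtype)                    # int64

# ...but a single None/NaN forces the whole column to float.
print(pd.Series([1, 2, None]).dtype)                 # float64

# Pandas' nullable integer extension type can hold NA without going float.
print(pd.Series([1, 2, None], dtype='Int64').dtype)  # Int64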

Solutions

  1. Remove the rows with NaN and cast to int:
import pandas as pd

df = pd.DataFrame(dict(col=[1, 2, 3, 4, None]))
df = df.dropna()
df = df.astype(int)
  2. Some SQL databases use "NULL" to represent NaN, but it must be sent as a string. In the database the column will be of type int, but it must be set to nullable. (A quick check of both approaches follows the code below.)
import pandas as pd

df = pd.DataFrame(dict(col=[1, 2, 3, 4, None]))
# Pandas accepts mixed-type columns; the column dtype will be "object".
df = df.fillna('NULL')
df = df.astype(str)
# Strip the trailing ".0" that the float-to-string conversion leaves behind.
df['col'] = df['col'].str.replace(r'\.0$', '', regex=True)
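
As a quick sanity check, here is what each approach yields (a minimal sketch; expected output shown in comments):

import pandas as pd

df = pd.DataFrame(dict(col=[1, 2, 3, 4, None]))

# Approach 1: drop the NaN rows, then cast to int.
print(df.dropna().astype(int)['col'].tolist())  # [1, 2, 3, 4]

# Approach 2: stringify, with NaN replaced by the literal "NULL".
s = df.fillna('NULL').astype(str)['col'].str.replace(r'\.0$', '', regex=True)
print(s.tolist())  # ['1', '2', '3', '4', 'NULL']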
Florian Fasmeyer

The answer is similar to what @Lukas Thaler posted, but instead of `astype(int)` I had to use `astype('Int64')`:

df['temp_id'] = df['temp_id'].astype('Int64')

This worked well to convert the column with NAs to the nullable `Int64` type, and I was able to upload successfully.

So it's not `int` but pandas' nullable `Int64` (note the capital "I").
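
For completeness, here is a minimal sketch of the whole round trip (the file names raw.csv and clean.csv are hypothetical): read the CSV straight into the nullable dtype, then write it back out with empty fields for NA, which COPY treats as NULL in CSV format.

import pandas as pd

# 'Int64' (capital I) is pandas' nullable integer dtype, so temp_id
# keeps its NAs without being promoted to float.
df = pd.read_csv('raw.csv', dtype={'temp_id': 'Int64'})

# NA values are written as empty fields, which Postgres maps to NULL.
df.to_csv('clean.csv', index=False, na_rep='')

The resulting file can then be loaded with COPY temp FROM ... WITH (FORMAT csv, HEADER true), or through pgAdmin's CSV import with an empty NULL string.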

The Great