4

I am trying to write from a pandas dataframe to AWS redshift:

df_tmp_rpt = pd.read_csv('path')
df_tmp_rpt = df_tmp_rpt[df_tmp_rpt['COL'] == 'VALUE']
df_tmp_rpt = df_tmp_rpt.replace(np.nan, null, regex=True)
records = df_tmp_rpt.to_records(index=False)
for record in records:
     script_insert = ScriptReader.get_script(SCRIPT_PATH).format(record)
     RedshiftDataManager.run_update(script_insert, DB_CONNECTION)

Redshift expects the format ('value1','value2',null) for inserting data. That is why i try to replace all NaN with null in the dataframe. How would I achieve such thing? (I need a null value not the string 'null')

Thanks for help in advance

PyPy2000
  • 57
  • 1
  • 1
  • 6
  • 2
    Does this answer your question? [Pandas dataframe fillna() only some columns in place](https://stackoverflow.com/questions/38134012/pandas-dataframe-fillna-only-some-columns-in-place) – sushanth May 27 '20 at 13:47
  • @Sushanth I saw the topic indeed but it does not. I need all the NaN values replaced with null, not with 'None' or '0' – PyPy2000 May 27 '20 at 13:51

2 Answers2

3

This is what worked for me.

df_tmp_rpt = df_tmp_rpt.where(df_tmp_rpt.notna(), None)

This will replace all the NaN values in your Dataframe to None. None is loaded as NULL in the Database. This works in MS SQL.

1

There is no null in Python. In AWS Redshift, a null is when a value is missing or unknown. Replacing NaN with an empty string might thus work. Consider using df_tmp_rpt.fillna(value=[None]) instead of using replace().

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas-dataframe-fillna

dkreeft
  • 642
  • 5
  • 17
  • Thanks for clarifying. That means I would have to find a different solution for my problem since the integer fields expect null instead of an empty string – PyPy2000 May 27 '20 at 14:28
  • I believe passing an empty string is interpreted by Redshift as null. – dkreeft May 27 '20 at 14:33
  • That is what I thought too. After that I get an error saying that '' cant be entered into an integer field. Error is: invalid input syntax for integer: "" – PyPy2000 May 27 '20 at 14:37
  • I see, Amazon's documentation is not consistent. [Here](https://docs.aws.amazon.com/redshift/latest/dg/r_Nulls.html) I read that indeed '' is not equal to null. I will update my answer with `None` instead of '', I believe that could work. – dkreeft May 27 '20 at 14:50
  • Excuse me I should have provided you with that piece of documentation. Where should I change the NaN values to None in my code? Running: df_tmp_rpt.replace(np.nan, None, inplace=True) returns the same thing as when I do not replace anything – PyPy2000 May 27 '20 at 14:57
  • Try turning `None` into a list, see: https://stackoverflow.com/questions/17097236/replace-invalid-values-with-none-in-pandas-dataframe – dkreeft May 27 '20 at 18:05
  • That does indeed change it to None. However, Redshift still does not recognize None as a valid datatype since it needs to be null – PyPy2000 May 28 '20 at 09:54