
I have a pandas (v1.3.4) dataframe, df_final, which I am exporting to a CSV file so that I don't need to repeat the dataframe-building step every time I do an analysis. df_final will eventually be a 13000 x 91 dataframe, but I am testing the process on a smaller 689 x 91 dataframe first.

I would like to confirm that df_final_csv, the dataframe generated by reading the CSV back in, is the same as df_final. Based on the code below, they appear to differ, but I'm not sure how. I copied some Stack Overflow code (below, adapted from here), but some other solutions (e.g.) don't work because I have list objects in df_final. How can I find which value(s) are causing the issue?

If any other information would help please let me know.

import pandas as pd

# 689 rows x 91 columns; `results` is the dict built earlier (not shown)
df_final = pd.DataFrame.from_dict(results)
print(f'NaN are present: {df_final.isnull().values.any()}')  # False

# export to csv
df_final.to_csv('integrated_df.csv')

# read in csv
df_final_csv = pd.read_csv('integrated_df.csv', index_col=0)
print(f'NaN are present: {df_final_csv.isnull().values.any()}')  # False
print(f'imported df is same as exported df: {df_final.equals(df_final_csv)}')  # False

# try to find discrepancies (--> empty df with only column headers)
different_values = df_final_csv[~df_final_csv.isin(df_final)].dropna()

Cheers!

Tim Kirkwood
  • `pd.read_csv` may not restore the same data types. So your integers and floats may still appear as strings. You'd have to call `.astype` and set the types for `df_final_csv` before doing any comparisons. Also, you may want to consider using `pickle`, which would preserve the types. [Here's](https://stackoverflow.com/a/62222676/1520594) an answer that can help you decide if pickle is appropriate. – algrebe Nov 07 '21 at 04:16
  • Maybe there are some special characters which CSV messed up. Try writing to a .pkl file; you'll get 100% the same data. `import pickle; pickle.dump(df, open("df.pkl", 'wb')); # then read it ; df_new = pickle.load(open("df.pkl", 'rb'))` – Amir saleem Nov 07 '21 at 04:20
  • Hi both, thanks for your replies. Both were useful and algrebe your link was really relevant but Amir's is the code I used to fix the actual problem, so @Amir saleem if you make your comment an answer I'll accept it (as I'm pretty sure I can't accept both). Thanks again to you both! – Tim Kirkwood Nov 07 '21 at 07:14
  • Thank you @TimKirkwood, I posted it as an answer – Amir saleem Nov 07 '21 at 08:36
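
The data-type point raised in the comments can be checked directly. The following is only a minimal sketch: the two-column dataframe is a hypothetical stand-in for df_final (its column names `score` and `tags` are made up for illustration), and the CSV round trip goes through an in-memory buffer rather than a file. It compares the dataframes column by column and prints the Python type of the first cell in each column that no longer matches.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_final: one numeric column, one column of lists.
df_final = pd.DataFrame({"score": [1.5, 2.5], "tags": [["a", "b"], ["c"]]})

# Round-trip through CSV in memory (no file needed for the illustration).
buffer = io.StringIO()
df_final.to_csv(buffer)
buffer.seek(0)
df_final_csv = pd.read_csv(buffer, index_col=0)

# Compare each column separately to see which ones no longer match.
mismatched = [
    col for col in df_final.columns
    if not df_final[col].equals(df_final_csv[col])
]
print(mismatched)

# Inspect the Python type of the first cell in each mismatched column.
for col in mismatched:
    print(col, type(df_final[col].iloc[0]), type(df_final_csv[col].iloc[0]))
```

In this toy case only the list column is reported, because CSV stores each list as its text representation and `pd.read_csv` reads it back as a string.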

1 Answer


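Maybe there are some special characters that the CSV round trip altered. Try writing to a .pkl file instead: pickle stores the Python objects themselves (including lists) along with their types, so you should get exactly the same data back.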

import pickle

# write the dataframe to a pickle file (the with-block closes the file)
with open("df.pkl", "wb") as f:
    pickle.dump(df, f)

# then read it back
with open("df.pkl", "rb") as f:
    df_new = pickle.load(f)
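
As a possible variation, pandas exposes the same mechanism through `DataFrame.to_pickle` and `pd.read_pickle`. This is a minimal sketch under stated assumptions: the small dataframe is a hypothetical stand-in for the real one, and the file is written to a temporary directory purely to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataframe, including a column of lists.
df = pd.DataFrame({"score": [1.5, 2.5], "tags": [["a", "b"], ["c"]]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df.pkl")
    df.to_pickle(path)             # write the dataframe as a pickle file
    df_new = pd.read_pickle(path)  # read it back

# Check that the reloaded dataframe matches and that lists are still lists.
same = df.equals(df_new)
first_tags = df_new["tags"].iloc[0]
print(same, type(first_tags))
```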
Amir saleem