
I imported a 14 GB .csv file from Google Drive into Google Colab and used pandas to sort it and to delete some columns and rows.

After deleting about a third of the rows and about half the columns of data, df_edited.shape shows:

(27219355, 7)

To save the file, the best method I've been able to find is:

from google.colab import files

df_edited.to_csv('edited.csv')
files.download('edited.csv')

When I run this, after a long time (if it doesn't crash, which happens about one time in two), it opens a dialog box to save the file locally.

I accept and let it save. However, the original 14 GB .csv file, which my edits should have cut to roughly 7 GB, comes out as a .csv file of only about 100 MB.

When I open the file locally, it launches Excel and I see only about 358,000 observations instead of the roughly 27 million there should be. I know Excel only shows a limited number of rows, but the fact that the csv file has shrunk to about 100 MB suggests a lot of data was lost in the download process.
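One way to check how many rows actually arrived, without relying on Excel (whose worksheets stop at 1,048,576 rows), is to stream the file and count records. This is only a minimal sketch: `count_csv_rows` is a hypothetical helper name, UTF-8 encoding is assumed, and the file is assumed to start with a header row.

```python
import csv

def count_csv_rows(path, has_header=True):
    """Count data rows in a CSV file by streaming it, without loading it into memory."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row if there is one
        return sum(1 for _ in reader)
```

Calling `count_csv_rows("edited.csv")` on the downloaded copy (assuming it sits in the current directory) and comparing the result with the first number in `df_edited.shape` would show whether rows went missing.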

Is there anything about the code above that would cause all this data to get lost?

Or what else could be causing it?

Thanks for any suggestions.

zztop
  • Did you check the DataFrame's size just before writing it to CSV and downloading it? It seems you can find out how big the DataFrame is; take a look [here](https://stackoverflow.com/q/18089667/11301900). – AMC Jan 12 '20 at 23:01
  • How do you do that with a csv file? df_edited is the expected size. However, when I try `sys.getsizeof("edited.csv")` it says literally 9, and `sys.getsizeof(edited.csv)` says edited is not defined. This suggests to me that "edited.csv" exists but is essentially empty. If so, is the problem with `df_edited.to_csv('edited.csv')`? – zztop Jan 13 '20 at 02:54
  • That link is about getting the memory usage of a DataFrame, that's all. `sys.getsizeof("edited.csv")` is measuring the memory usage of the string "edited.csv"; `sys.getsizeof(edited.csv)` is measuring the memory usage of the (undefined) variable `edited.csv`. – AMC Jan 13 '20 at 03:09
  • You're right. I tried a couple of other files and they were all similar. I don't know how to get the size of a csv file in Colab. – zztop Jan 13 '20 at 03:11
  • _I don't know how to get the size of a csv file in Colab_ You can probably do it the same way you would do it anywhere in Python. – AMC Jan 13 '20 at 03:15
  • Got it using `import os` and `b = os.path.getsize("drive/My Drive/data/edited.csv")`, and it's the expected size. So the issue is with the download code. – zztop Jan 13 '20 at 03:38
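To make the distinction in the comments above concrete: `sys.getsizeof` reports the memory footprint of a Python object (here, the string holding the file name), while `os.path.getsize` reports the size of the file on disk. The sketch below also adds a streamed SHA-256 so that the copy in Colab and the downloaded copy could be compared byte for byte. `file_sha256` and `describe_file` are hypothetical helper names, not something from the thread.

```python
import hashlib
import os
import sys

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large file never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def describe_file(path):
    """Return (memory size of the path string, size of the file on disk, SHA-256 of the file)."""
    return sys.getsizeof(path), os.path.getsize(path), file_sha256(path)
```

Running `describe_file("drive/My Drive/data/edited.csv")` in Colab and `describe_file` on the downloaded file locally would give two on-disk sizes and two hashes. If they differ, the file changed or was truncated somewhere between the notebook and the local disk.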

0 Answers