
I have a very simple task: I need to take the sum of one column in a file that has many columns and thousands of rows. However, every time I open the file in Jupyter, it crashes because I cannot go over 100 MB per file.

Is there any workaround for such a task? I feel I shouldn't have to open the entire file since I need just one column.

Thanks!

Navy Seal

3 Answers


I'm not sure if this will work since the information you have provided is somewhat limited, but I had a similar issue on Python 3. Try adding this at the top of your notebook and see if it helps:

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # allow duplicate OpenMP runtimes to load instead of aborting

The above solution is a band-aid: it isn't officially supported and may cause undefined behavior. If your data is too big for your memory, try reading it in with dask instead.

import dask.dataframe as dd
ddf = dd.read_csv(path)  # most pandas.read_csv keyword arguments also work here
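
For example, if your end goal is just the column sum, a minimal sketch might look like this (data.csv and the column name amount are placeholder names, not from your question). Dask only evaluates the result when .compute() is called, so the whole file never has to fit in memory at once:

import dask.dataframe as dd

ddf = dd.read_csv('data.csv', usecols=['amount'])  # read only the one column; extra keywords are passed to pandas.read_csv
total = ddf['amount'].sum().compute()              # trigger the lazy computation
print(total)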
Matt Elgazar
  • I understand what you're saying, but I can't even start typing the code. While I want to use that line, could you perhaps explain what it actually does? I cannot risk having the calculation off by even a cent. Much appreciated. – Navy Seal Nov 10 '18 at 07:53
  • It sort of "ignores" whatever difficulties your program is having and lets it continue running anyway. Your question states that you can't read the file, and my first suggestion goes before you read the file. Also, your system may simply not have enough memory for this much data, which is why my second suggestion is to use dask (or a cloud service) to read it in. If you are sure your computer has enough memory, then I'm not exactly sure what could be causing this. What you could do is: import dask.dataframe as dd; ddf = dd.read_csv(path); ddf.head(2) – Matt Elgazar Nov 10 '18 at 08:00

You have to open the file even if you want just one column; opening it loads it into memory, and that is your problem.

You can either open the file outside IPython and split it into smaller pieces, or

use a library like pandas and read it in chunks, as in the answer; a sketch of the chunked approach follows below.
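
Since you only need the sum of one column, a minimal sketch with plain pandas (assuming a hypothetical file data.csv and a column named amount) can read just that column in chunks and keep a running total:

import pandas as pd

total = 0.0
# usecols avoids loading the other columns; chunksize keeps memory use bounded
for chunk in pd.read_csv('data.csv', usecols=['amount'], chunksize=100000):
    total += chunk['amount'].sum()
print(total)

If being off by even a cent is a concern, summing integer values (e.g. the amount expressed in cents) avoids floating-point rounding.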

Yasin Yousif

You should slice the rows into several smaller data frames and then work on each of those separately. The hanging is caused by insufficient RAM on your system.

Use the new_dataframe = dataframe.iloc[:, :] or new_dataframe = dataframe.loc[:, :] methods for slicing in pandas.

Row slicing goes before the colon and column slicing after it.
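
A minimal sketch of that slicing syntax on a small, hypothetical DataFrame (the column names a and b are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

first_half = df.iloc[:5, :]   # positional slicing: first 5 rows, all columns
only_b = df.loc[:, ['b']]     # label-based slicing: all rows, just column 'b'
print(first_half.shape, only_b['b'].sum())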

loving_guy