
I am trying two different lines of code that both involve computing combinations of rows of a df with 500k rows.

I think because of the large number of combinations, the kernel keeps dying. Is there any way to resolve this?


Both lines of code that crash are

pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)

and

from itertools import combinations
index_comb = list(combinations(df.index, 2))

Both are different ways to achieve the same desired df, but the kernel fails on both.

Would appreciate any help :/

Update: I tried running the code in my terminal and it gives me a "Killed: 9" error - so it is using too much memory in the terminal as well?
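
For scale: 500,000 rows give roughly 500000 * 499999 / 2 ≈ 1.25e11 pairs, so any approach that materialises all of them at once (the cross merge or the list(...) call) will exhaust memory no matter where it runs. Below is a minimal sketch of a lazy alternative, assuming each pair only needs to be processed once rather than kept in a single df (the small stand-in frame is made up for illustration):

import pandas as pd
from itertools import combinations

# small stand-in frame; in the real case df is the 500k-row frame already loaded
df = pd.DataFrame({'a': range(5)})

# combinations() is already a lazy iterator; list() is what forces every pair into memory.
# Iterating over it directly keeps only one pair alive at a time.
for i, j in combinations(df.index, 2):
    pair = (df.loc[i, 'a'], df.loc[j, 'a'])  # process one pair of rows here instead of collecting them all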

1 Answer


There is no solution here that I know of. Jupyter Notebook simply is not designed to handle huge quantities of data. Run your code from a terminal instead; that should work.

In case you run into the same problem when using a terminal, look here: Python Killed: 9 when running a code using dictionaries created from 2 csv files

Edit: I ended up finding a way to potentially solve this: increasing your container size should prevent Jupyter from running out of memory. To do so, open Jupyter's settings.cfg file in the home directory of your notebook, $CHORUS_NOTEBOOK_HOME. The line to edit is this one:

#default memory per container
MEM_LIMIT_PER_CONTAINER="1g"

The default value should be 1 GB per container; increasing this to 2 or 4 GB should help with memory-related crashes. However, I am unsure of any implications this has on performance, so be warned!
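
For example, the edited line might look like this (the "4g" value is only an illustration; pick whatever fits the memory actually available on the machine):

#default memory per container
MEM_LIMIT_PER_CONTAINER="4g"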

  • Thanks :) I don't really use the terminal - would reading CSVs and all that work? pd.read_csv('x.csv') - would I need to alter this to point to the directory where the csv is as well? – Chris90 Mar 11 '19 at 08:40
  • Also, if you want to output a df as a csv from the terminal, can you direct that csv to a specific directory? @neweyes – Chris90 Mar 11 '19 at 08:41
  • Code almost never needs to be altered when coming from a notebook. IPython works under Jupyter, so you should simply be able to copy-paste your code into a file (located in the same directory as the notebook) and run it by typing: ipython filename.py – NewEyes Mar 11 '19 at 08:50
  • I am having trouble understanding your second question... Do you just want to save your output at a specific location? (See the path sketch after this thread.) – NewEyes Mar 11 '19 at 08:52
  • Sorry about that - I updated the post - I tried it in the terminal and I get a Killed: 9 error - it seems there is too little memory for the terminal as well? I don't understand @neweyes – Chris90 Mar 11 '19 at 09:01
  • Edited a link into my answer. – NewEyes Mar 11 '19 at 10:12
  • 2
  • @NewEyes Could you elaborate on what "huge quantities" of data is? I've used datasets in the 10Ks without a hitch, and I'm just wondering where the performance degradations occur that make you say this. – Dave Liu Aug 06 '19 at 19:47
  • First of all, I am unsure where a significant performance degradation actually happens with increasing data. Jupyter just crashes at some point, when the data becomes too big. How big "too big" is depends on how much memory you have enabled for your containers. I recently found out how to do this and will edit it into my answer. – NewEyes Aug 07 '19 at 06:50
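
To make the path questions from the comments concrete, here is a minimal sketch (the file names and directories below are made up for illustration): pd.read_csv accepts a full path, and DataFrame.to_csv takes a destination path, so a script can read from and write to any directory regardless of where it is launched.

import pandas as pd

# hypothetical input and output locations - replace with the real directories
df = pd.read_csv('/path/to/data/x.csv')                    # read the csv from an explicit directory
result = df.describe()                                     # stand-in for the real computation
result.to_csv('/path/to/output/result.csv', index=False)   # write the output csv to a chosen directory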