
For 140K rows of NLP data, there will be a huge number of features at hand.

So increasing the number of trees from 200 to 350 led to an instance crash with n_jobs = 7 parallel jobs on an 8-CPU-core machine. I just want to know whether it works like a Pool() and demands extra memory per job. If I decrease the jobs to 3 or 5, will that help?

Is there any way to prevent the memory crash?

# imports added for completeness -- the original snippet omitted them
import gc
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# min_df, rng, the classifier hyper-parameters, SEED and the data splits
# are defined elsewhere in the script
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=lambda x: x.split(' '),
                              min_df=min_df, ngram_range=(1, rng))),
    ('clf', RandomForestClassifier(n_estimators=n_estimators,
                                   class_weight=class_weight, criterion=criterion,
                                   min_samples_split=min_samples_split,
                                   max_features=max_features, oob_score=oob_score,
                                   warm_start=warm_start, n_jobs=7,
                                   random_state=SEED)),
])

del train, test          # free the raw frames before fitting
gc.collect()

pipeline.fit(train_x, train_y)
acc = accuracy_score(pipeline.predict(test_x), test_y)
print(acc, top_n(pipeline, test_x, test_y))   # top_n: user-defined helper

[Image: monitoring graph of CPU and memory usage recorded during the run]


1 Answer


Q : Is there any way to prevent the memory crash?

A :
Sure. I have had a similar work-experience ever since Win-XP days, with Py2.7 n_jobs crashing in the very same fashion.

Solution :

Step 1 :
Profile the RAM-allocation envelope with a single n_jobs = 1 set in the otherwise same settings. One can create, even inside the Python interpreter, a SIGNAL-based automated memory monitor / recorder to profile the real/virtual memory usage, strobing at any raw or coarse time-quanta and logging either all such data or just a moving-window maximum. A full treatment goes beyond the scope of this post, yet one may reuse my other posts on this and use the Python signal.signal( signal.SIGUSR1, ... )-tools for a hand-made "monitor-n-logger".
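
As a starting point, a minimal sketch of such a strobing monitor is shown below. It assumes a Linux host (it reads VmRSS from /proc/self/status) and, for fully automated strobing, uses SIGALRM with setitimer() rather than the hand-triggered SIGUSR1 route mentioned above; the 1-second interval and peak-only logging are illustrative choices, not the original tooling :

import signal

peak_rss_kib = 0                                  # moving-window maximum of VmRSS

def _read_rss_kib():
    # VmRSS in /proc/self/status is reported in kB on Linux
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

def _sample(signum, frame):
    global peak_rss_kib
    peak_rss_kib = max(peak_rss_kib, _read_rss_kib())

signal.signal(signal.SIGALRM, _sample)
signal.setitimer(signal.ITIMER_REAL, 1.0, 1.0)    # strobe once per second

# ... run the n_jobs = 1 pipeline.fit(train_x, train_y) here ...

signal.setitimer(signal.ITIMER_REAL, 0.0)         # stop strobing
print('peak RSS ~ %.1f MiB' % (peak_rss_kib / 1024.0))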

Step 2 :
Having obtained the RAM-allocation and virtual-memory usage, whether from visual inspection or from the hard data collected in Step 1, spawn only as many n_jobs as fit into the physical RAM (minus one, if you use the host for other work during the RandomForestClassifier .fit()-computing). My use-case was a pool of dedicated, headless hosts, .fit()-training a cavalry of predictor models on many machines in an embarrassingly parallel orchestration, each running at 100% CPU-usage with zero memory-I/O swaps for about 30+ hours, so as to meet the large-scale re-training and model-selection deadlines.
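
A back-of-the-envelope sizing from that measurement might then look like this; every figure below is a placeholder, not a number taken from the question :

# all figures are assumed placeholders -- substitute the Step 1 measurements
peak_rss_gib_per_job = 3.5          # measured with n_jobs = 1 in Step 1
physical_ram_gib     = 16.0         # the host's installed RAM
os_reserve_gib       = 2.0          # headroom for the O/S and other work

n_jobs_safe = max(1, int((physical_ram_gib - os_reserve_gib)
                          / peak_rss_gib_per_job))
print(n_jobs_safe)                  # -> 4 for these assumed numbers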

This said, CPU-usage is not your main enemy here; RAM-efficient computing is.

Your graph shows :

- swap-thrashing did not start ( good )

- cpu-hopping did start, at the cost of decreased cpu-cache data re-use: once a job gets moved away, the data previously kept in the LRU-cache has to be re-fetched from RAM, which is about 1,000x slower than sourcing it from the cpu-core-local L1d-cache ( thermal throttling on big workloads makes cpu-cores hot, and the hardware starts moving jobs from one core to another, hopefully a cooler one; a core-pinning sketch follows this list )

- the cpu workloads are not as hellishly hard as they could be ( as in some even harder number-crunching ), since the RandomForest predictors keep moving and crawling through all the [M,N]-sized data ( N being the number of examples in the sub-set elected for the .fit()-training, yet still "long" ) during the computation, which gives the cpu some time to rest while waiting for the next part of the data to be fed in; so there are times when a CPU-core is not harnessed to its 100% capacity
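
Should the cpu-hopping prove costly, the process ( and the joblib workers it spawns, which inherit the affinity mask ) can be pinned to a fixed set of cores. A Linux-only sketch, with illustrative core IDs :

import os

os.sched_setaffinity(0, {0, 1, 2, 3, 4, 5, 6})    # pid 0 == the calling process
print(os.sched_getaffinity(0))                    # verify the resulting mask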
