
I am trying to learn ML using Kaggle datasets. In one of the problems (using logistic regression), the input and parameter matrices are of size (1110001, 8) and (2122640, 8) respectively.

I am getting a memory error while doing it in Python. I guess this would be the same in any language, since the data is just too big. My question is: how do real-life ML implementations multiply matrices this large (since they would usually be this big)?

Things bugging me:

  1. Some people on SO have suggested calculating the dot product in parts and then combining them. But even then, the resulting matrix would still be too big for RAM (9.42 TB? in this case).

  2. And if I write it to a file, wouldn't it be too slow for optimization algorithms to read from the file and minimize the function?

  3. Even if I do write it to a file, how would fmin_bfgs (or any optimization function) read from it?

  4. Also, the Kaggle notebook shows only 1 GB of storage available. I don't think anyone would allow TBs of storage space.

  5. In my input matrix, many rows have similar values in some columns. Can I use this to my advantage to save space (like a sparse matrix does for zeros)?

Can anyone point me to a real-life sample implementation of such a case? Thanks!
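For scale, the 9.42 TB figure above is consistent with storing the full (1110001, 2122640) product matrix in 4-byte (float32) entries; the float32 assumption is mine, not stated in the question:

```python
# Rough size of the full (1110001, 2122640) product matrix,
# assuming 4-byte float32 entries (an assumption, not from the post).
rows, cols, bytes_per_entry = 1110001, 2122640, 4
size_tb = rows * cols * bytes_per_entry / 1e12
print(f"{size_tb:.2f} TB")  # ~9.42 TB, matching the estimate above
```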

nayakasu
  • Since you're trying to learn ML using Kaggle datasets, I think if you look at the kernels provided by participants, you will find some of the answers. – Ryan Jul 06 '18 at 07:42
  • If you need to read the values of a file in an efficient way, you can use numpy.load with the mmap_mode parameter to read only the parts you need. This saves time and memory and may answer a few of your questions. – Lucas Ramos Jul 06 '18 at 12:25
  • I would try to calculate the requested part of this huge matrix*matrix operation only at the time it is needed. If this isn't an option, h5py with a proper compression algorithm (high efficiency on correlated values) should also do the job. Something like this: https://stackoverflow.com/a/48997927/4045774 – max9111 Jul 06 '18 at 13:12
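The memory-mapping suggestion from the comments can be sketched like this; the file name `big_matrix.npy` and the array contents are illustrative, not from the thread:

```python
import numpy as np

# Illustrative data; any .npy file saved with np.save works the same way.
arr = np.arange(1_000_000, dtype=np.float32).reshape(-1, 8)
np.save("big_matrix.npy", arr)

# mmap_mode='r' maps the file lazily: only the slices you actually touch
# are read from disk, so the whole array never has to fit in RAM.
m = np.load("big_matrix.npy", mmap_mode="r")
chunk = m[1000:2000]   # reads just these rows from disk
print(chunk.shape)     # (1000, 8)
```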

1 Answer


I tried many things. I will mention them here in case anyone needs them in the future:

  • I had already cleaned up the data, e.g. removing duplicates and records irrelevant to the given problem.
  • I stored the large matrices, which hold mostly zeros, as sparse matrices.
  • I implemented gradient descent using the mini-batch method instead of the plain old batch method (theta.T dot X).

Now everything is working fine.
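The mini-batch approach with sparse inputs can be sketched as follows. This is a minimal illustration, not the answerer's actual code: the data is synthetic, and names like `lr` and `batch_size` are my own choices.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Synthetic stand-in for the large Kaggle data: a mostly-zero CSR matrix
# and labels derived from a hidden parameter vector.
X = sparse.random(1000, 8, density=0.3, format="csr", random_state=0)
true_theta = rng.normal(size=8)
y = (X @ true_theta > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(8)
lr, batch_size = 0.1, 100

for epoch in range(50):
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]   # CSR supports row slicing by index array
        # Logistic-regression gradient on the mini-batch only, so the
        # full (n_samples, n_samples)-sized intermediates never materialize.
        grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
        theta -= lr * grad
```

Because each gradient step touches only `batch_size` rows, memory use stays small regardless of how many rows the full dataset has.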

nayakasu