I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python, using numpy, scipy, sklearn, networkx, and other useful libraries. I want to perform operations such as computing the pairwise distances between all of the points and clustering all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: the dense pairwise-distance matrix for 200k+ points has about 200,000 × 200,000 entries, which at 8 bytes each is roughly 320 GB.
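For concreteness, here is a minimal sketch of the kind of call that blows up for me (the random data and the Euclidean metric are just placeholders for my real setup):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Placeholder for my real data: ~200k points in 1000 dimensions (~1.6 GB itself).
X = np.random.rand(200_000, 1000)

# This is essentially what I do now. The dense distance matrix alone is
# 200_000 * 200_000 * 8 bytes ~= 320 GB, so it cannot possibly fit in RAM.
D = pairwise_distances(X, metric="euclidean")
```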
Here's the catch: I would really like to do this on crappy computers with small amounts of RAM.
Is there a feasible way to make this work despite the low-RAM constraint? It is really not a problem if it takes much longer, as long as the runtime doesn't go to infinity!
I would like to be able to set my algorithms running, come back an hour or five later, and not find them stuck because they ran out of RAM. I want to implement this in Python and still be able to use the numpy, scipy, sklearn, and networkx libraries, and to compute things like the pairwise distances between all of my points.
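To make that concrete, the direction I have been vaguely imagining (not something I have working) is to stream the distance matrix in row blocks and spill it to disk, for example with sklearn's pairwise_distances_chunked and a numpy memmap; the file name, dtype, and working-memory size below are just guesses:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(200_000, 1000).astype(np.float32)
n = X.shape[0]

# Back the full distance matrix with a file on disk instead of RAM.
# Note: the file itself is still n * n * 4 bytes ~= 160 GB on disk.
D = np.memmap("distances.dat", dtype=np.float32, mode="w+", shape=(n, n))

row = 0
# pairwise_distances_chunked yields blocks of rows, each sized to fit in the
# given working memory (in MiB), so peak RAM usage stays bounded.
for chunk in pairwise_distances_chunked(X, metric="euclidean", working_memory=256):
    D[row:row + chunk.shape[0]] = chunk
    row += chunk.shape[0]

D.flush()
```

I don't know whether writing the full matrix to disk like this is a sane approach at all, or whether I should avoid ever materialising the matrix and rethink the algorithms instead.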
Is this feasible? How would I go about it, and what should I start reading up on?