I'm looking for a basic pseudo-code outline here.
My goal is to code a classification tree from scratch (I'm learning machine learning and want to build intuition). But my training data is large: 40000 examples and 1000 features. Given that the upper bound on the number of subsets I might need is 2^40000, I'm lost as to how to keep track of all these partitioned datasets.
Say I start with the full dataset and make one split. I can then save the ~20000 examples that fell on one side of the split as their own dataset and re-run the splitting algorithm to find the greedy split for that subset. Say I keep doing this, splitting along the leftmost branch of the tree dozens of times.
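To make this concrete, here's a rough Python sketch of the recursion I have in mind. The names (`gini`, `best_split`, `build_tree`) and the dict-based node layout are my own placeholders, and the greedy search is a naive exhaustive one just to illustrate the idea:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Naive greedy search: try every (feature, threshold) pair and
    return the one with the lowest weighted child impurity."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue  # this split puts everything on one side
            score = (mask.sum() * gini(y[mask])
                     + (~mask).sum() * gini(y[~mask])) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold

def build_tree(X, y, depth=0, max_depth=5):
    """Recursively split, storing each chosen split in its node."""
    feature, threshold = None, None
    if depth < max_depth and len(np.unique(y)) > 1:
        feature, threshold = best_split(X, y)
    if feature is None:  # pure node, depth limit, or no usable split
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "label": values[np.argmax(counts)]}
    mask = X[:, feature] <= threshold  # boolean mask selects the subset;
    return {                           # the subset only lives during recursion
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }
```

I'm not sure this is the right organization, which is exactly what I'm asking about.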
When I'm satisfied with all my leftmost splits, then what? How do I store up to 2^40000 separate subsets? And how do I keep track of all the splits I've taken so that I can classify a test example later? It's the organization of the code that isn't making sense to me.
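For the classification part, the best I can come up with is walking the stored splits from the root. Assuming each internal node is a dict holding the split it made (`feature`, `threshold`) plus `left`/`right` children, and each leaf holds a `label` (all hypothetical names of my own), a sketch would be:

```python
def predict_one(node, x):
    # Follow the stored split at each internal node until a leaf is reached.
    while not node["leaf"]:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]
```

Is this roughly the right shape, or is there a better way to organize it?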