In my real problem I'll have two tables of information, x and y. x will have ~2.6 million records and y will have ~10K, and the two tables have a many-to-one (x->y) relationship. I want to subset x based on y.
The posts that I thought matched best were this and that and also this. I settled on numpy arrays. I'm open to using other data structures; I was just trying to pick something that would scale. Am I using an appropriate approach? Are there other posts that cover this? I didn't want to have to use a database since I'm only doing this once.
The following code tries to illustrate what I'm trying to do.
import numpy
x = numpy.array([(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int), ('category', str, 22)])
y = numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)])
for rec_id, category in x:
    match = y['category'] == category  # boolean mask: rows of y in this category
    if y['value'][match][0] > 3:
        # Select the field first; y[match]['output'] = ... would assign into a copy
        y['output'][match] = rec_id
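For comparison, here is a minimal sketch of a vectorized version of the subsetting step. The loop above rescans y once for every row of x, which may get slow at 2.6 million rows. The sketch assumes the goal is to keep the rows of x whose category has a value greater than 3 in y (the threshold comes from the loop above). The names wanted, mask and x_subset are made up for illustration, and it does not fill in the output column.

```python
import numpy

x = numpy.array(
    [(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')],
    dtype=[('id', int), ('category', 'U22')],
)
y = numpy.array(
    [('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
    dtype=[('category', 'U20'), ('value', float), ('output', int)],
)

# Categories in y whose value exceeds the threshold (assumed to be 3).
wanted = y['category'][y['value'] > 3]

# Boolean mask over x: True where the row's category is one of the wanted ones.
mask = numpy.isin(x['category'], wanted)

# Rows of x that survive the filter.
x_subset = x[mask]
print(x_subset)
```

numpy.isin tests all of x's categories against the selected categories in a single call, so y is filtered only once instead of once per row of x.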