Selecting records in a Python structure based on a related structure

Question

In my real problem, I'll have two tables of information (x,y). x will have ~2.6 million records and y will have ~10K; the two tables have a many to one (x->y) relationship. I want to subset x based on y.

The posts that I thought matched best were this and that and also this. I settled on numpy arrays. I'm open to using other data structures; I was just trying to pick something that would scale. Am I using an appropriate approach? Are there other posts that cover this? I didn't want to have to use a database since I'm only doing this once.

The following code tries to illustrate what I'm trying to do.

import numpy, copy
x=numpy.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y=numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for id, category in x:
    if y[y['category']==category]['value'][0] > 3:
        y[y['category']==category]['output']=numpy.array(copy.deepcopy(id))

Databases are good at joining tables, even if it's a one time operation. — Janne Karila, Apr 30 '13 at 19:11
@JanneKarila any recommendation on a really lightweight db option? I'm sure I'll need it eventually with the kind of stuff I have to do. — Roland, Apr 30 '13 at 21:33
SQLite is lightweight and part of Python standard library. http://docs.python.org/2/library/sqlite3.html — Janne Karila, May 01 '13 at 08:45
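For readers curious about the SQLite suggestion, here is a minimal sketch of the join using the standard-library `sqlite3` module and an in-memory database. The table and column names are assumptions that mirror the toy arrays in the question; they are not from any real schema.

```python
import sqlite3

# In-memory database; the table layout below is hypothetical and only
# mirrors the toy x and y arrays from the question.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE x (id INTEGER, category TEXT)')
conn.execute('CREATE TABLE y (category TEXT PRIMARY KEY, value REAL)')
conn.executemany('INSERT INTO x VALUES (?, ?)',
                 [(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')])
conn.executemany('INSERT INTO y VALUES (?, ?)',
                 [('a', 3.2), ('b', -1), ('c', 0), ('d', 100)])

# Subset x to the records whose category has value > 3 in y.
rows = conn.execute(
    'SELECT x.id, x.category FROM x '
    'JOIN y ON x.category = y.category '
    'WHERE y.value > 3 '
    'ORDER BY x.id, x.category'
).fetchall()
conn.close()

print(rows)   # [(1, 'a'), (3, 'a'), (4, 'd')]
```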

score 3 · Answer 1 · edited May 23 '17 at 11:49

Be careful when you index with a boolean array (y['category']==category) in order to modify the original array (y): 'fancy indexing' returns a copy, not a view, so modifying that copy does not change y. Assigning directly through the mask on an ordinary array, as in a[mask] = value, works fine because it is a single setitem call (this confused me in the past). With a structured array like yours, though, y[c]['output'] = i first evaluates y[c], which is a copy, and then assigns the field on that copy. It sounds confusing, but the upshot is that it won't work as you've written it. Notice that y is unchanged before and after:

for i, category in x:
    c = y['category'] == category   # generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y[c]['output'] = i          # assigns into a temporary copy of y[c]
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [0]
#before: [0]
#after: [0]
#before: [0]
#after: [0]

If you instead take a view with field access first and then apply the fancy index to that view, the assignment becomes a single setitem call on the original data, and it works:

for i, category in x:
    c = y['category'] == category   # generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y['output'][c] = i          # field view first, then the mask
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [1]
#before: [1]
#after: [3]
#before: [0]
#after: [4]
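The difference between the two loops above can be checked directly. This is a minimal sketch on a small hypothetical array `arr` (not the question's data), using `np.shares_memory` to see whether an indexing result still points at the original buffer.

```python
import numpy as np

# Hypothetical structured array, used only for illustration.
arr = np.array([('a', 0), ('b', 0), ('c', 0)],
               dtype=[('category', 'U20'), ('output', int)])
mask = arr['category'] == 'b'

rows_copy = arr[mask]        # boolean (fancy) indexing: a copy of the rows
field_view = arr['output']   # field access: a view onto the original buffer

copy_shares = np.shares_memory(arr, rows_copy)
view_shares = np.shares_memory(arr, field_view)

arr[mask]['output'] = 7      # writes into the temporary copy, which is discarded
after_copy_write = arr['output'].tolist()

arr['output'][mask] = 7      # a single setitem call on the view
after_view_write = arr['output'].tolist()

print(copy_shares, view_shares)             # False True
print(after_copy_write, after_view_write)   # [0, 0, 0] [0, 7, 0]
```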

As you can see, I removed your copy too. i (I avoided the name id because id is a built-in function) is simply an integer, which does not need to be copied. If you do need to copy something, you might be better off using numpy's copy rather than the standard library's copy module, as in

y['output'][...] = np.array(i, copy=True)

or

y['output'][...] = np.copy(i)

In fact, copy=True is the default for np.array, so ... = np.array(i) is sufficient.
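A quick way to see the copying behaviour, as a sketch with a hypothetical array `source`: `np.array` copies by default, while `np.asarray` hands back the same data when the input is already an ndarray of the requested type.

```python
import numpy as np

source = np.arange(3)          # hypothetical data: [0, 1, 2]

copied = np.array(source)      # copy=True is the default for np.array
aliased = np.asarray(source)   # no copy: the input is already an ndarray

source[0] = 99                 # mutate the original afterwards

print(copied[0], aliased[0])   # 0 99
```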

askewchan · answered Apr 30 '13 at 19:42 · edited May 23 '17 at 11:49 (Community)

@unutbu, You can also see the overwriting that you've avoided happening in my print statements, which confused me a bit at first. – askewchan Apr 30 '13 at 19:52
+1 for answering a question I had but didn't even ask! Also, thanks for the other pointers... – Roland Apr 30 '13 at 21:19
so to paraphrase the main point, when I extracted a mask with a boolean query (y['category']==category), the mask was a _new copy_ of the data and not a view onto the original y. By referencing the field I wanted to assign _after_ this new copy was created (the ['output'] text was later on the same line), my reassignment was applied to that copy and not the original. Therefore the code finishes without having touched the original data and the copy is gone. – Roland Apr 30 '13 at 21:30
@Roland: You've summarized the point very well. Just note that it is not the boolean *mask* (`mask = (y['category']==category)`) that makes a new copy, but the use of *fancy indexing*, `y[mask]`, that makes a new copy. (Basic slices return views, but [fancy indexes](http://docs.scipy.org/numpy/docs/numpy-docs/reference/arrays.indexing.rst/#arrays-indexing) always return copies.) – unutbu Apr 30 '13 at 21:53
@unutbu thanks for the clarification/correction. I did indeed misunderstand that despite askewchan's care in pointing this out. Apparently I've been fancy-indexing a lot without realizing that it was substantively different than ordinary indexing. Still reading the links you've both provided. – Roland Apr 30 '13 at 22:09
@Roland Remember that `mask.dtype` is `bool`, so it's a boolean array. You can access the same items as `y[mask]` by doing `y[mask.nonzero()]`, which is a different type of fancy indexing: you give a list (or array) of indices, as in `y[np.array([1,2,4,5])]`, which returns a copy of the 2nd, 3rd, 5th and 6th items in `y`. To see how these last two things are similar, take a look at `mask` itself, and `mask.nonzero()` itself as well. – askewchan May 01 '13 at 03:42
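The equivalence described in that last comment can be sketched on the question's toy `y` array. The variable names `positions`, `by_mask`, `by_position` and `by_list` are introduced here only for illustration.

```python
import numpy as np

y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
             dtype=[('category', 'U20'), ('value', float), ('output', int)])

mask = y['value'] > 3          # boolean array, one entry per row
positions = mask.nonzero()     # tuple holding one array of integer positions

by_mask = y[mask]                  # fancy indexing with booleans
by_position = y[positions]         # fancy indexing with integer positions
by_list = y[np.array([0, 3])]      # the same positions written out by hand

print(mask.dtype, mask.tolist())       # bool [True, False, False, True]
print(positions[0].tolist())           # [0, 3]
print(by_mask['category'].tolist())    # ['a', 'd']
```

All three results hold the same rows, and each one is a copy rather than a view of `y`.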

unutbu · Accepted Answer · 2013-04-30T23:49:10.237

You have 2.6 million records which each (potentially) overwrite one of 10K records. So there could be a lot of overwriting going on. Each time you write to the same location, all the previous work done at that location was for naught.

So you could make things more efficient by looping through y (10K unique? categories) instead of looping through x (2.6M records).

import numpy as np
x = np.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )

for idx in np.where(y['value'] > 3)[0]:
    row = y[idx]
    category = row['category']
    # Only the last record in `x` of the right category affects `y`.
    # So find the id value for that last record in `x`
    idval = x[x['category'] == category]['id'][-1]
    y[idx]['output'] = idval

print(y)

yields

[('a', 3.2, 3) ('b', -1.0, 0) ('c', 0.0, 0) ('d', 100.0, 4)]
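The loop above still scans all of `x` once per selected category. As a possible variation (not part of the original answer), the last id per category can be collected in a single pass over `x` with a plain dictionary. The name `last_id` is hypothetical, and the sketch assumes the category strings fit comfortably in memory as Python objects.

```python
import numpy as np

x = np.array([(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')],
             dtype=[('id', int), ('category', 'U22')])
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
             dtype=[('category', 'U20'), ('value', float), ('output', int)])

# One pass over x: later records overwrite earlier ones, so each
# category ends up mapped to the id of its last record.
last_id = dict(zip(x['category'].tolist(), x['id'].tolist()))

for idx in np.flatnonzero(y['value'] > 3):
    category = str(y['category'][idx])
    if category in last_id:            # skip categories that never occur in x
        y['output'][idx] = last_id[category]

print(y['output'].tolist())   # [3, 0, 0, 4]
```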

This is better than my answer, and by using basic slicing, avoids the issue I was running into and spent most of my time explaining... — askewchan, Apr 30 '13 at 19:43
okay. I like iterating over the smaller dataset (I also like the np.where()--hadn't seen that before!). Thanks. — Roland, Apr 30 '13 at 21:07
