
I have a numpy array containing millions of hourly x y points with the "columns" of the array being x, y, hour, and day of week (all ints). Here is an example of what the array looks like:

array([[1, 2, 0, 0],
       [3, 5, 0, 0],
       [6, 3, 1, 0],
       [6, 2, 3, 0],
       [4, 3, 3, 1]])

I have created a grid of zeros that I can increment for all values in the array:

grid = np.zeros((8,8))
for row in xy_new:
    grid[row[1], row[0]] += 1

but I need to be able to do this for each hour of each day of the week (i.e. Sunday at hour 0, Sunday at hour 1, etc.).

How do I subset the array by hour and day of week?

I have attempted modifying the answers here: Make subset of array, based on values of two other arrays in Python, and Subsetting data in Python, but have not been successful. Any help would be greatly appreciated!

1 Answer


Presumably you want to wind up with 24 × 7 = 168 sets of accumulated counts of (x, y) pairs. Suppose you have your data in an N-by-4 array gdat. First, make a week-hour index:

whr = 24*gdat[:,3] + gdat[:,2]   # day-major: 24*day + hour

You can now select the gdat rows for any hour of the week. For example, for hour zero of Sunday (day 0):

gdat0 = gdat[whr == 0]

Do whatever summing you need with gdat0 and move on to the next hour.
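Putting those two steps together, here is a minimal sketch using the small sample array from the question (an assumption for illustration) and an 8×8 grid, with a day-major index so whr runs 0..167:

```python
import numpy as np

# Sample data from the question: columns are x, y, hour, day-of-week.
gdat = np.array([[1, 2, 0, 0],
                 [3, 5, 0, 0],
                 [6, 3, 1, 0],
                 [6, 2, 3, 0],
                 [4, 3, 3, 1]])

# Composite week-hour index: one value per (day, hour) slot, 0..167.
whr = 24 * gdat[:, 3] + gdat[:, 2]

# Select the rows for day 0, hour 0 and accumulate an 8x8 grid.
gdat0 = gdat[whr == 0]
grid = np.zeros((8, 8))
for x, y, h, d in gdat0:
    grid[y, x] += 1          # row index is y, column index is x
```

Looping `for h in range(168): gdat_h = gdat[whr == h]` then repeats the same selection for every hour of the week.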

Note that np.unique is probably a faster way to count occurrences of (x, y) pairs. You can play the same game of making a composite index for x and y, but you have to know how they are bounded. Supposing x runs from 0 to 120 and y runs from 0 to 5, you could make a composite index using bit fields:

xy = (gdat0[:,0] << 3) | gdat0[:,1]

Obviously, if y has a larger range you need to shift more than 3 bits, and you may need to offset x and y to avoid negative values.

Then, use unique to return the unique values and counts for the values in xy.

xyval, xycnt = np.unique(xy, return_counts=True)

You then retrieve the x and y value pairs from xyval using bitwise operators, xyval >> 3 and xyval & 7.
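As a sketch of the pack/count/unpack round trip, using a hypothetical handful of (x, y) rows for one hour slot (the values are made up; y is assumed to fit in 3 bits):

```python
import numpy as np

# Hypothetical rows for one (day, hour) slot; columns are x, y.
gdat0 = np.array([[1, 2], [1, 2], [3, 5], [6, 3]])

# Pack x and y into one integer: y fits in 3 bits (0..7), so shift x
# left by 3 and OR in y.
xy = (gdat0[:, 0] << 3) | gdat0[:, 1]

# Unique composite values and how often each occurs.
xyval, xycnt = np.unique(xy, return_counts=True)

# Unpack back into (x, y) pairs with the inverse bitwise operations.
xs, ys = xyval >> 3, xyval & 7
```

The duplicated (1, 2) row comes back as a single xyval entry with a count of 2.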

Repeat for every hour in the week. Since storage will be an issue if N is huge, you probably want to re-use gdat0 on each iteration.

EDIT: The short data sample you posted is time-sequential. If all your data are time-sequential, you don't need to "select" for each hour: with a day-major index (24*day + hour), whr is already sorted, so np.unique(whr, return_index=True) gives you the starting row of each hour's block directly!
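A sketch of that shortcut, assuming a hypothetical whr that is already sorted because the rows are in time order:

```python
import numpy as np

# Hypothetical week-hour index for time-sequential data (already sorted).
whr = np.array([0, 0, 1, 3, 3, 27])

# For sorted data, the first occurrence of each value marks the start
# of that hour's contiguous block of rows.
vals, starts = np.unique(whr, return_index=True)

# Rows for hour vals[i] occupy the half-open range bounds[i]:bounds[i+1]
# (append len(whr) as the final boundary).
bounds = np.append(starts, len(whr))
blocks = {int(v): (int(bounds[i]), int(bounds[i + 1]))
          for i, v in enumerate(vals)}
```

Each block can then be sliced out of the original array without ever building a boolean mask.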

Frank M
  • Thank you! The first method you provided worked quite well once I changed gdat0 = gdat[:, whr == 0] to gdat0 = gdat[whr == 0]. The way you had it created an error: index out of range in dimension 1 – user5586329 Nov 23 '15 at 15:26