I have a dataset with the dimensions year, lat, and lon and a variable x, to which I am applying a function to compute some other statistics.

For one lat/lon cell that I chose, I ranked the single x value for each of the 1,000 years from greatest to least and removed the NaNs, leaving me with a sorted 1D array. From there, I used a function to determine a rank, then pulled the x value at that rank from the 1D array.

Example:

array of x values = [6, 10, 5, nan, 4, nan, 3]

sorted array = [10, 6, 5, 4, 3]

pull x value at calculated rank, say rank=2

final ranked value at that lat/lon = 6

This process works well for a single point, but I am trying to repeat it for every lat/lon grid cell in the entire array. I feel this should be simple, but I am having trouble applying these functions to the full array.

Thank you!

rww95

1 Answer

It would likely help if you included the code for your single cell. Also, where are you having trouble?

If your array fits in memory, you can always get the underlying numpy array with .values and then apply e.g. sort or argsort; just make sure you choose the right axis. These numpy functions operate on the entire array at once, sorting along the axis you specify.

https://numpy.org/doc/stable/reference/generated/numpy.sort.html
https://numpy.org/doc/stable/reference/generated/numpy.argsort.html

Note that for memory access efficiency, you ideally sort over the last axis. This might require transposing your array so that year is the last dimension.

See this answer for more background: What is the difference between contiguous and non-contiguous arrays?
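As a rough sketch of the numpy route, the snippet below moves year to the last axis, sorts every cell at once, and picks the value at a given rank. The function name value_at_rank, the toy data, and the convention that rank 1 is the largest value are assumptions for illustration, not taken from the question's code. It relies on np.sort placing NaNs at the end, so sorting the negated values gives a descending order with NaNs last.

```python
import numpy as np


def value_at_rank(x, rank):
    """Value at `rank` (1 = largest) along the last axis, ignoring NaNs.

    `rank` may be a scalar or an array with one rank per cell.
    A rank beyond the number of valid values yields NaN.
    """
    # np.sort puts NaNs last; negating twice turns ascending into descending.
    desc = -np.sort(-x, axis=-1)
    idx = np.broadcast_to(np.asarray(rank) - 1, desc.shape[:-1])
    return np.take_along_axis(desc, idx[..., np.newaxis], axis=-1)[..., 0]


# The single cell from the question.
cell = np.array([6, 10, 5, np.nan, 4, np.nan, 3])

# Hypothetical (year, lat, lon) array: each cell is a scaled copy of `cell`.
data = cell[:, None, None] * np.array([[1.0, 2.0], [3.0, 4.0]])

# Put year last and make the result contiguous before sorting.
by_cell = np.ascontiguousarray(np.moveaxis(data, 0, -1))  # (lat, lon, year)

result = value_at_rank(by_cell, rank=2)  # one value per lat/lon cell
```

For the cell in the question, value_at_rank(cell, 2) gives 6, matching the worked example. Passing an array of shape (lat, lon) as rank lets each cell use its own calculated rank.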

xarray's apply_ufunc, as Ray Bell suggests in the comments, is probably the nicest solution.
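A minimal sketch of what that could look like, assuming the data is an xarray DataArray with dims (year, lat, lon). The helper nth_largest and the toy values are made up for illustration. Declaring year as a core dimension makes apply_ufunc move it to the last axis before calling the function, so no manual transpose is needed.

```python
import numpy as np
import xarray as xr


def nth_largest(x, rank):
    """Value at `rank` (1 = largest) along the last axis, NaNs sorted last."""
    desc = -np.sort(-x, axis=-1)
    return desc[..., rank - 1]


# Hypothetical data: each cell is a scaled copy of the question's example.
cell = np.array([6, 10, 5, np.nan, 4, np.nan, 3])
da = xr.DataArray(
    cell[:, None, None] * np.array([[1.0, 2.0], [3.0, 4.0]]),
    dims=("year", "lat", "lon"),
    name="x",
)

ranked = xr.apply_ufunc(
    nth_largest,
    da,
    kwargs={"rank": 2},
    input_core_dims=[["year"]],  # year is consumed; output has dims (lat, lon)
)
```

The result keeps the lat and lon dimensions and coordinates, which is the main convenience over working on .values directly.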

If the trouble is that your array is too large to fit in memory, try reading the dataset in chunks (over lat and lon, not year); the apply_ufunc approach then becomes necessary to stream the data with dask.
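The chunked variant might look roughly like this, assuming dask is installed. The toy array stands in for a dataset that would normally be opened from disk with a chunks argument; the helper nth_largest and the chunk sizes are illustrative assumptions. The year dimension is kept in a single chunk because dask="parallelized" requires core dimensions to be unchunked.

```python
import numpy as np
import xarray as xr


def nth_largest(x, rank):
    """Value at `rank` (1 = largest) along the last axis, NaNs sorted last."""
    desc = -np.sort(-x, axis=-1)
    return desc[..., rank - 1]


# Hypothetical data, chunked over lat and lon but not over year.
cell = np.array([6, 10, 5, np.nan, 4, np.nan, 3])
da = xr.DataArray(
    cell[:, None, None] * np.array([[1.0, 2.0], [3.0, 4.0]]),
    dims=("year", "lat", "lon"),
    name="x",
).chunk({"year": -1, "lat": 1, "lon": 1})

lazy = xr.apply_ufunc(
    nth_largest,
    da,
    kwargs={"rank": 2},
    input_core_dims=[["year"]],
    dask="parallelized",   # apply the function block by block
    output_dtypes=[float],
)

result = lazy.compute()  # nothing is evaluated until this call
```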

Huite Bootsma