
The Problem

I'm a physics graduate research assistant, and I'm trying to build a very ambitious Python code that, boiled down to the basics, evaluates a line integral in a background described by very large arrays of data.

I'm generating large sets of data that are essentially t x n x n arrays, with n and t on the order of 100. They represent the time-varying temperatures and velocities of a 2-dimensional space (essentially 3 separate grids: x-velocity, y-velocity, and temperature, since the vector information is important). I need to collect many of these grids, then randomly choose a dataset and calculate a numerical integral that depends on the grid data along some random path through the plane. The end goal is gross amounts of statistics on the integral values for given datasets.

[Visual example of time-varying background]
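
To make the end goal concrete, here is a very rough sketch of the kind of evaluation I'm after. The callables temp_f, xvel_f, yvel_f are placeholders for the continuous interpolated backgrounds, and the integrand is made up purely for illustration:

import numpy as np

# Placeholder callables temp_f, xvel_f, yvel_f map (t, x, y) -> value and stand
# in for the continuous interpolated backgrounds; path(s) returns (x, y) and
# t_of_s(s) returns the time, for a path parameter s in [0, 1].
def line_integral(temp_f, xvel_f, yvel_f, path, t_of_s, n_samples=500):
    s = np.linspace(0.0, 1.0, n_samples)
    x, y = path(s)
    t = t_of_s(s)
    # Made-up integrand: temperature weighted by the local flow speed.
    integrand = temp_f(t, x, y) * np.hypot(xvel_f(t, x, y), yvel_f(t, x, y))
    # Arc-length element |dr/ds| from finite differences of the path.
    dl = np.hypot(np.gradient(x, s), np.gradient(y, s))
    # Trapezoid rule over the path parameter.
    f = integrand * dl
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))

The statistics would then come from evaluating something like this over thousands of random paths and randomly chosen backgrounds.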

That essentially entails being able to sample the background at particular points in time and space, say at (t, x, y). The data begins as a big ol' table of points, with each row organized as ['time','xpos','ypos','temp','xvel','yvel'], with an entry for each (xpos, ypos) point at each timestep time, and I can massage it how I like.
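
For reference, the sort of massaging I have in mind is something like the following, assuming every (time, xpos, ypos) combination appears exactly once (which I believe is true of my data); the filename is just a placeholder:

import numpy as np
import pandas as pd

# Read the flat table and sort it so that ypos varies fastest, then xpos,
# then time; each field can then be reshaped into a (n_times, n_x, n_y) array.
df = pd.read_fwf('some_background.dat', header=None,
                 names=['time', 'xpos', 'ypos', 'temp', 'xvel', 'yvel'])
df = df.sort_values(['time', 'xpos', 'ypos'])

times = np.sort(df['time'].unique())
xs = np.sort(df['xpos'].unique())
ys = np.sort(df['ypos'].unique())

shape = (len(times), len(xs), len(ys))
temp_grid = df['temp'].to_numpy().reshape(shape)
xvel_grid = df['xvel'].to_numpy().reshape(shape)
yvel_grid = df['yvel'].to_numpy().reshape(shape)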

The issue is that I need to sample thousands of random paths in many different backgrounds, so time is a big concern. Barring the time required to generate the grids, the biggest holdup is being able to access the data points on the fly.

I previously built a proof of concept of this project in Mathematica, which is much friendlier to the analytic mindset that I'm approaching the project from. In that prototype, I read in a collection of 10 backgrounds and used Mathematica's ListInterpolation[] function to generate a continuous function that represented the discrete grid data. I stored these 10 interpolating functions in an array and randomly called them when calculating my numerical integrals.

This method worked well enough, after some massaging, for 10 datasets, but as the project moves forward that may rapidly expand to, say, 10000 datasets. Eventually, the project is likely to be set up to generate the datasets on the fly, saving each for future analysis if necessary, but that is some time off. By then, it will be running on a large cluster machine that should be able to parallelize a lot of this.

In the meantime, I would like to generate some number of datasets and be able to sample from them at will, using whatever process is likely to be fastest. The data must be interpolated to be continuous, but beyond that I can be very flexible about how I do this. My plan was to do something similar to the above, but I was hoping to find a way to generate these interpolating functions for each dataset ahead of time, then save them to some file. The code would then randomly select a background, load its interpolating function, and evaluate the line integral.
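
Concretely, what I'm imagining is something along these lines, with SciPy's RegularGridInterpolator standing in for Mathematica's ListInterpolation[] (I haven't verified that these objects pickle cleanly, and the grid arrays and filenames are placeholders from the reshaping sketch above):

import pickle
from scipy.interpolate import RegularGridInterpolator

# Build one callable interpolant per field from the (times, xs, ys) axes and
# the (n_times, n_x, n_y) value arrays produced earlier.
temp_interp = RegularGridInterpolator((times, xs, ys), temp_grid)

# Precompute and save the interpolant for later runs...
with open('background_0_temp_interp.pkl', 'wb') as f:
    pickle.dump(temp_interp, f)

# ...then reload it when this background is randomly selected.
with open('background_0_temp_interp.pkl', 'rb') as f:
    temp_interp = pickle.load(f)

# Sample the continuous background at one arbitrary (t, x, y) point
# (which must lie inside the grid, or this raises by default).
value = temp_interp([[1.0, 0.5, -0.3]])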

Initial Research

While hunting to see if someone else had already asked a similar question, I came across this:

Fast interpolation of grid data

The OP seemed interested in just getting back a tighter grid rather than a callable function, which might be useful to me if all else fails, but the solutions also seemed to rely on methods that are limited by the size of my datasets.

I've been Googling about for interpolation packages that could get at what I want. The only things I've found that seem to fit the bill are:

SciPy griddata()

SciPy interpn()

NumPy interp()

Attempts at a Solution

I have one sample dataset (I would make it available, but it's about 200 MB), and I'm trying to generate and store an interpolating function for the temperature grid. Even just this step is proving pretty troubling for me, since I'm not very fluent in Python. I found that it was slightly faster to load the data through pandas, cut it down to the sections I'm interested in, and then stick that in a NumPy array.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Load grid data from file
gridData = pd.read_fwf('Backgrounds\\viscous_14_moments_evo.dat', header=None, names=['time','xpos','ypos','temp','xvel','yvel'])

# Set grid parameters.
# nGridSpaces is the total number of spatial grid spaces / bins per timestep.
# It will be data-dependent in the future.
nGridSpaces = 27225

# The number of timesteps is the length of gridData's time column divided by
# the number of grid spaces per timestep.
NT = int(len(gridData['time']) / nGridSpaces)

From here, I've tried to use SciPy's interpn() and griddata() functions, to no avail. I believe I'm just not understanding how they want to take the data. I think that my issue with both is trying to corral the (t, x, y) "points" corresponding to the temperature values into a usable form.

The main thrust of my efforts has been trying to get them into NumPy's meshgrid(), but I believe I may be hitting the upper limit on the size of array NumPy will build for this sort of thing.

# Lists of points individually
tList=np.ndarray.flatten(pd.DataFrame(gridData[['time']]).to_numpy())
xList=np.ndarray.flatten(pd.DataFrame(gridData[['xpos']]).to_numpy())
yList=np.ndarray.flatten(pd.DataFrame(gridData[['ypos']]).to_numpy())

# 3D grid of points
points = np.meshgrid(tList, xList, yList)

# List of temperature values
tempValues=np.ndarray.flatten(pd.DataFrame(gridData[['temp']]).to_numpy())

# Interpolate and spit out a value for a point somewhere central-ish as a check
point = np.array([1,80,80])
griddata(points, tempValues, point)

This raises a ValueError on the line calling meshgrid():

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
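
If I'm doing the arithmetic right, the request itself is astronomically large, which would explain the error (the ~100 timesteps here is my rough estimate for this dataset):

# Back-of-the-envelope for the meshgrid() call above.
n_entries = 100 * 27225          # length of each of tList, xList, yList (~2.7 million)
elements = n_entries ** 3        # elements in EACH of the three meshgrid outputs (~2e19)
print(elements * 8 / 1e18)       # float64 bytes per array -> roughly 160 exabytes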

The Questions

First off... What limitations are there on the sizes of the datasets I'm using? I wasn't able to find anything in NumPy's documentation about maximum array sizes.

Next... Does this even sound like a smart plan? Can anyone think of a smarter framework to get where I want to go?

What are the biggest impacts on speed when handling these types of large arrays, and how can I mitigate them?

  • `"I was hoping to find a way to generate these interpolating functions for each dataset ahead of time, then save them to some file"` - I don't understand this. Interpolations (especially the linear ones that you've linked) are very cheap. Moving data around is what's expensive. Saving data to disk and loading it back is even more expensive. What you're describing so far, even with 10000 datasets, corresponds to ~120 gigs of data. This could, in principle, fit into a single fat desktop. With a "large cluster", it should be no problem at all. So, where exactly is the problem? – Andrey Tyukin Nov 02 '21 at 21:24
  • Well, the main idea there is just being able to access the data faster (as a human). If I want to make an interactive object to showcase different parts of the calculation (in Mathematica), it works about 10 times faster if I have the interpolated function rather than the grid. As for having them saved and not interpolating at runtime, that just helps me if I need to really quickly pull up one of those interactive objects on the fly to discuss with collaborators, rather than needing to set everything up and wait to load the data, then interpolate while I stall trying to explain something. – Jopacabra Nov 03 '21 at 23:47

0 Answers