
I have a number of CSV files with x, y, and z coordinates. These coordinates are not long/lat, but rather a distance from an origin. So within the CSV, there is a 0,0 origin, and all other x, y locations are a distance from that origin point in meters.

The x and y values will be both negative and positive floats. The largest file I have is ~1.4 million data points, the smallest is ~20k.

The files represent an irregularly shaped map of sorts. The distance values will not produce a uniform shape such as a rectangle, circle, etc. I need to generate a bounding box that fits the most area within the values contained in the CSV files.

Logically, here are the steps I want to take:

  • Read the points from the file
  • Get the minimum and maximum x coordinates
  • Get the minimum and maximum y coordinates.
  • Use min/max coordinates to get a bounding rectangle with (xmin,ymin), (xmax,ymin), (xmin,ymax) and (xmax,ymax) that will contain the entirety of the values of the CSV file.
  • Create a grid across that rectangle with a 1 m resolution. Set that grid as a boolean array for the occupancy.
  • Round the map coordinates to the nearest integer.
  • For every rounded map coordinate switch the occupancy to True.
  • Use a morphological filter to erode the edges of the occupancy map.
  • Now when a point is selected check the nearest integer value and whether it falls within the occupancy map.
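If it helps, the steps above can be sketched end-to-end in plain NumPy. This is only a sketch: the random points are a stand-in for the CSV data, and the hand-rolled 4-neighbour erosion is a stand-in for a proper morphological filter such as `scipy.ndimage.binary_erosion`.

```python
import numpy as np

# Synthetic stand-in for the CSV points (x, y in metres around a 0,0 origin).
rng = np.random.default_rng(0)
pts = rng.uniform(-50.0, 50.0, size=(200_000, 2))

# Bounding rectangle from the per-axis min/max.
min_x, min_y = np.floor(pts.min(axis=0)).astype(int)
max_x, max_y = np.floor(pts.max(axis=0)).astype(int)

# Boolean occupancy grid at 1 m resolution.
occ = np.zeros((max_x - min_x + 1, max_y - min_y + 1), dtype=bool)
ix = np.floor(pts[:, 0]).astype(int) - min_x
iy = np.floor(pts[:, 1]).astype(int) - min_y
occ[ix, iy] = True

# Simple 4-neighbour erosion: a cell survives only if it and all four
# neighbours are occupied; border cells are dropped outright.
eroded = occ.copy()
eroded[1:, :] &= occ[:-1, :]
eroded[:-1, :] &= occ[1:, :]
eroded[:, 1:] &= occ[:, :-1]
eroded[:, :-1] &= occ[:, 1:]
eroded[0, :] = eroded[-1, :] = eroded[:, 0] = eroded[:, -1] = False

def is_inside(x, y):
    """Check whether a query point falls within the eroded occupancy map."""
    i = int(np.floor(x)) - min_x
    j = int(np.floor(y)) - min_y
    return 0 <= i < eroded.shape[0] and 0 <= j < eroded.shape[1] and bool(eroded[i, j])
```

The grid here is a single boolean array rather than anything per-cell, which keeps the whole structure at one byte per square metre.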

I'm facing multiple issues, but so far my biggest one is memory resources. For some reason this script keeps dying with a SIGKILL, or at least I think that is what is occurring.
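For a sense of scale (the numbers below are hypothetical, since the real extent depends on the data), two dense float64 meshgrid outputs over even a modest map extent get large quickly, which is consistent with the OOM behaviour:

```python
import numpy as np

# Hypothetical extent: a map spanning 10 km x 10 km at 1 m resolution.
nx = ny = 10_000
cells = nx * ny

# meshgrid with copy=True materialises two full float64 arrays (xx and yy).
meshgrid_bytes = 2 * cells * np.dtype(np.float64).itemsize
print(meshgrid_bytes / 1e9)  # 1.6 -> about 1.6 GB before any occupancy data

# A packed boolean occupancy grid over the same extent is 16x smaller.
occupancy_bytes = cells * np.dtype(np.bool_).itemsize
print(occupancy_bytes / 1e9)  # 0.1 -> about 100 MB
```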

import math
import os

import numpy as np


class GridBuilder:
    """Builds a 1 m occupancy grid from a CSV of x, y, z points."""

    def __init__(self, filename, search_radius) -> None:
        """..."""

        self.filename = filename
        self.search_radius = search_radius

        self.load_map()
        self.process_points()

    def load_map(self):
        """..."""
        data = np.loadtxt(self.filename, delimiter=",")

        self.x_coord = data[:, 0]
        self.y_coord = data[:, 1]
        self.z_coord = data[:, 2]

    def process_points(self):
        """..."""

        min_x = math.floor(np.min(self.x_coord))
        min_y = math.floor(np.min(self.y_coord))

        max_x = math.floor(np.max(self.x_coord))
        max_y = math.floor(np.max(self.y_coord))

        int_x_coord = np.floor(self.x_coord).astype(np.int32)
        int_y_coord = np.floor(self.y_coord).astype(np.int32)

        x = np.arange(min_x, max_x + 1, 1)  # +1 so the maximum is included
        y = np.arange(min_y, max_y + 1, 1)

        xx, yy = np.meshgrid(x, y, copy=False)

if __name__ == "__main__":
    MAP_FILE_DIR = r"/sample_data"
    FILE = "testfile.csv"
    fname = os.path.join(MAP_FILE_DIR, FILE)
    builder = GridBuilder(fname, 500)

My plan was to take the grid with the coordinates and update each location with a dataclass.

from dataclasses import dataclass


@dataclass
class LocationData:
    """..."""

    coord: list
    occupied: bool

This identifies the grid location, and whether it's found within the CSV file map.
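As an aside (a rough measurement, not from the original post): one dataclass instance per grid cell costs orders of magnitude more memory than one entry in a packed boolean array, which matters at millions of cells:

```python
import sys
from dataclasses import dataclass

import numpy as np


@dataclass
class LocationData:
    """One grid cell: its coordinate and occupancy flag."""
    coord: list
    occupied: bool


# Rough per-cell cost of the dataclass approach: the instance, its
# attribute dict, and the coordinate list (ignoring the ints inside).
cell = LocationData(coord=[0, 0], occupied=False)
dataclass_bytes = (
    sys.getsizeof(cell) + sys.getsizeof(cell.__dict__) + sys.getsizeof(cell.coord)
)

# Per-cell cost of a packed boolean occupancy array: one byte.
array_bytes = np.zeros(1, dtype=bool).itemsize
```

The exact dataclass figure varies by Python version, but it is always far more than the single byte a boolean array spends per cell.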

I understand this is going to be a time-consuming process, but I figured this would be my first attempt.

I know Stack Overflow generally dislikes attachments, but I figured a sample dataset of what I'm working with might be useful, so I've uploaded a file for sharing. test_file

UPDATE: the original code utilized itertools to generate a grid for each location. I switched from itertools to numpy's meshgrid() instead. This caused the same issue, but meshgrid() has a copy parameter that can be set to False to preserve memory resources. This fixed the memory issue.

Michael

  • Check the kernel logs. You'll probably see the OOM killer active, meaning you ran the system out of RAM and the process was killed to free some up. – Charles Duffy Dec 13 '22 at 14:53
  • It's possible to avoid that if you turn off memory overcommit -- so you get allocation failures when an application requests more memory than the system has, instead of a SIGKILL later after you try to cash the bad check you were given (in the form of allocated virtual memory addresses where there isn't enough physical memory to back them). – Charles Duffy Dec 13 '22 at 14:55
  • This is something modern kernels do because applications often request more memory -- often, _much_ more memory -- than they're going to actually use. – Charles Duffy Dec 13 '22 at 14:57
  • I changed the code from itertools to numpy and utilized the meshgrid() function. It also created the same issue; however, when I set the copy parameter to False, it corrected the issue. – Michael Dec 13 '22 at 14:57
  • Great. Think about putting that information in a separate answer added with "Add an Answer" -- answers don't belong in question text. – Charles Duffy Dec 13 '22 at 16:08

2 Answers


As others mentioned, your process is probably being killed by OOM killer for using too much memory.

You don't really need to load the whole file into memory (which is what np.loadtxt does) just to find the extents.

You can read file line by line and calculate min/max values with something like this:

min_x = float("+inf")
max_x = float("-inf")
min_y = float("+inf")
max_y = float("-inf")
with open("/your/file/path", "r") as f:
    for row in f:
        cols = row.split(",")
        x, y = float(cols[0]), float(cols[1])  # parse as floats, not strings
        min_x = min(x, min_x)
        max_x = max(x, max_x)
        min_y = min(y, min_y)
        max_y = max(y, max_y)

This uses minimal memory.

previous_developer
  • You are correct; however, I do need the data open, as there are several other processes that will run as part of this class and will need to access the file's data at any given time. – Michael Dec 14 '22 at 13:27

OK, so with the aid of several other users I've come up with a solution to my immediate issue.

I solved the memory error by utilizing numpy's meshgrid function and disabling the copy parameter.

xx, yy = np.meshgrid(x, y, copy=False)

MeshGrid() Documentation
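The memory difference is visible directly: with copy=False, meshgrid returns broadcast views with a zero stride along the repeated axis, so no len(x) * len(y) buffer is allocated per output (small sizes here just to illustrate):

```python
import numpy as np

x = np.arange(0, 1_000)
y = np.arange(0, 2_000)

xx_copy, yy_copy = np.meshgrid(x, y, copy=True)   # two full 2000x1000 buffers
xx_view, yy_view = np.meshgrid(x, y, copy=False)  # broadcast views over x and y

# Same shape either way...
print(xx_copy.shape, xx_view.shape)  # (2000, 1000) (2000, 1000)

# ...but the view repeats rows via a zero stride instead of storing them.
print(xx_view.strides)  # contains a 0; xx_copy.strides does not
```

One caveat from the NumPy docs: the copy=False outputs can share memory, so they should be treated as read-only.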

However, that implementation was overkill; I only needed two lists, an X and a Y list composed of their respective min and max values spaced at intervals of 1. This allowed me to create the grid map I was looking for, with no memory issues.

def process_points(self):
    """..."""

    int_x_coord = np.floor(self.x_coord).astype(np.int32)
    int_y_coord = np.floor(self.y_coord).astype(np.int32)

    min_x = np.min(int_x_coord)
    min_y = np.min(int_y_coord)
    max_x = np.max(int_x_coord)
    max_y = np.max(int_y_coord)

    x = np.arange(min_x, max_x + 1, 1)
    y = np.arange(min_y, max_y + 1, 1)

    grid = np.zeros(shape=(len(x), len(y)))

    # mark each rounded point as occupied
    for x_coord, y_coord in zip(int_x_coord, int_y_coord):
        x_pos = np.where(x == x_coord)[0][0]
        y_pos = np.where(y == y_coord)[0][0]
        grid[x_pos, y_pos] = 1
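One further simplification (not part of the answer above, just a sketch with hypothetical inputs): because x and y are consecutive integers starting at the minima, the per-point np.where searches can be replaced by direct index arithmetic, turning the whole loop into a single vectorised assignment:

```python
import numpy as np

# Hypothetical rounded coordinates standing in for int_x_coord / int_y_coord.
int_x_coord = np.array([-3, 0, 2, 2], dtype=np.int32)
int_y_coord = np.array([1, -1, 0, 4], dtype=np.int32)

min_x, min_y = int_x_coord.min(), int_y_coord.min()
max_x, max_y = int_x_coord.max(), int_y_coord.max()

grid = np.zeros((max_x - min_x + 1, max_y - min_y + 1))

# A value v sits at index v - min along its axis, so no search is needed.
grid[int_x_coord - min_x, int_y_coord - min_y] = 1
```

This does the same work as the loop but in O(n) with no per-point scans over the axis arrays.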
Michael