Idea for a data structure to store 2D data?

Question

I have a large 2D grid, x-by-y. The user of the application will add data about specific points on this grid. Unfortunately, the grid is far too big to be implemented as a large x-by-y array because the system on which this is running does not have enough memory.

What is a good way to implement this so that only the points that have data added to them are stored in memory?

My first idea was to create a BST of the data points. A hash function such as "(long)x<<32 + y" would be used to compare the nodes.

I then concluded that this could lose efficiency if not well balanced so I came up with the idea of having a BST of comparable BSTs of points. The outer BST would compare the inner BSTs based on their x values. The inner BSTs would compare the points by their y values (and they would all have the same x). So when the programmer wants to see if there is a point at (5,6), they would query the outer BST for 5. If an inner BST exists at that point then the programmer would query the inner BST for 6. The result would be returned.

Can you think of any better way of implementing this?

Edit: In regards to HashMaps: Most HashMaps require having an array for the lookup. One would say "data[hash(Point)] = Point();" to set a point and then find the Point by hashing it to find the index. The problem, however, is that the array would have to be the size of the range of the hash function. If this range is less than the total number of data points that are added then they would either have no room or have to be added to an overflow. Because I don't know the number of points that will be added, I would have to make an assumption that this number would be less than a certain amount and then set the array to that size. Again, this instantiates a very large array (although smaller than originally if the assumption is that there will be less data points than x*y). I would like the structure to scale linearly with the amount of data and not take up a large amount when empty.

It looks like what I want is a SparseArray, as some have mentioned. Are they implemented similarly to having a BST inside of a BST?

Edit2: Map<> is an interface. If I were to use a Map then it looks like TreeMap<> would be the best bet. So I would end up with TreeMap< TreeMap< Point> >, similar to the Map< Map< Point> > suggestions that people have made, which is basically a BST inside of a BST. Thanks for the info, though, because I didn't know that the TreeMap<> was basically the Java SDK of a BST.

Edit3: For those whom it may concern, the selected answer is the best method. Firstly, one must create a Point class that contains (x,y) and implements comparable. The Point could potentially be compared by something like (((long)x)<<32)+y). Then one would TreeMap each point to the data. Searching this is efficient because it is in a balanced tree so log(n) cost. The user can also query all of this data, or iterate through it, by using the TreeMap.entrySet() function, which returns a set of Points along with the data.

In conclusion, this allows for the space-efficient and search-efficient implementation of a sparse array, or in my case, a 2D array, that can also be iterated through efficiently.

http://stackoverflow.com/questions/390181/sparse-matrices-arrays-in-java — GriffeyDog, Jun 21 '13 at 16:09
You seem to be interested more in the underlying implementation of the data structure, instead of how you're going to use it. If you need some spacial queries (points with x between 10 and 40), or nearest neighbor queries, you could use some of the structures AlexWien mentioned, or some navigable map. If you need to look-up some specific point only, a plain old HashMap would do a good job - http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html — jmruc, Jun 21 '13 at 16:52
@Kiril Raychev: Once the points are added, I plan on using all of the data in the structure to make calculations but do not have a need for ranged queries. — Reed B, Jun 21 '13 at 16:56
Ok, seems that the Map is best for your usage. But when you get into speed an s space probelms, consider using a HashMap that is not Object based, that saves 60% of memory space. (point object vs primitive types) — AlexWien, Jun 21 '13 at 18:50
@AlexWien: If I use a primative hashmap then I would have to push a large array onto the stack, as my first edit explained. This is nearly as inefficient memory-wise as using a direct-mapped array because both require a large amount of space on start-up. If the mapping is allocated dynamically then I am able to use very little memory when there are few points (but yes, there will be pointer overhead). — Reed B, Jun 21 '13 at 19:03
as long as you dont explain your operations, its not possible to find the best structure. a b-tree with points in an morton indexed array is also possible. or a grid of hashmaps — AlexWien, Jun 21 '13 at 21:27

score 8 · Answer 1 · edited Dec 16 '16 at 11:50

8

Either a Quadtree, a k-d-tree or an R-tree.

Store index to large point array into one of the spatial structures. Such spatial structures are advantageous if the data is not equally distributed, like geographic data that concentrates in cities, and have no point in the sea.

Think if you can forget the regular grid, and stay with the quad tree.
(Think, why do you need a regular grid? A regular grid is usually only a simplification)

Under no circumstances use Objects to store a Point. Such an Object needs 20 bytes only for the fact that it is an object! A bad idea for a huge data set.

An int x[], and int[] y, or an int[]xy array is ideal related to memory usage.

Consider reading

Hanan Samet's "Foundations of Multidimensional Data Structures"

(at least the Introduction).

edited Dec 16 '16 at 11:50

Andrea Ligios

49,480
26
114
243

answered Jun 21 '13 at 16:12

AlexWien

28,470
6
53
83

@AndreaLigios, yes with these you can raise the performance by a factor of 100 to 1000, compared to your old implementation – AlexWien Jun 21 '13 at 16:34
These are good structures but a quadtree wouldn't be best because my data is arranged in discrete rows and columns instead of points distributed in a 2d continuous domain which is what the quadtree was designed for. Thanks for the answer! – Reed B Jun 21 '13 at 18:25
The quadtree was notz designed for continous coordinates. It is for integer coordinates, usually a power of two. So discrete. The quad tree is an index, not the storage itself. It used to find the points nearby with a minimum of effort. You can store you data as pints (row, col) or (x,y). is your data equally distributed or clustered on some spots? – AlexWien Jun 21 '13 at 18:37
I too recommend some kind of tree structure like the ones mentioned here to preserve the spatial correlation, whereas with the hash table approach you lose that information. – Wildhammer Apr 26 '20 at 16:46

jmruc · Accepted Answer · 2013-06-21T16:31:04.847

4

You could use a Map<Pair, Whatever> to store your data (you have to write the Pair class). If you need to iterate the data in some specific order, make Pair Comparable, and use NavigableMap

edited Jun 21 '13 at 16:31

answered Jun 21 '13 at 16:11

jmruc

5,714
3
22
41

+1 good solution; beat me to it :) I also like the mention of `NavigableMap`. – Vivin Paliath Jun 21 '13 at 16:15
Why not just use the `Point` class? – splungebob Jun 21 '13 at 16:19
@splungebob you mean `java.awt.Point`? I think it's always a bad idea to use classes intended for some entirely different purpose, just because they have the right properties. The awt point is mutable, can be set with doubles, and can have transformations applied - totally not what we need here. – jmruc Jun 21 '13 at 16:25
@KirilRaychev I have reimplemented Point to use in emebedded systems where java.awt was not available , this is more work, than it looks at first. – AlexWien Jun 21 '13 at 16:38
So if I wanted to iterate through every point in the map without checking every potential mapping, I could use the TreeMap.keySet() function to get a Set of all of the key values and then iterate through them? – Reed B Jun 21 '13 at 17:17
1

@ReedB yes you can. The recommended way is to iterate `entrySet` and not `keySet` as it is more efficient, but either will do. – jmruc Jun 21 '13 at 17:21

score 2 · Answer 3 · answered Jun 21 '13 at 16:11

One approach could be Map<Integer, Map<Integer, Data>>. The key on the outer map is the row value, and the key in the inner map is the column value. The value associated with that inner map (of type Data in this case) corresponds to the data at (row, column). Of course, this won't help if you're looking at trying to do matrix operations or such. For that you'll need sparse matrices.

Another approach is to represent the row and column as a Coordinate class or a Point class. You will need to implement equals and hashCode (should be very trivial). Then, you can represent your data as Map<Point, Data> or Map<Coordinate, Data>.

score 1 · Answer 4 · answered Jun 21 '13 at 16:11

1

You could have a list of lists of an object, and that object can encode it's horizontal and vertical position.

class MyClass
{
    int x;
    int y;
    ...
}

answered Jun 21 '13 at 16:11

Sam I am says Reinstate Monica

30,851
12
72
100

But then every time that a new Object is added, because I want to have a unique set of points, I would have to search through the list of all of the data to see if it already exists before either updating the data point or adding a new one. I was attempting to avoid this inefficient process. – Reed B Jun 21 '13 at 16:48
@ReedB it's not that innefficient, especially if you have a **list of lists** with the outer list corresponding to `x` and the inner list corresponding to `y`. the searching would be **O(x+y)** time complexity – Sam I am says Reinstate Monica Jun 21 '13 at 17:02

score 0 · Answer 5 · answered Jun 21 '13 at 16:11

Maybe I'm being too simplistic here, but I think you can just use a regular HashMap. It would contain custom Point objects as keys:

class Point {
    int x;
    int y;
}

Then you override the equals method (and thus the hashCode method) to be based on x and y. That way you only store points that have some data.

score 0 · Answer 6 · answered Jun 21 '13 at 16:11

0

I think you are on the right track to do this in a memory efficient way - it can be implemented fairly easily by using a map of maps, wrapped in a class to give a clean interface for lookups.

An alternative (and more memory efficient) approach would be to use a single map, where the key was a tuple (x,y). However, this would be less convenient if you need to make queries like 'give me all values where x == some value'.

answered Jun 21 '13 at 16:11

robjohncox

3,639
3
25
51

The map of maps looks promising. As I've said in a couple other comments, if I used a single Map that was a TreeMap then it would have to compare the nodes based on some kind of hashed value generated from the two points, like my original idea of a single BST. If this Map was a linear Map, like a list, then this would be very inefficient because every time that I wanted to add data, I would have to search linearly through the list to see if it already exists before either updating it or adding a new data point. – Reed B Jun 21 '13 at 16:53

score 0 · Answer 7 · answered Jun 21 '13 at 16:13

You might want to look at FlexCompColMatrix, CompColMatrix and other sparse matrices implementations from the Matrix toolkit project.

The performance will really depends on the write/read ratio and on the density of the matrix, but if you're using a matrix package it will be easier to experiment by switching the implementation

score 0 · Answer 8 · answered Jun 21 '13 at 17:08

0

My suggestion to you is use Commons Math: The Apache Commons Mathematics Library. Because it will save your day, by leveraging the math force that your application require.

answered Jun 21 '13 at 17:08

Daniel De León

13,196
5
87
72

Idea for a data structure to store 2D data?

8 Answers8