Problem Formulation
Suppose I have several 10000*10000 grids (each can be transformed into a 10000*10000 grayscale image; I will treat "image" and "grid" as the same below). At each grid point there is some value (in my case, the number of copies of a specific gene expressed at that pixel location; note that the locations are the same for every grid). What I want is to quantify the similarity between two 2D spatial point patterns of this kind (i.e., the spatial expression patterns of two distinct genes), and to rank all pairs of genes from "most similar" to "most dissimilar". Note that it is not the spatial pattern in terms of absolute expression level that I care about, but the relative pattern. As a result, I might need to use a correlation rather than a distance metric when comparing corresponding pixels.
The easiest method might be to view all pixels together as a vector and calculate some correlation metric between the two vectors. However, this does not take the spatial information into account. The genes that I am most interested in have spatial patterns, i.e., clustering and autocorrelation affect their expression pattern (though a "cluster" might take a very thin shape rather than a compact one, e.g., genes specific to skin cells). This means the image would usually have several local peak regions, while expression levels at other pixels would be extremely low (near 0).
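To make the baseline concrete, here is a minimal sketch of the flattened-vector correlation, together with one simple (and admittedly crude) way to let nearby pixels count as matching: averaging over small blocks before correlating. The function names `pixelwise_pearson` and `block_mean` and the block size `k` are my own illustrative choices, not an established method.

```python
import numpy as np

def pixelwise_pearson(a, b):
    """Pearson correlation between two equally shaped grids, treated as flat vectors."""
    x = np.asarray(a, dtype=float).ravel()
    y = np.asarray(b, dtype=float).ravel()
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x @ x) * (y @ y))
    # A constant grid has no defined correlation.
    return float(x @ y / denom) if denom > 0 else float("nan")

def block_mean(grid, k):
    """Average over non-overlapping k x k blocks (both sides must be divisible by k)."""
    g = np.asarray(grid, dtype=float)
    h, w = g.shape
    return g.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

# Toy example (made-up data): the same thin stripe, shifted by one pixel.
a = np.zeros((8, 8)); a[:, 2] = 1.0
b = np.zeros((8, 8)); b[:, 3] = 1.0
print(pixelwise_pearson(a, 5 * a))                        # rescaling does not matter
print(pixelwise_pearson(a, b))                            # per-pixel comparison
print(pixelwise_pearson(block_mean(a, 2), block_mean(b, 2)))  # after 2x2 averaging
```

In this toy case the per-pixel correlation of the two stripes is about -0.14 even though they are neighbours, while after 2x2 block averaging it is 1. This illustrates why ignoring spatial information is a problem for thin structures, and also that the answer depends on the chosen block size.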
Possible Directions
I am not sure which of the following I should do: (1) apply image similarity algorithms from image processing that take local structure into account (e.g., SSIM or SIFT, as outlined in "Simple and fast method to compare images for similarity"); (2) apply spatial similarity algorithms from spatial statistics in GIS (there are some papers on this, but I am not sure whether any of them handle simple gridded point data rather than regions with shapes; in GIS terms, I need an algorithm for raster data rather than polygon data); or (3) directly apply statistical methods for discrete 2D distributions, which I think might be a bit crude, since they seem to disregard the regional clustering/autocorrelation effects (cf. Tobler's First Law of Geography).
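As one possible starting point for direction (2) on raster data, here is a rough sketch of a bivariate Moran's I style statistic: the average product of one gene's standardized value at a pixel with the mean standardized value of the other gene over that pixel's neighbours. The choice of queen contiguity (the up to 8 surrounding pixels) with row-standardized weights is an assumption on my part, as are the function names; I have not checked it against a GIS package.

```python
import numpy as np

def standardize(grid):
    """Z-score a grid using the population standard deviation (None if constant)."""
    g = np.asarray(grid, dtype=float)
    s = g.std()
    return None if s == 0 else (g - g.mean()) / s

def queen_lag(z):
    """Mean of the (up to 8) queen-contiguity neighbours of every cell."""
    h, w = z.shape
    zp = np.pad(z, 1)                 # zero padding outside the grid
    op = np.pad(np.ones_like(z), 1)   # same padding, used to count real neighbours
    total = np.zeros_like(z)
    count = np.zeros_like(z)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            total += zp[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            count += op[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return total / count              # assumes the grid is at least 2 x 2

def bivariate_moran(a, b):
    """Mean of z_a(i) times the neighbourhood average of z_b around i."""
    za, zb = standardize(a), standardize(b)
    if za is None or zb is None:
        return float("nan")
    return float((za * queen_lag(zb)).mean())
```

Because both grids are standardized first, the value is unchanged when either gene's expression is rescaled, which matches the "relative pattern" requirement. Note that the statistic is not symmetric in `a` and `b` near the grid border, so one might average `bivariate_moran(a, b)` and `bivariate_moran(b, a)` when ranking pairs.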
For direction (1), I am thinking about a simple method: first find the "peak" regions in each of the two images and take their union as the ROI, then compare the ROI pixels of the two images in a simple pixel-by-pixel way (treating them as a vector). However, I am not sure whether I can replace the distance metrics with correlation metrics, and I am a bit worried that many image-processing similarity methods might not work well when the two images are dissimilar. For direction (2), I think this might be more appropriate because the problem is indeed one of spatial statistics, but I do not yet know where to start in GIS. I guess direction (3) is largely subsumed by (2), so I might not consider it here.
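The ROI idea for direction (1) could be sketched roughly as follows. I am assuming here that a "peak" region is simply the set of pixels at or above a high quantile of the nonzero values; the quantile `q` and the function names are placeholders, and a real peak detector would likely be more careful.

```python
import numpy as np

def peak_mask(grid, q=0.9):
    """Boolean mask of pixels at or above the q-quantile of the nonzero values."""
    g = np.asarray(grid, dtype=float)
    nonzero = g[g > 0]
    if nonzero.size == 0:
        return np.zeros(g.shape, dtype=bool)
    return g >= np.quantile(nonzero, q)

def roi_correlation(a, b, q=0.9):
    """Pearson correlation restricted to the union of both images' peak regions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    roi = peak_mask(a, q) | peak_mask(b, q)
    x, y = a[roi], b[roi]
    # Correlation is undefined for fewer than two pixels or a constant vector.
    if x.size < 2 or x.std() == 0 or y.std() == 0:
        return float("nan")
    return float(np.corrcoef(x, y)[0, 1])
```

Using the union (rather than the intersection) of the two peak regions means that two genes with disjoint peaks get a strongly negative correlation instead of an undefined one, so dissimilar pairs still end up at the bottom of the ranking.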
Sample
Sample image: (There are some issues w/ my own data, so here I borrowed an image from SpatialLIBD http://research.libd.org/spatialLIBD/reference/sce_image_grid_gene.html)
Let's say the value at each pixel is discretely valued between 0 and 10 (could be scaled to [0,1] if needed). The shapes of tissues in the right and left subfigure are a bit different, but in my case they are exactly the same.
PS: There is one potentially serious problem regarding spatial statistics, though. The expression of certain marker genes of a specific cell type might not be clustered in a compact blob, but in a thin layer or an irregular shape. For example, if the grid is a section of the brain, then the high-expression peak region for cortex layer-specific genes (e.g., Ctip2 for layer V) might form a thin, curved arc in the 10000*10000 grid.
UPDATE: I found a method belonging to direction (3), the "optimal transport" problem, that might be useful. It looks like it integrates locality information into the comparison of distributions. I will try testing this approach tomorrow (it seems to be the easiest to code among the three directions?).
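For reference, here is a minimal sketch of what I have in mind for optimal transport: each grid is normalized to total mass 1 (so only the relative pattern matters), and the earth mover's distance is solved exactly as a linear program with `scipy.optimize.linprog`. The linear program has one variable per pair of nonzero cells, so this is only feasible after heavy downsampling; the helper `block_sum`, the block size, and the Euclidean ground cost are my own assumptions. A dedicated library such as POT (its `ot.emd2` function) would presumably be the better tool for anything larger.

```python
import numpy as np
from scipy.optimize import linprog

def block_sum(grid, k):
    """Downsample by summing non-overlapping k x k blocks (sides divisible by k)."""
    g = np.asarray(grid, dtype=float)
    h, w = g.shape
    return g.reshape(h // k, k, w // k, k).sum(axis=(1, 3))

def emd_on_coarse_grid(a, b):
    """Exact earth mover's distance between two small non-negative grids.

    Both grids are normalized to unit mass; the ground cost is the Euclidean
    distance between cell indices, so the result is in units of grid cells.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.sum() <= 0 or b.sum() <= 0:
        raise ValueError("grids must share a shape and have positive total mass")
    p = a.ravel() / a.sum()
    q = b.ravel() / b.sum()
    ys, xs = np.mgrid[0:a.shape[0], 0:a.shape[1]]
    coords = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    src = np.flatnonzero(p > 0)   # only cells carrying mass enter the LP
    dst = np.flatnonzero(q > 0)
    n, m = len(src), len(dst)
    cost = np.linalg.norm(coords[src][:, None, :] - coords[dst][None, :, :], axis=2)
    # Transport plan T is flattened row-major: variable index = i * m + j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # mass leaving source cell i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # mass arriving at target cell j
    b_eq = np.concatenate([p[src], q[dst]])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)
```

A call would look like `emd_on_coarse_grid(block_sum(gene_a, 500), block_sum(gene_b, 500))`, where `gene_a` and `gene_b` stand for two of the 10000*10000 grids and 500 is an arbitrary block size. Smaller values mean more similar patterns, so the ranking order is reversed relative to a correlation.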
Any thoughts would be greatly appreciated!