How do I split a custom dataset into spatially disjoint training and test datasets in Python?

Question

My question is close to this thread, but the difference is that I want my training and test datasets to be spatially disjoint, so that no two samples from the same geographical region end up in different splits. The region can be defined by county, state, or a random geographical grid you create for your own dataset, among others. An example of my dataset is like THIS, which is an instance segmentation task for satellite imagery.

I know PyTorch has this capability for random splitting:

import torch

# Random (non-spatial) 75/25 split; full_dataset is an existing Dataset
train_size = int(0.75 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])

However, what I want is something like a spatially random splitting function. The picture below illustrates the question; in my case, each point is an image with associated labels.

I doubt there's anything built-in. So it's for you to implement. Just look at source code `torch.utils.data.random_split` and make a similar function. Probably some new parameters will be needed for solving your task — Alexey S. Larionov, Feb 08 '22 at 15:21
How is the grid you show defined? Why can't you just pick a random partitioning of these grid squares to define your splits? — jodag, Feb 08 '22 at 15:27
I agree with @jodag. Since each image is from a unique grid cell, any partitioning of this data would ensure that there is no spatial overlap between train and test sets, in other words you can do this exactly as you would do any data partitioning task. Generate a random set of test indices that index your data, and hold these data out during training — DerekG, Feb 08 '22 at 16:21
@jodag I took that picture from an R library (https://github.com/rvalavi/blockCV) to make my point; it is not related to my case. I want to do the same, but in the picture each point represents a row in a data frame, used for spatial resampling in a conventional ML task, whereas in my case I can't create such a data frame because each point represents an image (300×300 pixels). — Sheykhmousa, Feb 09 '22 at 08:46
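
The grid-partitioning suggestion in the comments above can be sketched in plain Python, assuming each image has a known reference coordinate (for example its centre point). The function name `spatial_block_split`, the `cell_size` parameter and the sample coordinates are hypothetical and chosen only for illustration; this is one possible approach, not a PyTorch built-in.

```python
import math
import random
from collections import defaultdict


def spatial_block_split(coords, cell_size, test_fraction=0.25, seed=0):
    """Split sample indices so that train and test never share a grid cell.

    coords: list of (x, y) tuples, one per sample (assumed: image centre points).
    cell_size: side length of a square grid cell, in the same units as coords.
    Returns (train_indices, test_indices).
    """
    # Group sample indices by the grid cell their coordinate falls into.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[cell].append(idx)

    # Shuffle the cells (not the samples) and hold out a fraction of them.
    cell_ids = sorted(cells)
    random.Random(seed).shuffle(cell_ids)
    n_test_cells = max(1, round(test_fraction * len(cell_ids)))
    test_cells = set(cell_ids[:n_test_cells])

    train_idx, test_idx = [], []
    for cell, members in cells.items():
        (test_idx if cell in test_cells else train_idx).extend(members)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical coordinates for eight images spread over four grid cells.
coords = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (1.6, 0.9),
          (0.2, 1.5), (0.4, 1.8), (1.5, 1.5), (1.9, 1.1)]
train_idx, test_idx = spatial_block_split(coords, cell_size=1.0)
```

The resulting index lists could then be wrapped with `torch.utils.data.Subset(full_dataset, train_idx)` and `torch.utils.data.Subset(full_dataset, test_idx)`. Because whole cells are held out, the test fraction applies to cells, so the sample counts match it only approximately.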

score 1 · Answer 1 · answered Feb 08 '22 at 16:27

I am not completely sure what your dataset and labels look like, but from what I see, why not cut the image into predefined chunk sizes like here: https://stackoverflow.com/a/63815878/4471672

Then save each chunk in a different folder according to its location, and sample randomly from whichever set you need (or know to be "spatially disjoint").
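
A minimal sketch of the chunking idea, assuming one large scene is cut into fixed-size tiles and every tile is assigned to a split purely by its position. The names `tile_windows` and `split_by_column`, the 300-pixel tile size and the boundary value are illustrative assumptions, not part of the linked answer.

```python
def tile_windows(height, width, tile):
    """Return (row, col, tile, tile) windows of full, non-overlapping tiles."""
    return [(r, c, tile, tile)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]


def split_by_column(windows, boundary_col):
    """Assign tiles left of boundary_col to train and tiles right of it to test.

    Tiles that straddle the boundary are dropped so the sets never share pixels.
    """
    train, test = [], []
    for r, c, h, w in windows:
        if c + w <= boundary_col:
            train.append((r, c, h, w))
        elif c >= boundary_col:
            test.append((r, c, h, w))
    return train, test


# Hypothetical 600 x 1200 pixel scene cut into 300 x 300 tiles.
windows = tile_windows(height=600, width=1200, tile=300)
train_tiles, test_tiles = split_by_column(windows, boundary_col=900)
```

Each window could then be used to crop the image array, for example `image[r:r + h, c:c + w]`, with the crop saved to a `train/` or `test/` folder. Splitting along a single boundary keeps the two regions contiguous, which may matter if neighbouring tiles are strongly correlated.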

score 0 · Answer 2 · answered Feb 14 '22 at 13:23

I found the answer via the TorchGeo library. Thank you all.

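from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

# `dataset` is assumed to be an existing TorchGeo GeoDataset
sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)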
