
My question is close to this thread, but the difference is that I want my training and test datasets to be spatially disjoint, so that no geographical region contributes samples to both sets. The region can be defined by county, state, or a random geographical grid you create for your own dataset, among others. An example of my dataset is like THIS, which is an instance segmentation task for satellite imagery.

I know PyTorch has this capability for random splitting:

train_size = int(0.75 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])

However, what I want is a spatially random splitting functionality. The picture below illustrates the question; in my case each point is an image with associated labels.

[Image: points overlaid on a spatial grid of blocks, taken from the blockCV R package]

Sheykhmousa
  • I doubt there's anything built-in, so it's up to you to implement it. Just look at the source code of `torch.utils.data.random_split` and make a similar function. Some new parameters will probably be needed to solve your task – Alexey S. Larionov Feb 08 '22 at 15:21
  • How is the grid you show defined? Why can't you just pick a random partitioning of these grid squares to define your splits? – jodag Feb 08 '22 at 15:27
  • I agree with @jodag. Since each image is from a unique grid cell, any partitioning of this data would ensure that there is no spatial overlap between train and test sets, in other words you can do this exactly as you would do any data partitioning task. Generate a random set of test indices that index your data, and hold these data out during training – DerekG Feb 08 '22 at 16:21
  • @jodag I took that picture from an R library (https://github.com/rvalavi/blockCV) to make my point; it is not related to my case. I want to do the same, but in the picture each point represents a row in a data frame, used for spatial resampling in a conventional ML task, while in my case I can't create such a data frame because each point represents an image (300*300 pixels). – Sheykhmousa Feb 09 '22 at 08:46
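The grid-based partitioning suggested in the comments could be sketched roughly as follows. This is only a minimal illustration, assuming each image has a known centre coordinate; the function name `spatial_block_split`, the `coords` list, and `cell_size` are hypothetical and not part of PyTorch.

```python
import math
import random
from collections import defaultdict


def spatial_block_split(coords, cell_size, test_fraction=0.25, seed=0):
    """Split sample indices so that no grid cell contributes to both sets.

    coords: list of (x, y) positions, one per sample (hypothetical input).
    cell_size: side length of a square grid cell, in the same units as coords.
    """
    # Group sample indices by the grid cell their coordinate falls into.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[cell].append(idx)

    # Shuffle the cells (not the samples) and hold out a fraction of them.
    cell_ids = sorted(cells)
    random.Random(seed).shuffle(cell_ids)
    n_test_cells = max(1, round(test_fraction * len(cell_ids)))
    test_cells = set(cell_ids[:n_test_cells])

    train_idx = [i for c in cell_ids if c not in test_cells for i in cells[c]]
    test_idx = [i for c in test_cells for i in cells[c]]
    return sorted(train_idx), sorted(test_idx)


# Toy example: 100 image centres on a 10 x 10 lattice, grouped into 5 x 5 blocks.
coords = [(float(x), float(y)) for x in range(10) for y in range(10)]
train_idx, test_idx = spatial_block_split(coords, cell_size=5.0, test_fraction=0.25)
```

The resulting index lists can then be wrapped with `torch.utils.data.Subset(full_dataset, train_idx)` and `torch.utils.data.Subset(full_dataset, test_idx)`. Note that `test_fraction` here applies to grid cells rather than to samples, so the sample ratio will drift if the cells are unevenly populated.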

2 Answers


I am not completely sure what your dataset and labels look like, but from what I see, why not cut the image into chunks of a predefined size, like here: https://stackoverflow.com/a/63815878/4471672

Then save each chunk in a different folder according to its location, and sample randomly from whichever set you need (or know to be "spatially disjoint").
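As a rough sketch of this chunking idea, the snippet below computes non-overlapping tile windows for one large image and assigns each tile to a split by its position. The helper names `tile_windows` and `assign_tiles` and the `test_from_x` boundary are hypothetical; the vertical boundary is just one simple way to keep the two sets spatially separate.

```python
def tile_windows(height, width, tile):
    """Yield (y0, x0, y1, x1) for non-overlapping tiles covering an image."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Clip the last row/column of tiles to the image border.
            yield y0, x0, min(y0 + tile, height), min(x0 + tile, width)


def assign_tiles(height, width, tile, test_from_x):
    """Put tiles starting at or east of test_from_x into the test set."""
    splits = {"train": [], "test": []}
    for y0, x0, y1, x1 in tile_windows(height, width, tile):
        name = "test" if x0 >= test_from_x else "train"
        splits[name].append((y0, x0, y1, x1))
    return splits


# Example: a 1200 x 1200 scene cut into 300 x 300 chunks,
# with the easternmost column of chunks held out for testing.
splits = assign_tiles(height=1200, width=1200, tile=300, test_from_x=900)
```

Each window can then be used to crop the image array (for example `image[y0:y1, x0:x1]`) and save the chunk into a folder named after its split.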

Yev Guyduy

I found the answer via the TorchGeo library. Thank you all.

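from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

# dataset is assumed to be a torchgeo GeoDataset built elsewhere
sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)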
Sheykhmousa