I'm trying to concatenate a list of DataArrays and then add a dimension so that each DataArray being concatenated is labeled. I thought this was a use case for expand_dims, but after trying a variety of solutions from SO I am stuck. I think I'm missing something elementary about xarray. These seem the closest:
I'm using a pandas DataFrame to compile metadata from filenames, then grouping and iterating through the groups to create datasets, using skimage.io.ImageCollection to load multiple image files into a NumPy array, and ultimately creating xarray objects.
self-contained example
setup
#%% load libraries
import glob
import os  # needed for os.path and os.makedirs below
import re
from itertools import product

import numpy as np
import pandas as pd
import xarray as xr
from PIL import Image
from skimage import io
#%% Synthetic data generator
ext = 'png'
delim = '_'
datadir = os.path.join('data','syn')
os.makedirs(datadir, exist_ok=True)
cartag = ['A1', 'A2']
date = ['2020-05-31', '2020-06-01', '2020-06-02']
frame = ['Fp', 'Fmp']
parameter = ['FvFm','t40', 't60']
list_vals = [cartag, date, frame, parameter]
mesh = list(product(*list_vals))
mesh = np.array(mesh)
for entry in mesh:
    print(entry)
    img = np.random.random_sample((8, 8)) * 255
    img = img.astype('uint8')
    fn = delim.join(entry) + '.png'
    pimg = Image.fromarray(img)
    pimg.save(os.path.join(datadir, fn))
#%% import synthetic images
fns = [
    fn for fn in glob.glob(pathname=os.path.join(datadir, '*%s' % ext))
]
flist = list()
for fullfn in fns:
    fn = os.path.basename(fullfn)
    fn, _ = os.path.splitext(fn)
    f = fn.split(delim)
    f.append(fullfn)
    flist.append(f)
fdf = pd.DataFrame(flist,
                   columns=[
                       'plantbarcode', 'timestamp',
                       'frame', 'parameter', 'filename'
                   ])
fdf = fdf.sort_values(['timestamp', 'plantbarcode', 'parameter', 'frame'])
function definition
#%%
def get_tind_seconds(parameter):
    # Take the first run of digits in the parameter name, e.g. 't40' -> 40.
    tind = re.search(r"\d+", parameter)
    if tind is not None:
        tind = int(tind.group())
    elif parameter == 'FvFm':
        tind = 0
    else:
        raise ValueError("the parameter '%s' is not supported" % parameter)
    return tind
xarray part
dfgrps = fdf.groupby(['plantbarcode', 'timestamp', 'parameter'])
ds = list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    # tind is an integer representing seconds since start of experiment
    tind = get_tind_seconds(parameter)
    # print(tind)
    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()  # imgstack is now 2x8x8 ndarray
    indf = grpdf.frame  # the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                       data=imgstack,
                       dims=('frame', 'y', 'x'),
                       coords={
                           # 'frame': indf,
                           'parameter': [parameter, parameter]
                           # 'tind_s': [tind,tind]
                       },
                       attrs={
                           'jobdate': grpdf.timestamp.unique()[0],
                           'plantbarcode': grpdf.plantbarcode.unique()[0]
                       })
    # arr = arr.expand_dims(
    #     dims={'tind_s': tind}
    # ) #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)
dstest = xr.concat(ds, dim='parameter')
The goal is to have a different file for each day and plantbarcode, so in this case 6 files (2 plant barcodes x 3 dates), where the images are indexable by parameter and frame. tind_s is typically useful for plotting the summary statistic of each image for each parameter, so I'd like to make that a dim/coord too; I'm not sure when to use which. It looks like the dims have to match the data coming in, so in this case 2 frames x 8 x 8 pixels.
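To make the dim-versus-coord distinction concrete, here is a minimal sketch of one possible layout, not a definitive answer. It assumes random arrays in place of the loaded images, uses `parameter` as the concatenation dimension, and attaches `tind_s` as a second label on that same dimension. The variable names `arrs`, `stacked` and `by_time` are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
params = ["FvFm", "t40", "t60"]
tinds = [0, 40, 60]  # seconds, as get_tind_seconds would return them

# One (frame, y, x) array per parameter; random data stands in for the images.
arrs = [
    xr.DataArray(
        rng.integers(0, 255, size=(2, 8, 8), dtype=np.uint8),
        dims=("frame", "y", "x"),
        coords={"frame": ["Fp", "Fmp"]},
    )
    for _ in params
]

# A named pandas Index creates the new dimension and labels it in one step.
stacked = xr.concat(arrs, dim=pd.Index(params, name="parameter"))

# tind_s needs no axis of its own: it is a second label on 'parameter'.
stacked = stacked.assign_coords(tind_s=("parameter", tinds))

# Images are now indexable by parameter and frame.
img = stacked.sel(parameter="t40", frame="Fp")

# If selecting by time is more convenient, swap which label is the index.
by_time = stacked.swap_dims({"parameter": "tind_s"})
img_by_time = by_time.sel(tind_s=40, frame="Fp")
```

In this layout a dimension is an axis of the data, while a coordinate is only a set of labels attached to one or more existing axes, which is why `tind_s` can ride along on `parameter` without changing the array's shape.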
original
I'm using a pandas DataFrame to compile metadata from filenames (here are the first few entries):
   frameid  plantbarcode  experiment             datetime     jobdate  cameralabel                                       filename  frame  parameter
4        5            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0   data/psII/A1-doi-20200531T210155-PSII0-5.png     Fp       FvFm
5        6            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0   data/psII/A1-doi-20200531T210155-PSII0-6.png    Fmp       FvFm
6        7            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0   data/psII/A1-doi-20200531T210155-PSII0-7.png     Fp   t40_ALon
7        8            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0   data/psII/A1-doi-20200531T210155-PSII0-8.png    Fmp   t40_ALon
8        9            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0   data/psII/A1-doi-20200531T210155-PSII0-9.png     Fp   t60_ALon
9       10            A1         doi  2020-05-31 21:01:55  2020-06-01        PSII0  data/psII/A1-doi-20200531T210155-PSII0-10.png    Fmp   t60_ALon
...
Then I group and iterate through the groups to create datasets, using skimage.io.ImageCollection to load multiple image files into a NumPy array, and ultimately create xarray objects:
import os
import cppcpyutils as cppc
import re
from skimage import io
import xarray as xr
import numpy as np
import pandas as pd

# Raw string so that \d is passed to the regex engine unchanged.
delimiter = r"(.{2})-(.+)-(\d{8}T\d{6})-(.+)-(\d+)"
filedf = cppc.io.import_snapshots('data/psII', camera='psII', delimiter=delimiter)
filedf = filedf.reset_index().set_index('frameid')
pimframes_map = pd.read_csv('data/pimframes_map.csv', index_col='frameid')
filedf = filedf.join(pimframes_map, on='frameid').reset_index().query('frameid not in [3,4,5,6]')
dfgrps = filedf.groupby(['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])
ds = list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter)  # tind is an integer representing seconds since start of experiment
    # print(tind)
    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()  # imgstack is now 2x640x480 ndarray
    indf = grpdf.frame  # the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                       data=imgstack,
                       dims=('induction frame', 'y', 'x'),
                       coords={'induction frame': indf},
                       attrs={'plantbarcode': grpdf.plantbarcode.unique()[0],
                              'jobdate': grpdf.jobdate.unique()[0]})
    arr = arr.expand_dims(dims = {'tind_s': tind})  #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)
The expand_dims line raises ValueError: dimensions ('dims',) must have the same length as the number of data dimensions, ndim=0
If I try to follow the second SO post I linked above, where 'tind_s' is provided as a coordinate, it complains that the coordinate's dimensions are not among the DataArray's dims:
ValueError: coordinate tind_s has dimensions ('tind_s',), but these are not a subset of the DataArray dimensions ('induction frame', 'y', 'x')
Then I want to concatenate the arrays so that tind_s is a coordinate:
dstest = xr.concat(ds[0:4], dim='tind_s')
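A likely reading of the first error, assuming the documented signature `DataArray.expand_dims(dim=None, axis=None, **dim_kwargs)`, is that the keyword is `dim` (singular), so `dims=...` is taken as a request for a new dimension literally named `dims`. The documentation also says that an integer value gives the length of the new dimension, whereas a sequence gives its coordinate labels. The sketch below uses zero-filled placeholder arrays instead of the real images, so treat it as an illustration under those assumptions rather than a verified fix for the original data.

```python
import numpy as np
import xarray as xr

frames = ["Fp", "Fmp"]
arrs = []
for tind in (0, 40, 60):
    # Placeholder for one group's image stack: 2 frames of 4x3 pixels.
    arr = xr.DataArray(
        np.zeros((2, 4, 3), dtype="uint8"),
        dims=("induction frame", "y", "x"),
        coords={"induction frame": frames},
    )
    # Keyword is `dim`; wrapping tind in a list makes it the coordinate
    # label of a new length-1 dimension instead of the dimension's length.
    arr = arr.expand_dims(dim={"tind_s": [tind]})
    arrs.append(arr)

# Each array now carries its own tind_s label, so concat can line them up.
stacked = xr.concat(arrs, dim="tind_s")
```

The new dimension is inserted at the front by default, so `stacked` has dims `('tind_s', 'induction frame', 'y', 'x')`.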
another attempt
I did figure out that I can use np.expand_dims() on imgstack and then specify the extra dim and coord, but it results in an array of NaN. Also, the result from xr.concat() is a DataArray instead of a Dataset, so it can't be saved(?). Is there a direct way to do this in xarray?
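On the DataArray-versus-Dataset point, a small sketch may help. It assumes a zero-filled placeholder array and a hypothetical variable name `images`. `DataArray.to_dataset` can either wrap the whole array as a single variable or split it into one variable per label along a dimension. As far as I know a DataArray also has its own `to_netcdf` method, which is not called here because it needs a netCDF backend installed.

```python
import numpy as np
import xarray as xr

params = ["FvFm", "t40", "t60"]
# Placeholder array: 3 parameters x 2 frames x 4x4 pixels.
da = xr.DataArray(
    np.zeros((3, 2, 4, 4), dtype="uint8"),
    dims=("parameter", "frame", "y", "x"),
    coords={"parameter": params, "frame": ["Fp", "Fmp"]},
)

# Option 1: a Dataset with a single variable holding the whole array.
ds_single = da.to_dataset(name="images")  # 'images' is a made-up name

# Option 2: a Dataset with one variable per parameter label.
ds_split = da.to_dataset(dim="parameter")
```

Option 2 drops the `parameter` dimension, so each variable in `ds_split` has dims `('frame', 'y', 'x')`.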
I also converted the attributes to dims
dfgrps = filedf.groupby(
    ['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])
dalist = list()
for grp, grpdf in dfgrps:
    print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter)
    # print(tind)
    print(grpdf.plantbarcode.unique())
    print(grpdf.jobdate.unique()[0])
    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()
    # add three leading length-1 axes for jobdate, plantbarcode and tind_s
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    indf = grpdf.frame  # xr.Variable('induction frame', grpdf.frame)
    # tind = xr.Variable('tind', [tind])
    # print(indf)
    arr = xr.DataArray(data=imgstack,
                       dims=('jobdate', 'plantbarcode', 'tind_s',
                             'induction frame', 'y', 'x'),
                       coords={
                           'plantbarcode': grpdf.plantbarcode.unique(),
                           'tind_s': [tind],
                           'induction frame': indf,
                           'jobdate': grpdf.jobdate.unique()
                       })
    dalist.append(arr)
ds = xr.concat(dalist, dim='jobdate')
after the for loop: print(arr)
<xarray.DataArray (jobdate: 1, plantbarcode: 1, tind_s: 1, induction frame: 2, y: 640, x: 480)>
array([[[[[[0, 0, 0, ..., 0, 0, 0],
[1, 1, 0, ..., 0, 0, 0],
[0, 0, 2, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 1, 0],
[1, 0, 0, ..., 0, 0, 1],
[1, 0, 0, ..., 1, 1, 0]],
[[0, 0, 0, ..., 0, 1, 1],
[2, 2, 0, ..., 0, 0, 1],
[2, 1, 1, ..., 0, 0, 0],
...,
[0, 1, 0, ..., 1, 0, 1],
[1, 0, 0, ..., 0, 1, 1],
[0, 0, 0, ..., 0, 0, 0]]]]]], dtype=uint8)
Coordinates:
* plantbarcode (plantbarcode) object 'A2'
* tind_s (tind_s) int64 60
* induction frame (induction frame) object 'Fp' 'Fmp'
* jobdate (jobdate) datetime64[ns] 2020-06-03
Dimensions without coordinates: y, x
and print(ds)
print(ds)
<xarray.DataArray (jobdate: 18, plantbarcode: 2, tind_s: 3, induction frame: 2, y: 640, x: 480)>
array([[[[[[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 1., ..., 2., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 1., 0., 0., ..., 7., 0., 0.],
[ 0., 2., 4., ..., 0., 0., 4.],
[ 0., 1., 0., ..., 1., 0., 0.]],
[[ 0., 1., 0., ..., 0., 1., 0.],
[ 0., 0., 1., ..., 1., 2., 1.],
[ 0., 1., 1., ..., 1., 0., 0.],
...,
[ 1., 2., 2., ..., 0., 1., 1.],
[ 1., 1., 1., ..., 0., 1., 0.],
[ 0., 0., 2., ..., 0., 0., 1.]]],
[[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]],
[[[ 0., 0., 0., ..., 0., 0., 0.],
[ 1., 1., 0., ..., 0., 0., 0.],
[ 0., 0., 2., ..., 0., 0., 0.],
...,
[ 1., 0., 0., ..., 0., 1., 0.],
[ 1., 0., 0., ..., 0., 0., 1.],
[ 1., 0., 0., ..., 1., 1., 0.]],
[[ 0., 0., 0., ..., 0., 1., 1.],
[ 2., 2., 0., ..., 0., 0., 1.],
[ 2., 1., 1., ..., 0., 0., 0.],
...,
[ 0., 1., 0., ..., 1., 0., 1.],
[ 1., 0., 0., ..., 0., 1., 1.],
[ 0., 0., 0., ..., 0., 0., 0.]]]]]])
Coordinates:
* plantbarcode (plantbarcode) object 'A1' 'A2'
* tind_s (tind_s) int64 0 40 60
* induction frame (induction frame) object 'Fp' 'Fmp'
* jobdate (jobdate) datetime64[ns] 2020-06-01 ... 2020-06-03
Dimensions without coordinates: y, x
I don't understand where the array of NaN comes from. It also seems odd to me that whichever dim is used in concat gets a coord value for each entry (18 files in this case) even though the values are not unique, while the other dims only show unique values.
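One plausible explanation for both observations, assuming `xr.concat` outer-joins the indexes of every dimension other than the one being concatenated (`join="outer"`, passed explicitly below), is sketched here with two tiny placeholder arrays. Each input is padded with NaN out to the union of the `tind_s` labels, while the `jobdate` entries are simply stacked one per input without deduplication. The helper `make` and the array sizes are hypothetical.

```python
import numpy as np
import xarray as xr


def make(tind):
    # Hypothetical stand-in for one group: one jobdate, one tind_s label.
    return xr.DataArray(
        np.ones((1, 1, 2), dtype="uint8"),
        dims=("jobdate", "tind_s", "x"),
        coords={"jobdate": ["2020-06-01"], "tind_s": [tind]},
    )


a, b = make(0), make(40)

# Concatenating along jobdate: the tind_s labels differ, so each input is
# reindexed to the union [0, 40] and the missing cells become NaN.
padded = xr.concat([a, b], dim="jobdate", join="outer")

# Concatenating along the dimension that actually differs leaves no gaps.
clean = xr.concat([a, b], dim="tind_s")
```

Here `padded` has two `jobdate` entries with the same date, half of its cells are NaN, and its integer data has been promoted to a floating-point dtype so that NaN can be represented. `clean` keeps the original `uint8` values.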
In case anyone is willing to download a small dataset, here is a link (sorry, this goes against the advice in the link; I will try to come up with a synthetic dataset that can be generated on the fly).