
I'm trying to concatenate a list of DataArrays and then add a dimension so that each DataArray being concatenated is labeled. I thought this was a use case for expand_dims, but after trying a variety of solutions from SO I am stuck. I think I'm missing something elementary about xarray. These seem the closest:

  1. Add a 'time' dimension to xarray Dataset and assign coordinates from another Dataset to it

  2. Add 'constant' dimension to xarray Dataset

I'm using a pandas DataFrame to compile metadata from filenames, then grouping and iterating through the groups to create datasets, using skimage.io.ImageCollection to load multiple image files into an ndarray, and ultimately creating xarray objects.

self-contained example

setup

#%%  load libraries
import os
from itertools import product
from PIL import Image
import numpy as np
import pandas as pd
import xarray as xr
import glob
from skimage import io
import re

#%% Synthetic data generator
ext = 'png'
delim = '_'

datadir = os.path.join('data','syn')
os.makedirs(datadir, exist_ok=True)
cartag = ['A1', 'A2']
date = ['2020-05-31', '2020-06-01', '2020-06-02']
frame = ['Fp', 'Fmp']
parameter = ['FvFm','t40', 't60']
list_vals = [cartag, date, frame, parameter]
mesh = list(product(*list_vals))
mesh = np.array(mesh)
for entry in mesh:
    print(entry)
    img = np.random.random_sample((8, 8))*255
    img = img.astype('uint8')
    fn = delim.join(entry)+'.png'
    pimg = Image.fromarray(img)
    pimg.save(os.path.join(datadir,fn))

#%% import synthetic images
fns = [
    fn for fn in glob.glob(pathname=os.path.join(datadir, '*%s' % ext))
]
flist = list()
for fullfn in fns:
    fn = os.path.basename(fullfn)
    fn,_ = os.path.splitext(fn)
    f = fn.split(delim)
    f.append(fullfn)
    flist.append(f)

fdf = pd.DataFrame(flist,
                columns=[
                    'plantbarcode', 'timestamp',
                    'frame','parameter', 'filename'
                ])
fdf = fdf.sort_values(['timestamp', 'plantbarcode', 'parameter', 'frame'])

function definition

#%%
def get_tind_seconds(parameter):
    tind = re.search(r"\d+", parameter)
    if tind is not None:
        tind = int(tind.group())
    elif parameter == 'FvFm':
        tind = 0
    else:
        raise ValueError("the parameter '%s' is not supported" % parameter)
    return tind
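For example, on the parameter names used here it maps (restated standalone so the snippet runs on its own):

```python
import re

# Same logic as the function above, restated so this runs by itself.
def get_tind_seconds(parameter):
    tind = re.search(r"\d+", parameter)
    if tind is not None:
        return int(tind.group())
    if parameter == 'FvFm':
        return 0
    raise ValueError("the parameter '%s' is not supported" % parameter)

print(get_tind_seconds('FvFm'))      # 0
print(get_tind_seconds('t40'))       # 40
print(get_tind_seconds('t60_ALon'))  # 60
```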

xarray part

dfgrps = fdf.groupby(['plantbarcode', 'timestamp', 'parameter'])
ds = list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(
        parameter
    )  #tind is an integer representing seconds since start of experiment
    # print(tind)

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()  #imgstack is now 2x8x8 ndarray
    indf = grpdf.frame  #the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                       data=imgstack,
                       dims=('frame', 'y', 'x'),
                       coords={
                    #        'frame': indf,
                           'parameter': parameter  # scalar coord; xr.concat(..., dim='parameter') promotes it to a dim
                    #        'tind_s': [tind,tind]
                       },
                       attrs={
                           'jobdate': grpdf.timestamp.unique()[0],
                           'plantbarcode': grpdf.plantbarcode.unique()[0]
                       })
    # arr = arr.expand_dims(
    #     dims={'tind_s': tind}
    # )  #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)

dstest = xr.concat(ds, dim='parameter')

The goal is to have a different file for each (day, plantbarcode) pair, so in this case 4 files, where the images are indexable by parameter and frame. tind_s is typically useful for plotting a summary statistic of each image for each parameter, so I'd like to make that a dim/coord too; I'm not sure when to use which. It looks like the dims have to match the incoming data, in this case 2 frames x 8x8 pixels.
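For concreteness, here is roughly the structure I'm after, sketched by hand from dummy 8x8 arrays (whether this is the idiomatic way to build it is part of what I'm asking): each stack gets a length-1 parameter dim, with tind_s riding along parameter as a non-dimension coordinate.

```python
import numpy as np
import xarray as xr

# Dummy data standing in for the io.ImageCollection output; the
# parameter/tind pairs are the ones from the synthetic example above.
arrs = []
for parameter, tind in [('FvFm', 0), ('t40', 40), ('t60', 60)]:
    imgstack = np.zeros((2, 8, 8), dtype='uint8')
    arr = xr.DataArray(imgstack,
                       dims=('frame', 'y', 'x'),
                       coords={'frame': ['Fp', 'Fmp']})
    # add a length-1 'parameter' dim labeled with this group's parameter
    arr = arr.expand_dims(parameter=[parameter])
    # attach tind_s as a non-dimension coordinate along 'parameter'
    arr = arr.assign_coords(tind_s=('parameter', [tind]))
    arrs.append(arr)

da = xr.concat(arrs, dim='parameter')
print(da.sel(parameter='t40', frame='Fp').shape)  # (8, 8)
print(da.tind_s.values)  # [ 0 40 60]
```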

original

I'm using a pandas DataFrame to compile metadata from filenames (here are the first few entries):

    frameid plantbarcode    experiment  datetime    jobdate cameralabel filename    frame   parameter
4   5   A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-5.png    Fp  FvFm
5   6   A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-6.png    Fmp FvFm
6   7   A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-7.png    Fp  t40_ALon
7   8   A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-8.png    Fmp t40_ALon
8   9   A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-9.png    Fp  t60_ALon
9   10  A1  doi 2020-05-31 21:01:55 2020-06-01  PSII0   data/psII/A1-doi-20200531T210155-PSII0-10.png   Fmp t60_ALon
...

then grouping and iterating through the groups to create datasets, using skimage.io.ImageCollection to load multiple image files into an ndarray, and ultimately creating xarray objects:

import os
import cppcpyutils as cppc
import re
from skimage import io
import xarray as xr
import numpy as np
import pandas as pd

delimiter = r"(.{2})-(.+)-(\d{8}T\d{6})-(.+)-(\d+)"

filedf = cppc.io.import_snapshots('data/psII', camera='psII', delimiter=delimiter)
filedf = filedf.reset_index().set_index('frameid')

pimframes_map = pd.read_csv('data/pimframes_map.csv',index_col = 'frameid')

filedf = filedf.join(pimframes_map, on = 'frameid').reset_index().query('frameid not in [3,4,5,6]')
dfgrps = filedf.groupby(['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])

ds=list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter) #tind is an integer representing seconds since start of experiment
    # print(tind)

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate() #imgstack is now 2x640x480 ndarray
    indf = grpdf.frame #the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                      data=imgstack,
                      dims=('induction frame','y', 'x'),
                      coords={'induction frame': indf},
                      attrs={'plantbarcode': grpdf.plantbarcode.unique()[0],
                            'jobdate': grpdf.jobdate.unique()[0]})
    arr = arr.expand_dims(dims = {'tind_s': tind}) #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)

The expand_dims line causes:

ValueError: dimensions ('dims',) must have the same length as the number of data dimensions, ndim=0

If I try to follow the second SO post linked above, where I provide 'tind_s' as a coordinate, it complains that there are too many coordinates relative to the dims:

ValueError: coordinate tind_s has dimensions ('tind_s',), but these are not a subset of the DataArray dimensions ('induction frame', 'y', 'x')

Then I want to concat them together where tind_s is a coordinate:

dstest=xr.concat(ds[0:4], dim = 'tind_s')

another attempt

I did figure out that I can use np.expand_dims() on imgstack and then specify the extra dim and coord, but it results in an array of NaN. Also, the result from xr.concat() is a DataArray instead of a Dataset, so it can't be saved(?). Is there a direct way in xarray to do this? I also converted the attributes to dims.

dfgrps = filedf.groupby(
    ['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])

dalist = list()
for grp, grpdf in dfgrps:
    print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter)
    # print(tind)
    print(grpdf.plantbarcode.unique())
    print(grpdf.jobdate.unique()[0])

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    indf = grpdf.frame  #xr.Variable('induction frame', grpdf.frame)
    # tind = xr.Variable('tind', [tind])
    # print(indf)
    arr = xr.DataArray(data=imgstack,
                       dims=('jobdate','plantbarcode', 'tind_s', 'induction frame', 'y',
                             'x'),
                       coords={
                           'plantbarcode': grpdf.plantbarcode.unique(),
                           'tind_s': [tind],
                           'induction frame': indf,
                           'jobdate': grpdf.jobdate.unique()}
    )
    dalist.append(arr)

ds = xr.concat(dalist, dim='jobdate')

After the for loop, print(arr) gives:

<xarray.DataArray (jobdate: 1, plantbarcode: 1, tind_s: 1, induction frame: 2, y: 640, x: 480)>
array([[[[[[0, 0, 0, ..., 0, 0, 0],
           [1, 1, 0, ..., 0, 0, 0],
           [0, 0, 2, ..., 0, 0, 0],
           ...,
           [1, 0, 0, ..., 0, 1, 0],
           [1, 0, 0, ..., 0, 0, 1],
           [1, 0, 0, ..., 1, 1, 0]],

          [[0, 0, 0, ..., 0, 1, 1],
           [2, 2, 0, ..., 0, 0, 1],
           [2, 1, 1, ..., 0, 0, 0],
           ...,
           [0, 1, 0, ..., 1, 0, 1],
           [1, 0, 0, ..., 0, 1, 1],
           [0, 0, 0, ..., 0, 0, 0]]]]]], dtype=uint8)
Coordinates:
  * plantbarcode     (plantbarcode) object 'A2'
  * tind_s           (tind_s) int64 60
  * induction frame  (induction frame) object 'Fp' 'Fmp'
  * jobdate          (jobdate) datetime64[ns] 2020-06-03
Dimensions without coordinates: y, x

and print(ds):
<xarray.DataArray (jobdate: 18, plantbarcode: 2, tind_s: 3, induction frame: 2, y: 640, x: 480)>
array([[[[[[ 0.,  0.,  0., ...,  0.,  0.,  1.],
           [ 0.,  0.,  1., ...,  2.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ...,
           [ 1.,  0.,  0., ...,  7.,  0.,  0.],
           [ 0.,  2.,  4., ...,  0.,  0.,  4.],
           [ 0.,  1.,  0., ...,  1.,  0.,  0.]],

          [[ 0.,  1.,  0., ...,  0.,  1.,  0.],
           [ 0.,  0.,  1., ...,  1.,  2.,  1.],
           [ 0.,  1.,  1., ...,  1.,  0.,  0.],
           ...,
           [ 1.,  2.,  2., ...,  0.,  1.,  1.],
           [ 1.,  1.,  1., ...,  0.,  1.,  0.],
           [ 0.,  0.,  2., ...,  0.,  0.,  1.]]],


         [[[nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
...
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan]]],


         [[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 1.,  1.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  2., ...,  0.,  0.,  0.],
           ...,
           [ 1.,  0.,  0., ...,  0.,  1.,  0.],
           [ 1.,  0.,  0., ...,  0.,  0.,  1.],
           [ 1.,  0.,  0., ...,  1.,  1.,  0.]],

          [[ 0.,  0.,  0., ...,  0.,  1.,  1.],
           [ 2.,  2.,  0., ...,  0.,  0.,  1.],
           [ 2.,  1.,  1., ...,  0.,  0.,  0.],
           ...,
           [ 0.,  1.,  0., ...,  1.,  0.,  1.],
           [ 1.,  0.,  0., ...,  0.,  1.,  1.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]]]]]])
Coordinates:
  * plantbarcode     (plantbarcode) object 'A1' 'A2'
  * tind_s           (tind_s) int64 0 40 60
  * induction frame  (induction frame) object 'Fp' 'Fmp'
  * jobdate          (jobdate) datetime64[ns] 2020-06-01 ... 2020-06-03
Dimensions without coordinates: y, x

I don't understand where the array of NaN comes from. It's also odd to me that whatever dim is used in concat gets a coord value for each entry (18 files in this case) even though they are not unique, while the other dims only show unique values.
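A stripped-down sketch with dummy 1x1 arrays reproduces the NaN (my guess at what is happening: concat's default outer join takes the union of non-matching coords and pads, which also promotes the uint8 data to float):

```python
import numpy as np
import xarray as xr

# The two arrays carry different tind_s values, so concatenating along
# jobdate aligns tind_s by outer join and fills the missing cells with NaN.
a = xr.DataArray(np.ones((1, 1)), dims=('jobdate', 'tind_s'),
                 coords={'jobdate': ['2020-06-01'], 'tind_s': [0]})
b = xr.DataArray(2 * np.ones((1, 1)), dims=('jobdate', 'tind_s'),
                 coords={'jobdate': ['2020-06-02'], 'tind_s': [40]})
c = xr.concat([a, b], dim='jobdate')
print(c.shape)                # (2, 2), not (2, 1): tind_s became the union [0, 40]
print(int(c.isnull().sum()))  # 2 cells are NaN
```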

In case anyone is willing to download a small dataset, here is a link (sorry; against the advice in the link, I will try to come up with a synthetic dataset that can be generated on the fly).

Dominik
  • Turns out that using `np.expand_dims()` workflow means that the final step `xr.concat` results in a DataArray instead of a Dataset so you can't save it to file or use `xarray.Dataset.filter_by_attrs` – Dominik Oct 08 '20 at 23:21
  • if I use dstest.to_dataset(dim='tind_s') after using the np.expand_dims() workflow I get `ValueError: conflicting sizes for dimension 'tind_s': length 2 on 140 and length 3 on 0` – Dominik Oct 08 '20 at 23:28

2 Answers


Your original code contains a subtle bug (typo) in arr.expand_dims(dims={'tind_s': tind}): I guess you want dim instead of dims, the latter being interpreted by xarray as a new dimension label (see the docs). Also, tind is used here as the number of elements to create along the new dimension, which is probably not what you want either.
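For example, passing the new dimension together with its coordinate values does what you are after (a minimal sketch with dummy data, reusing the question's names):

```python
import numpy as np
import xarray as xr

# Corrected call: give expand_dims the new dim name with its coordinate
# values, rather than dims={'tind_s': tind} (which reads tind as a length).
arr = xr.DataArray(np.zeros((2, 8, 8)), dims=('frame', 'y', 'x'))
tind = 40
arr = arr.expand_dims(tind_s=[tind])  # new length-1 dim with coordinate label 40
print(arr.dims)           # ('tind_s', 'frame', 'y', 'x')
print(arr.tind_s.values)  # [40]
```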

Your other attempt (i.e., expanding the data dimensions before creating the DataArray) is a better approach IMO, but it can be further improved. Given that you have multiple labels along the same concatenated dimension, I suggest creating a multi-index and assigning it to that dimension, i.e., something like:

import numpy as np
import pandas as pd
import xarray as xr


da_list = []
props = []
prop_names = ['experiment', 'plantbarcode', 'tind']

for i in range(10):
    tind = i
    indf = ['Fp', 'Fmp']
    data = np.ones((2, 640, 480)) * i
    
    da = xr.DataArray(
        data=data[None, ...],
        dims=('props', 'frame', 'y', 'x'),
        coords={'frame': indf}
    )

    props.append((f'experiment{i}', i*2, i))
    da_list.append(da)


prop_idx = pd.MultiIndex.from_tuples(props, names=prop_names)

da_concat = xr.concat(da_list, 'props')
da_concat.coords['props'] = prop_idx

which gives:

<xarray.DataArray (props: 10, frame: 2, y: 640, x: 480)>
array([[[[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]]],


       [[[1., 1., 1., ..., 1., 1., 1.],
         [1., 1., 1., ..., 1., 1., 1.],
         [1., 1., 1., ..., 1., 1., 1.],
...
         [8., 8., 8., ..., 8., 8., 8.],
         [8., 8., 8., ..., 8., 8., 8.],
         [8., 8., 8., ..., 8., 8., 8.]]],


       [[[9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         ...,
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.]],

        [[9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         ...,
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.]]]])
Coordinates:
  * frame         (frame) <U3 'Fp' 'Fmp'
  * props         (props) MultiIndex
  - experiment    (props) object 'experiment0' 'experiment1' ... 'experiment9'
  - plantbarcode  (props) int64 0 2 4 6 8 10 12 14 16 18
  - tind          (props) int64 0 1 2 3 4 5 6 7 8 9
Dimensions without coordinates: y, x
Benoît
  • Thanks! but in order to save this I think I need a dataset, right? its not clear to me how to make it one. the docs say datasets are dict-like collections of dataarrays. how would you combine dataarrays for each experiment into a dataset? – Dominik Nov 20 '20 at 20:55
  • 1
    Converting a DataArray to a Dataset is pretty simple: `da_concat.to_dataset(name="foo")`. – Ryan Nov 21 '20 at 01:44
  • 1
    You can also directly save a DataArray on disk (it implicitly does what @Ryan mentions). In this case, you need one extra step as multi-indexes are not supported: `da_concat.reset_index('props').to_netcdf('test.nc')`. Then to reload the file: `xr.open_dataset('test.nc').set_index(props=['experiment', 'plantbarcode', 'tind'])`. – Benoît Nov 22 '20 at 09:40

I saw your question on the xarray mailing list. It's hard to debug this question because it's complex and depends on your data. It would be great if you could simplify it a bit and perhaps use synthetic data rather than your data files--see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for advice on this.

It would also be helpful if you shared the output of print(arr) so we can understand the content and structure of your DataArrays.

Ryan