I am a beginner in pandas and NumPy.

I am working with the dataset mentioned in this paper.

I have several images. Each image is described by visual descriptors such as CM, CN, and GLRLM (the meaning of these descriptors is not important here), and each descriptor is essentially a list of numbers.

So my data structure is:

idsDict = {
    12312: {
         "CM": [2, 3, 1, 5, 1],
         "CN" : [1, 4, 5, 1]
    },
    21367: {
         "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
         "CM"   : [1, 6, 8, 1, 34]
    }
}

Here 12312 and 21367 are the image ids.

I want to convert this to a tensor, a 3D NumPy array, or a 3D pandas DataFrame so that I can compute distances between images based on their descriptors.

The resulting structure should be a cuboid: the rows are the image ids, the columns are the descriptors, and the z-axis holds the values of each descriptor.

I have read:

Construct pandas DataFrame from items in nested dictionary

Pandas dataframe to dict of dict

  • can you post the expected output structure? – Vivek Kalyanarangan Nov 15 '18 at 09:14
  • The issue, I think, will be that your descriptors have different lengths, and that different images have different descriptors. That kind of heterogeneity makes using numpy or pandas tricky. – tel Nov 15 '18 at 09:22
  • I think you need to fix the syntax of your data structure. The top line should probably be `idsDict = {`, and there seems to be an unnecessary level of nested curly brackets. – tel Nov 15 '18 at 09:34
  • @tel fixed the extra { – coda Nov 15 '18 at 09:54
  • @VivekKalyanarangan do you want a diagram? Is the last paragraph nebulous in explaining the structure? – coda Nov 15 '18 at 09:57

1 Answer

In terms of computational speed, you'd probably be best off using a NumPy masked structured array:

import numpy as np

idsDict = {
    12312: {
      "CM": [2, 3, 1, 5, 1],
      "CN" : [1, 4, 5, 1]
    },
    21367: {
      "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
      "CM"   : [1, 6, 8, 1, 34]
    }
}

# loop through once to figure out size of final data structure
dscr = {}
maxlen = 0
for d in idsDict.values():
    for descName,desc in d.items():
        if descName not in dscr:
            # np.obj2sctype was removed in NumPy 2.0; store every descriptor
            # as float64, which matches the output shown further down
            dscr[descName] = np.float64
        if len(desc) > maxlen:
            maxlen = len(desc)

# allocate a masked structured array of the right shape and dtype
dtype = np.dtype(sorted(dscr.items()))
_data3d = np.empty((len(idsDict), maxlen), dtype=dtype)
data3d = np.ma.array(_data3d, mask=True)

# copy the data over the array
for d,drow in zip(idsDict.values(), data3d):
    for descName,desc in d.items():
        drow[descName][:len(desc)] = desc

print(data3d.dtype.names,'\n')
print(data3d.T)

Which outputs:

('CM', 'CN', 'GLRLM')

[[(2.0, 1.0, --) (1.0, --, 9.0)]
 [(3.0, 4.0, --) (6.0, --, 4.0)]
 [(1.0, 5.0, --) (8.0, --, 1.0)]
 [(5.0, 1.0, --) (1.0, --, 4.0)]
 [(1.0, --, --) (34.0, --, 5.0)]
 [(--, --, --) (--, --, 12.0)]
 [(--, --, --) (--, --, 67.0)]
 [(--, --, --) (--, --, 12.0)]]
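
If a plain 3D cuboid (images × descriptors × values) is closer to what you have in mind, a NaN-padded float array is one possible alternative to the masked structured array. The following is only a minimal sketch under that assumption; the names `image_ids`, `desc_names`, and `cube` are illustrative and do not appear elsewhere in this answer. Keeping `image_ids` and `desc_names` as separate lists is what lets you map array positions back to ids and descriptor names.

```python
import numpy as np

idsDict = {
    12312: {"CM": [2, 3, 1, 5, 1], "CN": [1, 4, 5, 1]},
    21367: {"GLRLM": [9, 4, 1, 4, 5, 12, 67, 12], "CM": [1, 6, 8, 1, 34]},
}

# Axis labels are kept in plain lists (illustrative names, an assumption of this sketch)
image_ids = list(idsDict)
desc_names = sorted({name for d in idsDict.values() for name in d})
maxlen = max(len(v) for d in idsDict.values() for v in d.values())

# Start with an all-NaN cuboid, then copy each descriptor into its slot
cube = np.full((len(image_ids), len(desc_names), maxlen), np.nan)
for i, img in enumerate(image_ids):
    for j, name in enumerate(desc_names):
        values = idsDict[img].get(name, [])  # missing descriptor stays all NaN
        cube[i, j, :len(values)] = values

print(cube.shape)
print(cube[image_ids.index(12312), desc_names.index("CM")])
```

Missing descriptors and unused trailing positions both end up as NaN, so any distance computation on `cube` has to decide how to treat NaN (for example with `np.nansum`).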

Unfortunately, there's no good way to keep the image ids in the NumPy structured array. If you need those, you can use pandas instead. pandas has no true 3D DataFrame, but you can squeeze all your data into a single DataFrame whose row index has two levels (image id, then position within the descriptor):

import numpy as np
import pandas as pd

idsDict = {
    12312: {
      "CM": [2, 3, 1, 5, 1],
      "CN" : [1, 4, 5, 1]
    },
    21367: {
      "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
      "CM"   : [1, 6, 8, 1, 34]
    }
}

# loop through once to figure out size of final data structure
descNames = set()
maxlen = 0
for d in idsDict.values():
    for descName,desc in d.items():
        descNames.add(descName)
        if len(desc) > maxlen:
            maxlen = len(desc)

# pad data with NaN (note: this modifies the lists inside idsDict in place)
padDesc = maxlen*[np.nan]
for d in idsDict.values():
    for desc in d.values():
        dlen = len(desc)
        if dlen < maxlen:
            desc.extend((maxlen - dlen)*[np.nan])
    for descName in (n for n in descNames if n not in d):
        d[descName] = padDesc

# stack one DataFrame per image; the image ids become the outer index level
data3d = pd.concat([pd.DataFrame(d) for d in idsDict.values()], keys=list(idsDict.keys()))
print(data3d)

This outputs:

           CM   CN  GLRLM
12312 0   2.0  1.0    NaN
      1   3.0  4.0    NaN
      2   1.0  5.0    NaN
      3   5.0  1.0    NaN
      4   1.0  NaN    NaN
      5   NaN  NaN    NaN
      6   NaN  NaN    NaN
      7   NaN  NaN    NaN
21367 0   1.0  NaN    9.0
      1   6.0  NaN    4.0
      2   8.0  NaN    1.0
      3   1.0  NaN    4.0
      4  34.0  NaN    5.0
      5   NaN  NaN   12.0
      6   NaN  NaN   67.0
      7   NaN  NaN   12.0
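
Since the goal in the question is to find distances between images, here is a hedged sketch of how one descriptor could be compared between two images in such a two-level DataFrame. The function `descriptor_distance` is a hypothetical helper, not a pandas API, and the choice of Euclidean distance over the positions both images share is an assumption; the question does not say which metric is wanted. The sketch rebuilds `data3d` on its own (letting `pd.Series` do the NaN padding) so that it runs standalone.

```python
import numpy as np
import pandas as pd

idsDict = {
    12312: {"CM": [2, 3, 1, 5, 1], "CN": [1, 4, 5, 1]},
    21367: {"GLRLM": [9, 4, 1, 4, 5, 12, 67, 12], "CM": [1, 6, 8, 1, 34]},
}

# One DataFrame per image; Series of unequal length are padded with NaN
frames = {
    img: pd.DataFrame({name: pd.Series(vals, dtype=float) for name, vals in d.items()})
    for img, d in idsDict.items()
}
# Dict keys become the outer index level; missing descriptor columns become NaN
data3d = pd.concat(frames)


def descriptor_distance(df, id_a, id_b, name):
    """Euclidean distance over positions where both images have a value.

    Hypothetical helper for illustration; returns NaN if nothing overlaps.
    """
    a = df.loc[id_a][name]
    b = df.loc[id_b][name]
    diff = (a - b).dropna()  # subtraction aligns on the inner index
    if diff.empty:
        return float("nan")
    return float(np.sqrt((diff ** 2).sum()))


print(descriptor_distance(data3d, 12312, 21367, "CM"))
print(descriptor_distance(data3d, 12312, 21367, "CN"))  # no overlap, so NaN
```

Dropping the non-overlapping positions is only one option. Depending on what the descriptors mean, treating a missing descriptor as infinitely far away, or skipping it when combining several descriptors, may be more appropriate.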