I would like to fill single cells of a pandas DataFrame with 1D arrays (or lists, whatever works). What I am trying to accomplish is to iterate over the values in one column and perform some action that results in an array, which I would like to store in another column of that DataFrame. I think it is not possible (or at least very inconvenient) to do this with a lambda function, because the computation is too long for an inline expression.

For every value of OvenTemp, another function, "sample", creates a lot of 3D coordinates, and NNdist(samples) gives a list of the nearest-neighbour (NN) distances for each of the 3D coordinates. There is some averaging over different samples, which should not make a big difference for the coding; in the end, there is a long 1D numpy array of floats that holds all the NN distances. I would like to store this array in the column "NNDistList", in the row of the respective OvenTemp. The length of NNDistList differs from row to row, since the external function "sample" creates ensembles of coordinates of varying size.

The code looks like this:

distances=pd.DataFrame({'OvenTemp':OvenTemp,'NNDistList':np.zeros(len(OvenTemp))})
for i in distances.OvenTemp:
    avgrun=1
    avgresult=np.array([])
    while avgrun<=averaging:
        samples=sample(df.Radius[distances.OvenTemp==i].values[0],df.Dopants[df.OvenTemp==i].values[0])
        NNresult=NNdist(samples)
        avgresult=np.append(avgresult,NNresult)
        avgrun+=1
    distances.loc[distances.OvenTemp==i,'NNDistList']=avgresult

Running this code, I obtain the error "Must have equal len keys and value when setting with an iterable".

I tried to use a wrapper for the last line, as suggested here, so it would read

distances.loc[distances.OvenTemp==i,'NNDistList']=[avgresult]

but the error stays the same. The example given in the answer to the older question, which creates a new DataFrame and fills one cell with a numpy array, works for me. However, adding or replacing one element of an already existing DataFrame with a numpy array (as in my code) still produces this error.

I would appreciate any workaround for this problem, but also any advice on how this could be coded more efficiently in general.

Thank you very much, lepakk

Lepakk

1 Answer

To assign an array (or any other iterable) to a single cell, the column must have object dtype. Then use at instead of loc to select the cell to assign the array to.

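import numpy as np
import pandas as pd

# A column initialised with None gets object dtype, so a cell can hold an array
distances = pd.DataFrame({'OvenTemp': [1,2,3], 'NNDistList': None})
distances.info()
#NNDistList    0 non-null object

# at addresses exactly one cell: the label of the matching row and the column name
# (np.array replaces pd.np.array; the pd.np alias is gone from newer pandas versions)
distances.at[distances[distances.OvenTemp==1].index[0], 'NNDistList'] = np.array([2,3,4])

print(distances)
# (the column order in the output depends on the pandas version)
#  NNDistList  OvenTemp
#0  [2, 3, 4]         1
#1       None         2
#2       None         3
# look the cell up by label so the check does not depend on column order
type(distances.at[0, 'NNDistList'])
#<class 'numpy.ndarray'>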
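
Applied to the loop from the question, the same idea might look roughly like the sketch below. The functions sample and nn_dist are hypothetical stand-ins for the external functions in the question (their real behaviour is not shown there), and the three-row df is made-up input. The sketch also collects the runs in a list and concatenates once, which avoids the repeated copying that np.append does inside a loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample(radius, dopants):
    # Hypothetical stand-in: `dopants` random 3D points in a cube of edge `radius`
    return rng.random((dopants, 3)) * radius

def nn_dist(coords):
    # Hypothetical stand-in: distance from each point to its nearest neighbour
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each point's zero distance to itself
    return dist.min(axis=1)

averaging = 2  # number of sample runs pooled per row
# Made-up input data, only for illustration
df = pd.DataFrame({'OvenTemp': [300, 400, 500],
                   'Radius': [1.0, 2.0, 3.0],
                   'Dopants': [5, 6, 7]})

# Initialising with None gives the column object dtype
distances = pd.DataFrame({'OvenTemp': df['OvenTemp'], 'NNDistList': None})

for row in df.itertuples():
    runs = [nn_dist(sample(row.Radius, row.Dopants)) for _ in range(averaging)]
    # One concatenation per row, then store the whole array in a single cell
    distances.at[row.Index, 'NNDistList'] = np.concatenate(runs)

# Each cell holds an array of a different length
print(distances['NNDistList'].map(len).tolist())  # [10, 12, 14]
```

Iterating with itertuples gives the row label directly, so no boolean mask is needed to find the cell.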
Stef
  • Thank you very much, it worked. Can you tell me what the difference between loc and at is? When I try to read the cell, it works with loc as well. – Lepakk Dec 05 '19 at 08:59
  • I can't really explain why loc doesn't work in this case, but `at` is for accessing a *single* cell whereas `loc` is for accessing a *group of cells*. So with `loc` it's natural for pandas to assume that the individual values of the array should be assigned to the individual cells in the group (which is, indeed, the usual use case), and there's no way of telling pandas that you want it the other way round (i.e. all values into one cell) in this particular case. – Stef Dec 05 '19 at 11:46
  • Thank you again. I searched a bit more for information, and in my case, where I want to access one cell at a time, `at` seems to be a lot faster than `loc` [(See this thread)](https://stackoverflow.com/questions/37216485/pandas-at-versus-loc). – Lepakk Dec 06 '19 at 09:19
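
The difference described in the comments can be illustrated with a small sketch. The frame and values are made up for illustration: `loc` with a selection of three rows spreads a three-element array over the rows, one element per cell, while `at` puts the whole array into the one cell it addresses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'OvenTemp': [1, 2, 3], 'NNDistList': None})
values = np.array([2.0, 3.0, 4.0])

# loc selects a group of cells (here all three rows of one column),
# so the array is distributed element by element
spread = df.copy()
spread.loc[spread.OvenTemp >= 1, 'NNDistList'] = values

# at selects exactly one cell, so the array is stored as a single object
nested = df.copy()
nested.at[0, 'NNDistList'] = values

print(spread)
print(nested)
```

In `spread`, every row holds one float; in `nested`, row 0 holds the full array and the other rows stay None.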