Why does adding a second column to a dataframe prohibit usage of loc to set an array as a value?

Question

If I create a dataframe with a single column

a = pd.DataFrame({'x': [np.array([1,2,3,4]), np.array([1,2,3])]})

then reassign the value of the first row

a.loc[0, 'x'] = a.loc[0, 'x']

a.loc[0, 'x'] is unchanged. All is well!

However, if I add a second column

a = pd.DataFrame({'x': [np.array([1,2,3,4]), np.array([1,2,3])], 'y':[1,2]})

then a.loc[0, 'x'] = a.loc[0, 'x'] throws the error:

ValueError: Must have equal len keys and value when setting with an iterable

Can someone explain what I'm doing wrong here? I've found a workaround (use set_value instead of loc), but I'd like to know why loc doesn't work.

Also, is this an appropriate usage of the pandas DataFrame? I have a bunch of vectors x that I would like to associate with some other variables and an index, and a DataFrame seemed to be the best way to store them and run operations on them (df.apply works very well to perform operations in bulk on these arrays!).

You have a bunch of `numpy.arrays`. Generally, if I see a `pandas.DataFrame` of objects like `np.ndarray` or `list` or `dict`, my instinct is to say this isn't the right data structure. `df.apply` is a last-resort utility method, which is essentially a python for-loop with some extra overhead. What are you trying to accomplish, exactly? — juanpa.arrivillaga, Jun 05 '17 at 20:51
Anyway, your error is occurring because there is ambiguity as to what exactly you want. `pandas` isn't really meant to work with sequences/iterables as values because of this ambiguity: are you trying to set the *element* as the sequence, or use the *sequence* to set multiple elements, which is a common operation (e.g. `a.loc[:, 'x'] = ['foo','bar']`)? `pandas` is assuming the latter, and it throws an error because the dimensions don't match. You could use `a.set_value(0, 'x', a.loc[0,'x'])`, which removes the ambiguity. — juanpa.arrivillaga, Jun 05 '17 at 20:54
Thanks! That's exactly the explanation I was looking for. I have a bunch of vectors defined by (x,y) data, and each vector is associated with a number of other properties. I want to filter and sort by these properties, then plot the filtered vectors or run calculations on these vectors using the properties. `pandas` seemed like the best (least prone to error) way to work with the bulk data, but if you have alternatives that you think may work more fluidly, I'd love to hear them! — Robert Yi, Jun 05 '17 at 21:05
A list of `namedtuples`/`tuples`? I'm not sure exactly what you want. Anyway, although `pandas` isn't really meant for working with these sorts of elements, it *does* work if you don't mind these sorts of caveats (like the one you've encountered already). — juanpa.arrivillaga, Jun 05 '17 at 21:08

score 0 · Answer 1 · answered Jun 05 '17 at 23:23

You can use set_value:

a.set_value(0,'x',a.loc[0, 'x'])
Out[619]: 
              x  y
0  [1, 2, 3, 4]  1
1     [1, 2, 3]  2
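
Note that set_value was deprecated in pandas 0.21 and removed in 1.0. On newer versions, the scalar accessor `.at` should fill the same role, since it always addresses exactly one cell and so avoids the "element or sequence?" ambiguity. A minimal sketch, assuming a recent pandas and an object-dtype column; `new_value` is a made-up array used only for illustration:

```python
import numpy as np
import pandas as pd

# Two-column frame from the question; column 'x' holds arrays (object dtype).
a = pd.DataFrame({'x': [np.array([1, 2, 3, 4]), np.array([1, 2, 3])],
                  'y': [1, 2]})

# Hypothetical replacement value, not from the original post.
new_value = np.array([9, 8, 7])

# .at targets a single (row, column) cell, so the whole array is stored
# as one element rather than being spread across several cells.
a.at[0, 'x'] = new_value

print(a)
```

The other row and the 'y' column are left untouched. This relies on 'x' already being an object column; assigning an array into a numeric column this way is likely to fail.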

answered Jun 05 '17 at 23:23

Allen Qin
