
I want to replace None entries in a specific column in Pandas with an empty list.

Note that some entries in this column may already have an empty list in them, and I don't want to touch those.

I have tried:

indices = np.equal(df[col],None)
df[col][indices] = []

and

indices = np.equal(df[col],None)
df[col][indices] = list()

but both solutions fail with:

ValueError: Length of replacements must equal series length

Why? How can I update those specific rows with an empty list?

Josh
  • in general storing lists in dataframes is not a good idea. What problem are you trying to solve? – Jeff Apr 22 '14 at 18:55
  • Thanks @Jeff. I need to temporarily fill a column in a large DataFrame with lists. I understand that this is not ideal and that, in my case, I could expand those lists into rows of the original dataframe (one row per value of the list), but for now, I need to store these items in the DataFrame in a list within each row, and am wondering how to do this in Pandas. – Josh Apr 22 '14 at 19:00
  • By the way, why is storing lists in a dataframe not a good idea? What about dicts? – exp1orer Apr 22 '14 at 23:25
  • You should ask @Jeff that question. By the way, the question remains unanswered (see the error that I get with exp1orer's solution) – Josh Apr 23 '14 at 12:18
  • @exp1orer Only base types (e.g. int, float, bool), or things that are represented as base types (strings as chars, and datetimes translated to ints), are efficiently stored/manipulated by numpy (as they are stored/accessed in C). Objects (e.g. a list/dict) are possible, but are stored as pointers to the Python object. So it will work but will not be very efficient, and you should avoid ``object`` dtypes at all costs (with the exception of strings, which, though ``object`` dtype, are efficient because they have numpy dtype support). – Jeff Apr 23 '14 at 12:39
  • @Jeff - Do you happen to know why, even after exp1orer edited his answer, his approach does not work? – Josh Apr 23 '14 at 23:44

2 Answers


Assigning a list as a cell value is not allowed, and storing lists in a DataFrame is not recommended at all.

You can do it if you create from scratch

In [50]: DataFrame({ 'A' : [[],[],1]})
Out[50]: 
    A
0  []
1  []
2   1

[3 rows x 1 columns]

The reason this is not allowed is that, without indices (as in numpy, say), you can do something like this:

In [51]: df = DataFrame({ 'A' : [1,2,3] })

In [52]: df.loc[df['A'] == 2] = [ 5 ]

In [53]: df
Out[53]: 
   A
0  1
1  5
2  3

[3 rows x 1 columns]

You can do an assignment where the number of True values in the mask equals the length of the list/tuple/ndarray on the rhs (i.e. the value you are setting). Pandas allows this, as well as a value whose length exactly equals that of the lhs, and a scalar. Anything else is expressly disallowed because it's ambiguous (e.g. do you mean to align it or not?).

For example, imagine:

In [54]: df = DataFrame({ 'A' : [1,2,3] })

In [55]: df.loc[df['A']<3] = [5]
ValueError: cannot set using a list-like indexer with a different length than the value

A 0-length list/tuple/ndarray is considered an error not because it can't be done, but because it is usually a user error and it's unclear what to do.
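
For contrast, here is a minimal sketch of two of the unambiguous forms described above: a scalar on the rhs, and a list whose length matches the number of True values in the mask. The frame and the values are made up for illustration, and behavior may differ slightly between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Scalar on the rhs: the value is broadcast to every selected row.
df.loc[df["A"] == 2, "A"] = 5          # A is now 1, 5, 3

# List on the rhs whose length equals the number of True values in the mask.
mask = df["A"] < 5                     # True, False, True
df.loc[mask, "A"] = [10, 30]           # A is now 10, 5, 30

print(df)
```

Note that the column is named explicitly in `.loc[mask, "A"]`, so only that column is touched.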

Bottom line: don't use lists inside of a pandas object. It's not efficient, and it just makes interpretation difficult or impossible.

Jeff
  • What is the correct way to store a list or tuple in a DataFrame or Series? If possible it is probably better to add a column for each potential value in the list, but what if it's a trait that is variable in length for each row? I realize it may be inefficient to put in a DataFrame, but if I have an entire DataFrame of beautifully indexed data, what do I do with a column of lists that shares the same index? I can make a new Q if you want...thanks – agartland Dec 09 '14 at 20:01
  • You make another frame and index them the same, just like you would in a database-like schema. Join when needed. – Jeff Dec 09 '14 at 21:22
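
The "separate frame with the same index" idea from the last comment could look roughly like the sketch below. The frame names (`main`, `tags`), the column names, and the data are all hypothetical; the variable-length values are stored one per row in a second frame that reuses the main frame's index labels.

```python
import pandas as pd

# Main frame: one row per entity.
main = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 11, 12])

# Second frame: one row per list item, repeating the main frame's index
# label for each item. Label 11 has no items, so it simply does not appear.
tags = pd.DataFrame({"tag": ["x", "y", "z"]}, index=[10, 10, 12])

# Join on the shared index only when the combined view is needed.
joined = main.join(tags, how="left")
print(joined)
```

Rows with several items are repeated in the joined result, and rows with no items get NaN in the `tag` column.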

Edit: Preserved my original answer below, but I put it up without testing it and it actually doesn't work for me.

import pandas as pd
import numpy as np
ser1 = pd.Series(['hi',None,np.nan])
ser2 = pd.Series([5,7,9])
df = pd.DataFrame([ser1,ser2]).T

This is janky, I know. Also, apparently the DataFrame constructor (but not the Series constructor) coerces None to np.nan. No idea why.

df.loc[1,0] = None

So now we have

    0     1
0   'hi'  5
1   None  7
2   NaN   9

df.columns = ['col1','col2']
mask = np.equal(df['col1'], None)
df.loc[mask, 'col1'] = []

But this doesn't assign anything. The dataframe looks the same as before. I'm following the recommended usage from the docs and assigning base types (strings and numbers) works. So for me the problem is assigning objects to dataframe entries. No idea what's up.


(Original answer)

Two things:

  1. I'm not familiar with np.equal but pandas.isnull() should also work if you want to capture all null values.
  2. You are doing what is called "chained assignment". I don't understand the problem fully, but I know it doesn't work reliably. It is covered in the docs.

Try this:

mask = pandas.isnull(df[col])
df.loc[mask, col] = list()

Or, if you only want to catch None and not np.nan:

mask = np.equal(df[col].values, None) 
df.loc[mask, col] = list()

Note: While pandas.isnull works with None on dataframes, series, and arrays as expected, numpy.equal only works as expected with dataframes and arrays. A pandas Series of all None will not return True for any of them. This is due to None only selectively behaving as np.nan. See BUG: None is not equal to None #20442.
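
Since masked assignment of an empty list does not work, one possible workaround is to rebuild the column element by element. This is only a sketch and not something taken from the answers above: the column name `col` and the sample data are made up, and it assumes the column has object dtype with real None entries. It replaces only None (not NaN), leaves existing lists alone, and gives each replaced row its own new list.

```python
import pandas as pd

df = pd.DataFrame({"col": [[1, 2], None, [], None]})

# Build a fresh [] for every None; keep every other value untouched.
# Wrapping the result in a Series with the same index keeps rows aligned.
df["col"] = pd.Series(
    [[] if value is None else value for value in df["col"]],
    index=df.index,
)

print(df)
```

Because the comprehension evaluates `[]` once per row, appending to the list in one row does not affect the lists in other rows.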

Trenton McKinney
exp1orer
  • Wouldn't that potentially affect entries that may hold `NaN` as well? – Josh Apr 22 '14 at 18:57
  • Yep, I thought that was desired behavior. Edited to reflect your question. – exp1orer Apr 22 '14 at 19:01
  • The thing to remember about why chained assignment doesn't reliably work is that for any indexing/slicing operation on a numpy array, you can't always tell whether it has made a copy or a view. If it's a view, you can assign to it, which means chained assignment would work. If it's a copy (e.g., fancy indexing), assignment will not work as expected. – Phillip Cloud Apr 22 '14 at 20:11
  • Will this create a different list for each row? Or will it be using the same list? (i.e. would adding items to one of the lists add items to the other lists?) – Josh Apr 22 '14 at 21:28
  • And by the way, this gives me an error: `ValueError: Must have equal len keys and value when setting with an iterable` – Josh Apr 22 '14 at 21:29
  • Huh, I'm actually running into a different problem. I'll edit my answer to reflect the fact that I haven't answered your question. – exp1orer Apr 23 '14 at 17:28