I am not sure whether this is a bug in dask or a feature of Python. A simple example:

import pandas as pd
import dask.dataframe as dd
data = pd.DataFrame({'tags': [['dog'], ['cat', 'red'], ['cat'], ['cat', 'red'], ['cat', 'red'], ['dog', 'red']]})
print(data)

          tags
0       [dog]
1  [cat, red]
2       [cat]
3  [cat, red]
4  [cat, red]
5  [dog, red]

I want to create a boolean "one-hot" column for each tag:

tags = ['cat', 'dog', 'red']

Using dask:

data = dd.from_pandas(data, npartitions=4)

for tag in tags:
    data[tag] = data.tags.apply(lambda x: tag in x, meta=(tag, bool))

The result is wrong:

print(data.compute())
         tags    cat    dog    red
0       [dog]  False  False  False
1  [cat, red]   True   True   True
2       [cat]  False  False  False
3  [cat, red]   True   True   True
4  [cat, red]   True   True   True
5  [dog, red]   True   True   True

It seems that the lambda is always bound to the last tag in the loop (red): every new column holds the result of 'red' in x. If I unroll the loop manually, it works correctly.

Using plain pandas I don't have this problem.

Partial solution

def is_in(items, value):
    return value in items

for tag in tags:
    data[tag] = data.tags.apply(is_in, value=tag, meta=(tag, bool))

I don't like it very much, since it forces a rather unnatural argument order (the items first, then the value to look for). Besides, I am not sure I have understood the original problem.
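
A possible alternative that keeps a natural argument order is a small factory function that returns one checker per tag. This is only a sketch in plain Python, without dask: the name make_tag_check is hypothetical, and the commented-out line assumes that apply accepts any one-argument callable, as it does in the loop above.

```python
def make_tag_check(tag):
    """Return a function that tests whether `tag` is in a list of items."""
    def check(items):
        # `tag` here is the parameter of this particular make_tag_check call,
        # so each returned function keeps its own tag.
        return tag in items
    return check


def apply_checks(tags, items):
    """Build one checker per tag, then call them all afterwards."""
    checks = [make_tag_check(tag) for tag in tags]
    return [check(items) for check in checks]


results = apply_checks(['cat', 'dog', 'red'], ['cat', 'red'])
print(results)  # [True, False, True]

# Assumed usage with the dask loop from the question:
# for tag in tags:
#     data[tag] = data.tags.apply(make_tag_check(tag), meta=(tag, bool))
```

Each call to make_tag_check creates a new scope, so the returned function does not depend on the loop variable.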

Ruggero Turra

1 Answer

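The answer is in "What do (lambda) function closures capture?", and it comes down to Python's lexical scoping. The lambda captures the variable tag itself, not the value it had when the lambda was created. dask is lazy, so the lambdas are only called at compute(), after the loop has finished and tag is 'red' for all of them. pandas calls each lambda immediately inside the loop, which is why the problem does not show up there.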
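
The difference between calling the lambda immediately and calling it later can be reproduced in plain Python, without dask or pandas. This is a minimal sketch; the function name compare_eager_and_deferred is made up for illustration, and the row ['dog'] corresponds to row 0 of the example data.

```python
def compare_eager_and_deferred(tags, row):
    """Call the same lambda inside the loop and again after the loop."""
    eager_results = []
    deferred_checks = []
    for tag in tags:
        check = lambda x: tag in x
        # Called right away, while `tag` still holds the current value
        # (this is what an eager apply does).
        eager_results.append(check(row))
        # Stored for later, like a task in a lazy graph.
        deferred_checks.append(check)
    # Called after the loop: every lambda now sees the final value of `tag`.
    deferred_results = [check(row) for check in deferred_checks]
    return eager_results, deferred_results


eager, deferred = compare_eager_and_deferred(['cat', 'dog', 'red'], ['dog'])
print(eager)     # [False, True, False]
print(deferred)  # [False, False, False]
```

The deferred results match row 0 of the wrong output in the question: all three columns contain the answer for 'red'.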

Better solution: use default values with lambda

for tag in tags:
    data[tag] = data.tags.apply(lambda x, t=tag: t in x, meta=(tag, bool))
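
Why this works: a default value is evaluated once, when the lambda is created, and stored on the function object. A short plain-Python sketch (build_bound_checks is a hypothetical helper name, not part of dask):

```python
def build_bound_checks(tags):
    """Create one lambda per tag, binding the tag through a default value."""
    checks = []
    for tag in tags:
        # `t=tag` is evaluated now, so each lambda stores its own tag.
        checks.append(lambda x, t=tag: t in x)
    return checks


def run_checks(checks, row):
    """Call every stored lambda on the same row, after the loop is over."""
    return [check(row) for check in checks]


checks = build_bound_checks(['cat', 'dog', 'red'])
stored_defaults = [check.__defaults__ for check in checks]
print(stored_defaults)               # [('cat',), ('dog',), ('red',)]
print(run_checks(checks, ['cat']))   # [True, False, False]
```

One caveat: the extra parameter t is part of the lambda's signature, so a caller that passes a second positional argument would silently override the bound tag.
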
Ruggero Turra