The following code transforms a pandas column FEAT into a new binary feature named STREAM. It works as long as FEAT contains no NaN values. When a NaN is present, it raises ValueError: Length of values does not match length of index. I need the NaN values to carry over into the new column. Is that doable? Here is the code that fails:

import pandas as pd
import numpy as np
data = {
    'FEAT': [8, 15, 7, np.nan, 5, 2, 11, 15]
}
customer = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob', 'Sally', 'Mia', 'Luis'])
# create binary variable STREAM (0: mainstream, 1: avant-garde)
stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]
# convert FEAT to list_0
list_0 = customer['FEAT'].values.tolist()
# create a list of length = len(customer) whose elements are:
#  0 if the value of 'FEAT' is in stream_0
#  1 if the value of 'FEAT' is in stream_1
L = []
for i in list_0:
    if i in stream_0:
        L.append(0)
    elif i in stream_1:
        L.append(1)
# convert the list to a new column of customer df
customer['STREAM'] = L
print(customer)
joseph pareti

1 Answer

The issue is that the loop has no else block. When a value (like NaN) is in neither stream_0 nor stream_1, nothing is appended, so L ends up with fewer elements than customer has rows.
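If you would rather keep the loop, a minimal sketch of that fix is below. It reuses the data from the question and only adds the missing else branch; the loop variable name value is my own choice.

```python
import numpy as np
import pandas as pd

customer = pd.DataFrame(
    {"FEAT": [8, 15, 7, np.nan, 5, 2, 11, 15]},
    index=["June", "Robert", "Lily", "David", "Bob", "Sally", "Mia", "Luis"],
)
stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

L = []
for value in customer["FEAT"]:
    if value in stream_0:
        L.append(0)
    elif value in stream_1:
        L.append(1)
    else:
        # NaN (or any value in neither list) still gets a slot,
        # so L stays the same length as the dataframe
        L.append(np.nan)

customer["STREAM"] = L
print(customer)
```

Because exactly one element is appended per row, the assignment no longer raises the length-mismatch error.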

Looping is unnecessary here: np.select can create the column directly. Its default argument plays the role of the else block.

customer['STREAM'] = np.select([customer.FEAT.isin(stream_0), customer.FEAT.isin(stream_1)],
                                [0, 1], default=np.nan)

        FEAT  STREAM
June     8.0     0.0
Robert  15.0     1.0
Lily     7.0     1.0
David    NaN     NaN
Bob      5.0     0.0
Sally    2.0     1.0
Mia     11.0     1.0
Luis    15.0     1.0

You could also map the values with a dictionary; anything that is not a key in the dictionary becomes NaN.

d = {key: value for stream, value in zip([stream_0, stream_1], [0, 1]) for key in stream}
customer['STREAM'] = customer['FEAT'].map(d)

The dict comprehension creates the key-value pairs: every element of stream_0 becomes a key with value 0, and every element of stream_1 becomes a key with value 1. The comprehension is a bit dense, so an easier-to-read way to get the same result is to create each dictionary separately and then combine them.

d_1 = {k: 0 for k in stream_0}
d_2 = {k: 1 for k in stream_1}
d = {**d_1, **d_2}  # Combine
#{1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 1,
# 8: 0, 9: 1, 10: 0, 11: 1, 12: 0, 13: 1, 14: 0, 15: 1}
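As a quick sanity check, the sketch below builds the column both ways on the question's data and compares the results. The names via_select and via_map are only for illustration. Series.equals treats NaN values in the same position as equal, which is why it is used here instead of ==.

```python
import numpy as np
import pandas as pd

customer = pd.DataFrame(
    {"FEAT": [8, 15, 7, np.nan, 5, 2, 11, 15]},
    index=["June", "Robert", "Lily", "David", "Bob", "Sally", "Mia", "Luis"],
)
stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

# Approach 1: np.select, with default acting as the else branch
via_select = pd.Series(
    np.select(
        [customer["FEAT"].isin(stream_0), customer["FEAT"].isin(stream_1)],
        [0, 1],
        default=np.nan,
    ),
    index=customer.index,
)

# Approach 2: map through a dictionary; values without a key become NaN
d = {**{k: 0 for k in stream_0}, **{k: 1 for k in stream_1}}
via_map = customer["FEAT"].map(d)

# True when both columns match, including the NaN row
print(via_select.equals(via_map))
```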
ALollz
  • The second solution works. However, it is very cryptic. Can you explain a bit more or provide easier-to-understand code? Thanks – joseph pareti May 13 '20 at 16:13
  • @josephpareti I added some explanation, and also a more straightforward way to create the dictionary. It's a bit more typing, but I think it's clearer. [`map`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) uses that dict to transform the values – ALollz May 13 '20 at 16:48
  • Thank you. I understand the option with the two dictionaries better, but I still do not understand d = {**d_1, **d_2} – joseph pareti May 13 '20 at 18:12
  • @josephpareti See https://stackoverflow.com/questions/38987/how-do-i-merge-two-dictionaries-in-a-single-expression-in-python. In Python 3.5+ it's a very concise way to merge dictionaries; otherwise you can use `dict.update` – ALollz May 13 '20 at 18:57
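To make the last two comments concrete, here is a small sketch of what {**d_1, **d_2} does and how the dict.update route compares. The names merged_unpack and merged_update are only for illustration.

```python
stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

d_1 = {k: 0 for k in stream_0}
d_2 = {k: 1 for k in stream_1}

# ** unpacks each dict's key-value pairs into the new dict literal
merged_unpack = {**d_1, **d_2}

# Same result with dict.update: copy d_1 first so it is not modified
merged_update = dict(d_1)
merged_update.update(d_2)

print(merged_unpack == merged_update)

# If a key appears in both dicts, the one unpacked later wins
overlap = {**{"a": 1}, **{"a": 2}}
print(overlap)
```

The two streams here share no keys, so the order of unpacking does not matter for d. On Python 3.9 and later, d_1 | d_2 is another way to write the same merge.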