4

I have a DataFrame and I want to populate 'column3' with the value of column 'name' if column 'gender' is empty, and otherwise with the value of column 'gender'.

import numpy as np
import pandas as pd
vals = {
    'name' : ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender' : ['', '', '', 'f',  'f', 'c', 'c'],
    'age' : [39, 12, 27, 13, 36, 29, 10]
}
df4 = pd.DataFrame(vals)
df4['column3'] = df4['name'] if len(df4['gender']) == 0 else df4['gender']

The result is that column3 contains only values taken from 'gender'. I've also tried the following statements:

df4['column3'] = np.where(df4['gender'].empty, df4['name'],df4['gender'])
df4['column3'] = df4['name'] if df4['gender'].empty else df4['gender']

Same result, so I think my code is not able to identify an empty string in a pandas DataFrame. What am I missing?

jpp
Nik
  • 2
    Use `df4['column3'] = np.where(df4.gender.eq(''), df4.name, df4.gender)` – Zero Mar 23 '18 at 10:01
  • @Zero ok, it works :) Please create the answer and explain why my code isn't correct – Nik Mar 23 '18 at 10:09
  • Please check my answer: the operation you wrote is not actually applied to each row. You should use apply with axis=1 to get similar logic. – Menglong Li Mar 23 '18 at 10:11
  • Don't use `lambda` for this. Your logic is easily vectorisable. – jpp Mar 23 '18 at 10:17

3 Answers

6

Your numpy.where construct is perfectly fine to use.

The issue you are facing is how to test a column against an empty string. The answer is simply to check equality with ''.

This is straightforward to implement:

df4['column3'] = np.where(df4['gender'] == '', df4['name'], df4['gender'])

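pd.Series.empty tests if the series has no items, i.e. no rows, not whether its elements are empty strings. Likewise, len(df4['gender']) counts rows (7 here), so both of your conditions are single booleans that evaluate to False for the whole column. That is why column3 is always taken from 'gender'.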
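To see the difference concretely, here is a minimal sketch that compares the three tests on the question's 'gender' column. The variable names (`gender`, `is_empty_series`, `has_no_rows`, `mask`) are illustrative only and do not come from the original post.

```python
import pandas as pd

gender = pd.Series(['', '', '', 'f', 'f', 'c', 'c'])

# Series.empty and len() describe the Series as a whole: one boolean each.
is_empty_series = gender.empty      # False: the Series has 7 rows
has_no_rows = len(gender) == 0      # False for the same reason

# Comparing with '' is element-wise: one boolean per row.
mask = gender == ''

print(is_empty_series, has_no_rows)  # False False
print(mask.tolist())                 # [True, True, True, False, False, False, False]
```

Only the element-wise mask gives np.where something to choose on per row.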

Example

import numpy as np
import pandas as pd

vals = {
    'name' : ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender' : ['', '', '', 'f',  'f', 'c', 'c'],
    'age' : [39, 12, 27, 13, 36, 29, 10]
}
df4 = pd.DataFrame(vals)

df4['column3'] = np.where(df4['gender'] == '', df4['name'], df4['gender'])

#    age gender name column3
# 0   39          n1      n1
# 1   12          n2      n2
# 2   27          n3      n3
# 3   13      f   n4       f
# 4   36      f   n5       f
# 5   29      c   n6       c
# 6   10      c   n7       c
jpp
  • 1
    ok. You are right. It works. Answer accepted because you have provided explanation. – Nik Mar 23 '18 at 10:35
1

There are many ways but I feel the following is most succinct:

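# Callable mask: .loc calls it with df4 and receives a boolean Series
idx = lambda x: x.gender == ''
df4.loc[idx, 'column3'] = df4.loc[idx, 'name']
# Rows not matched by idx are still NaN; fill them from 'gender'
df4['column3'] = df4['column3'].fillna(df4['gender'])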
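As a rough illustration of what the callable does, the sketch below checks that passing a lambda to `.loc` selects the same rows as passing a precomputed boolean Series, and then repeats the two-step fill with that mask. The names `mask`, `via_callable` and `via_mask` are made up for this example, and it assumes a freshly built `df4` without the 'age' column.

```python
import pandas as pd

df4 = pd.DataFrame({
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender': ['', '', '', 'f', 'f', 'c', 'c'],
})

# Precomputed boolean mask instead of a callable.
mask = df4['gender'] == ''

# .loc calls a callable with the DataFrame, so both forms select the same rows.
via_callable = df4.loc[lambda x: x['gender'] == '', 'name']
via_mask = df4.loc[mask, 'name']
print(via_callable.equals(via_mask))  # True

# Same two-step fill, written with the mask.
df4.loc[mask, 'column3'] = df4.loc[mask, 'name']
df4['column3'] = df4['column3'].fillna(df4['gender'])

print(df4['column3'].tolist())  # ['n1', 'n2', 'n3', 'f', 'f', 'c', 'c']
```

Either way the comparison runs once over the whole column rather than once per element.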
Little Bobby Tables
  • @jpp I don't think you understand what is going on here. It is vectorised. The lambda is taking a whole dataframe `x` and doing a boolean comparison on the column `gender`. `loc` then uses this as a vectorised index. This stops me from repeatedly filtering inside the `loc`. It also means that I don't create a potentially large `idx` object by actually creating the bool index. See [here](https://stackoverflow.com/questions/37102824/why-does-not-work-pandas-df-loc-lambda) for more information. – Little Bobby Tables Mar 23 '18 at 10:30
  • @jpp did you read my explanation? I am not using the `lambda` inside an apply. Instead of spending your time being insulting, spend some time to read my explanation. – Little Bobby Tables Mar 23 '18 at 10:35
  • @jpp Yes. In `lambda x: x.gender` the `x` is a DataFrame and therefore `x.gender` is a Series, i.e. it is vectorised. It is not using the lambda function on each element of the DataFrame or Series. – Little Bobby Tables Mar 23 '18 at 10:39
  • @jpp I wrote this above in my first explanation: "It also means that I don't create a potentially large `idx` object by actually creating the bool index" as a Series. If the original DataFrame was large then this would be a large `idx` object. – Little Bobby Tables Mar 23 '18 at 10:46
  • @jpp your point was that you don't like the use of element-wise lambda functions. This is not that. – Little Bobby Tables Mar 23 '18 at 10:47
  • 1
    I don't like `lambda` functions anywhere if they don't serve a purpose :). I dispute your stated purpose. – jpp Mar 23 '18 at 10:48
1

I prefer using pandas alone to do this instead of introducing numpy:

df4['column3'] = df4[['gender', 'name']].apply(lambda x: x['gender'] if x['gender'] else x['name'], axis=1)
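If the goal is to stay within pandas while avoiding a row-wise `apply`, one possible alternative (not part of the original answer) is `Series.mask`, which replaces the values where a condition is True. The sketch below assumes the same data as the question, minus the 'age' column, and the name `row_wise` is only illustrative.

```python
import pandas as pd

df4 = pd.DataFrame({
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender': ['', '', '', 'f', 'f', 'c', 'c'],
})

# Where 'gender' is an empty string, take the value from 'name' instead.
df4['column3'] = df4['gender'].mask(df4['gender'] == '', df4['name'])

# The row-wise apply gives the same values, one Python call per row.
row_wise = df4.apply(lambda x: x['gender'] if x['gender'] else x['name'], axis=1)

print(df4['column3'].tolist())  # ['n1', 'n2', 'n3', 'f', 'f', 'c', 'c']
print(row_wise.tolist() == df4['column3'].tolist())  # True
```

Both produce the same column; the `mask` version works on the whole column at once instead of calling a Python function for every row.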
Menglong Li
  • 1
    @MenglongLi I didn't. I upvoted you. I was asking why someone else downvoted you without leaving a comment. This is a good answer. – Little Bobby Tables Mar 23 '18 at 10:21
  • @jpp you should explain your reason before downvoting. This allows users to correct mistakes or change their way of thinking. – Little Bobby Tables Mar 23 '18 at 10:23
  • [**Why is pandas apply lambda slower than loop here?**](https://stackoverflow.com/questions/47749018/why-is-pandas-apply-lambda-slower-than-loop-here) – jpp Mar 23 '18 at 10:34