2

I would like to copy the values from one column (PERSONAL INFORMATION) to the other (VARIABLE) based on multiple conditions as listed below

  1. Check whether 'PERSONAL INFORMATION' column starts with digits

  2. Check whether the corresponding row value in 'VARIABLE' column is Nan

  3. Check whether the previous row value in 'VARIABLE' column is not Nan (Here 'gender' is not nan but there might be cases when it's Nan)

Once all the conditions are met, I would like to copy over the values from 'PERSONAL INFORMATION' COLUMN TO 'VARIABLE' COLUMN

Please find below how the input data looks like

df = pd.DataFrame({'PERSONAL INFORMATION':['Gender','1.Male','2.Female','Ethnicity','1.Chinese','2.Indian','3.Eurasian','Marital Status','1.Single','2.Married','3.Divorced'], 'VARIABLE':['gender', np.nan, np.nan,'ethn',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

Marital status options shouldn't be copied to "Variable" column as the previous row is Nan here.

I was using np.where option but wasn't sure how to check the previous row value. I don't wish to use for loop.

df['VARIABLE'] = np.where((df['PERSONAL 
INFORMATION'].str.startswith(('\d+')) == True) & (df['VARIABLE'].isna() == 
True) & 3RD CONDITION FOR PREVIOUS ROW CHECK

Can you please help me to find out how to check the previous row value for nan. If it's Nan i don't wish to copy the data. If it's not Nan, then the data has to be copied

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
The Great
  • 7,215
  • 7
  • 40
  • 128
  • @SELVA - Can you add some sample data for verify `3RD CONDITION FOR PREVIOUS ROW CHECK` ? – jezrael May 29 '19 at 06:06
  • In the above example, the previous row has value 'gender'. As it's not nan, we can copy the data from PI column to Variable column. If you are asking for some 10-15 rows, can you let me know how can I upload them here – The Great May 29 '19 at 06:08
  • @SELVA - try copy them like text with 4 spaces in start of rows for correct formating. Try like you know, I help then with formatting. – jezrael May 29 '19 at 06:10
  • @SELVA - Or try edit `df = pd.DataFrame({'PERSONAL INFORMATION':['Gender','1.Male','2.Female'], 'VARIABLE':['gender', np.nan, np.nan]})` – jezrael May 29 '19 at 06:12
  • @jezrael - Personal_Information,Variable Gender,gender 1. Male,na 2. Female,na Ethnicity, ethn 1. Chinese, na 2. Malay, na 3. Indian, na 4. Others: please specify ___ Current Marital Status,a6marrstat 1. Never married, na 2. Currently married, na 3. Separated but not divorced, na 4. Divorced,na 5. Widowed, na 6. Separated/divorced, na 7. Single, na 888. Refused, na – The Great May 29 '19 at 06:15
  • @SELVA - Super, can you add this to `df = pd.DataFrame({'PERSONAL INFORMATION':['Gender','1.Male','2.Female'], 'VARIABLE':['gender', np.nan, np.nan]})` ? thank you. – jezrael May 29 '19 at 06:16
  • df = pd.DataFrame({'PERSONAL INFORMATION':['Gender','1.Male','2.Female','Ethnicity','1.Chinese','2.Indian','3.Eurasian','Marital Status','1.Single','2.Married','3.Divorced'], 'VARIABLE':['gender', np.nan, np.nan,'ethn',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}) – The Great May 29 '19 at 06:20
  • Marital status options shouldn't be copied to "Variable" column as the previous row is Nan here – The Great May 29 '19 at 06:21
  • Answer was edited. – jezrael May 29 '19 at 06:27

1 Answers1

2

I believe you need Series.str.contains with ^ for start of string and \d for digit, then create groups by testing values not contains digits with Series.cumsum and GroupBy.transform with first:

m1 = df['PERSONAL INFORMATION'].str.contains('^\d')
s = df.groupby((~m1).cumsum())['VARIABLE'].transform('first')

mask = m1 & df['VARIABLE'].isna() & s.notna()

df.loc[mask, 'VARIABLE'] = df.loc[mask, 'PERSONAL INFORMATION']
print (df)
   PERSONAL INFORMATION    VARIABLE
0                Gender      gender
1                1.Male      1.Male
2              2.Female    2.Female
3             Ethnicity        ethn
4             1.Chinese   1.Chinese
5              2.Indian    2.Indian
6            3.Eurasian  3.Eurasian
7        Marital Status         NaN
8              1.Single         NaN
9             2.Married         NaN
10           3.Divorced         NaN

Details:

print ((~m1).cumsum())
0     1
1     1
2     1
3     2
4     2
5     2
6     2
7     3
8     3
9     3
10    3
Name: PERSONAL INFORMATION, dtype: int32

print (df.groupby((~m1).cumsum())['VARIABLE'].transform('first'))
0     gender
1     gender
2     gender
3       ethn
4       ethn
5       ethn
6       ethn
7        NaN
8        NaN
9        NaN
10       NaN
Name: VARIABLE, dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Can you please help me understand the application of tilde operator? I mean 0's should be inverted to 1's and vice versa but here I see 7 and 8 being converted to 3. Help me understand how this works? – The Great May 29 '19 at 06:39
  • 1
    @SELVA - it change mask filled by `True` and `False` like [this](https://stackoverflow.com/q/15998188) – jezrael May 29 '19 at 06:40