1

This is the following subset of a dataset:

A  B    C         D         R        sentence              ADR1         ADR2     
112 135 21  EffexorXR.21    1    lack of good feeling.     good        feeling
113 135 21  EffexorXR.21    1                               1
115 136 21  EffexorXR.21    2    Feel disconnected        disconnected   feel    
116 136 21  EffexorXR.21    2                                             0
118 142 22  EffexorXR.22    1    Weight gain               gain         
119 142 22  EffexorXR.22    1                                1             

In column ADR1 and ADR2, for each word, there should be 1 or 0 in the row blow it. if the value is missing, I need to replace it with "0". This is the output:

A  B    C         D         R        sentence              ADR1         ADR2     
112 135 21  EffexorXR.21    1    lack of good feeling.     good        feeling
113 135 21  EffexorXR.21    1                               1             0
115 136 21  EffexorXR.21    2    Feel disconnected        disconnected   feel    
116 136 21  EffexorXR.21    2                                 0            0
118 142 22  EffexorXR.22    1    Weight gain               gain         
119 142 22  EffexorXR.22    1                                1    

I tried

df[ADR1].fillna(0, inplace=True) and df[ADR2].fillna(0, inplace=True)

but this code produce the following df, which is not wanted

 A  B    C         D         R        sentence              ADR1         ADR2     
112 135 21  EffexorXR.21    1    lack of good feeling.     good        feeling
    113 135 21  EffexorXR.21    1                               1        0
    115 136 21  EffexorXR.21    2    Feel disconnected        disconnected   feel                                                                 0
    116 136 21  EffexorXR.21    2                                             0
    118 142 22  EffexorXR.22    1    Weight gain               gain           0
    119 142 22  EffexorXR.22    1                                1            0 
Mary
  • 1,142
  • 1
  • 16
  • 37
  • @open-Source, last row just have one word and it has 1 below it. ADR2 for last row is null, so the row blow it is also null. – Mary Jun 25 '17 at 02:07
  • I realized later so, removed my previous comment, thanks – niraj Jun 25 '17 at 02:07
  • Please show us what you've tried. SO is a place to get help with specific program coding issues. – stevieb Jun 25 '17 at 02:11

2 Answers2

3

You can use reshape to allow working on the data every other row at a time. Something like:

Code:

for col in ['ADR1', 'ADR2']:
    data = np.reshape(df[col].values, (-1, 2))
    need_fill = np.logical_and(data[:, 0] != '', data[:, 1] == '')
    data[np.where(need_fill),1] = 0

Test Code:

import pandas as pd
from io import StringIO
import numpy as np

df = pd.read_fwf(StringIO(u"""
    A   B   C   D             R  sentence              ADR1         ADR2     
    112 135 21  EffexorXR.21  1  lack of good feeling  good         feeling
    113 135 21  EffexorXR.21  1                        1
    115 136 21  EffexorXR.21  2  Feel disconnected     disconnected feel    
    116 136 21  EffexorXR.21  2                                     0
    118 142 22  EffexorXR.22  1  Weight gain           gain         
    119 142 22  EffexorXR.22  1                        1"""),
                 header=1).fillna('')

print(df)
for col in ['ADR1', 'ADR2']:
    data = np.reshape(df[col].values, (-1, 2))
    need_fill = np.logical_and(data[:, 0] != '', data[:, 1] == '')
    data[np.where(need_fill),1] = 0
print(df)

Results:

     A    B   C             D  R              sentence          ADR1     ADR2
0  112  135  21  EffexorXR.21  1  lack of good feeling          good  feeling
1  113  135  21  EffexorXR.21  1                                   1         
2  115  136  21  EffexorXR.21  2     Feel disconnected  disconnected     feel
3  116  136  21  EffexorXR.21  2                                            0
4  118  142  22  EffexorXR.22  1           Weight gain          gain         
5  119  142  22  EffexorXR.22  1                                   1         

     A    B   C             D  R              sentence          ADR1     ADR2
0  112  135  21  EffexorXR.21  1  lack of good feeling          good  feeling
1  113  135  21  EffexorXR.21  1                                   1        0
2  115  136  21  EffexorXR.21  2     Feel disconnected  disconnected     feel
3  116  136  21  EffexorXR.21  2                                   0        0
4  118  142  22  EffexorXR.22  1           Weight gain          gain         
5  119  142  22  EffexorXR.22  1                                   1         
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
  • 1
    thank you. If I have more than 2 columns for ADR, I just need to add it to the list of ['ADR1', 'ADR2'], and other part of the code does not change. Yes ? – Mary Jun 25 '17 at 02:28
  • Give it a try. If you copy and paste the section under *Test Code*, it is standalone. Rename the column in the test data, and then try your suggested change and see if it still works. – Stephen Rauch Jun 25 '17 at 02:32
  • I copied your code, but it did not work. My data frame has just 2 column of ADR1 and ADR2 , any suggestion ? – Mary Jun 25 '17 at 03:02
  • I would need more than a simple *it does not work*. What does it not do? – Stephen Rauch Jun 25 '17 at 03:19
  • is there any place that I can upload a subset of my document that you can access to it ? – Mary Jun 25 '17 at 08:47
  • Thank you. Would you please answer this question, it is very similar to this question. https://stackoverflow.com/questions/44747810/replacing-nan-value-with-a-word-when-nan-is-not-repeated-in-two-consecutive-rows – Mary Jun 25 '17 at 16:03
1

Other way you can try is to iterate through rows of dataframe for each column to check and check if next value is empty given the first value is not empty, then update the value to 0:

col_list = ['ADR1', 'ADR2'] # columns to check
for column in col_list: # for each column go through each rows
    # however the step size is 2 at a time since current and next is checked
     for i in range(0, df.shape[0]-1, 2): 
        first_val = df.loc[i][column]
        next_val = df.loc[i+1][column]
        # check if given current not empty, is next empty
        if not first_val == '' and next_val  == '':
            df.loc[i+1, column] = 0 # update the value
niraj
  • 17,498
  • 4
  • 33
  • 48
  • it has the following error: KeyError: 'the label [2] is not in the [index]', Do you know why ? – Mary Jun 25 '17 at 03:09
  • @Mary not sure, I did not get any error. on which line did it actually gave error in? Can you check if replacing `column` with `str(column)` works? https://stackoverflow.com/a/33590065/5916727 – niraj Jun 25 '17 at 03:12
  • yes it solved. Would you please answer this question: https://stackoverflow.com/questions/44747230/replacing-missing-with-a-word-based-on-values-of-the-next-row-if-value-in-next – Mary Jun 25 '17 at 14:36
  • @Mary Great! do you mean adding `str(column)` helped or you found different solution? (*Just curious*) I will try to look into other question. – niraj Jun 25 '17 at 14:41
  • Yes, This is the link for other question: https://stackoverflow.com/questions/44747810/replacing-nan-value-with-a-word-when-nan-is-not-repeated-in-two-consecutive-rows – Mary Jun 25 '17 at 16:01