
I have a dataframe (df) (originally from an Excel file) and the first 9 rows look like this:

      Control      Recd_Date/Due_Date                Action        Signature/Requester
0     2000-1703   2000-01-31 00:00:00           OC/OER/OPA/PMS/                 M WEBB
1           NaN   2000-02-29 00:00:00                       NaN              DATA CORP
2     2000-1776   2000-01-02 00:00:00            OC/ORA/OE/DCP/                  G KAN
3           NaN   2000-01-03 00:00:00           OC/ORA/ORO/PNC/              PALM POST
4           NaN                   NaN  FDA/OGROP/ORA/SE-FO/FLA-                    NaN
5           NaN                   NaN                DO/FLA-CB/                    NaN
6     2000-1983   2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-                 M EGAN
7           NaN   2000-02-03 00:00:00                DO/CHI-CB/   BERNSTEIN LIEBHARD &
8           NaN                   NaN                       NaN             LONDON LLP
  • type(df['Control'][1]) = float
  • type(df['Recd_Date/Due_Date'][1]) = datetime.datetime
  • type(df['Action_Office'][1]) = float
  • type(df['Signature/Requester'][1]) = unicode

I want to transform this dataframe (e.g. the first 9 rows) into this:

      Control            Recd_Date/Due_Date                           Action                                                            Signature/Requester
0     2000-1703   2000-01-31 00:00:00,2000-02-29 00:00:00           OC/OER/OPA/PMS/                                                      M WEBB,DATA CORP
1     2000-1776   2000-01-02 00:00:00,2000-01-03 00:00:00           OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/      G KAN,PALM POST
2     2000-1983   2000-02-02 00:00:00,2000-02-03 00:00:00           FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/                                   M EGAN,BERNSTEIN LIEBHARD & LONDON LLP

So basically:

  • Every time pd.isnull(row['Control']) is true (this should be the only if condition), merge that row into the previous row whose 'Control' value is not null.
  • For 'Recd_Date/Due_Date' and 'Signature/Requester', add ',' (or '/') between each two values from the merged rows (e.g. '2000-01-31 00:00:00,2000-02-29 00:00:00' and 'G KAN,PALM POST').
  • For 'Action', simply concatenate the values without any punctuation added (e.g. FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/).

Can anyone help me out, please? This is the code I'm trying to get to work:

for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])

How come the drop() doesn't work? I am trying to drop the current row (if its ['Control_#'] is null) so that the next row (whose ['Control_#'] is null) can be added to the previous row (whose ['Control_#'] is NOT null) iteratively.

Much appreciated!!

Blue482
  • Have you looked at the shift method? http://stackoverflow.com/questions/22081878/get-previous-rows-value-and-calculate-new-column-pandas-python – David Greydanus Jul 28 '14 at 00:34

1 Answer


I think you need to group the rows together and then join up the column values. The tricky part is finding a way to group together the rows in the way you want. Here is my solution...

1) Grouping Together the Rows: Static variables

Since your groups depend on a sequence in your rows, I used a static variable on a function to label every row with a specific group:

def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value is the signal to switch to a new group
    return rolling_group.group
rolling_group.group = 0  # static variable

This function is applied along the Control series to sort indexes into groups, which are then used to split up the dataframe so the rows can be merged:

groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
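
To see what the labels look like, here is a quick sketch run against the 9 sample rows above (the non-null 'Control' values in rows 0, 2 and 6 each start a new group):

labels = df['Control'].apply(rolling_group)
print list(labels)
# [1, 1, 2, 2, 2, 2, 3, 3, 3]
rolling_group.group = 0  # reset the static counter if you run the real groupby afterwards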

That is really the only tricky part; after that, you can just merge the rows by applying a function to each group that gives you your desired output.

Full Solution Code

import re
import StringIO

import pandas as pd

def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value is the signal to switch to a new group
    return rolling_group.group
rolling_group.group = 0  # static variable

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    s = re.sub("(?<=&)" + joiner, " ", s)  # after '&' the joiner becomes " "
    s = re.sub("(?<=-)" + joiner, "", s)   # after '-' the joiner is dropped
    s = re.sub(joiner * 2, joiner, s)      # fixes the double-joiner condition
    return s

(Edit: the `str(each)` in the list comprehension converts every item to a string so the join works on non-string columns, and the regex substitutions above clean up the joined strings.)
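
To make that cleanup concrete, here is a hand trace of what joinFunc does to the Action column of the second sample group (a sketch; the intermediate strings are shown as comments):

vals = ['OC/ORA/OE/DCP/', 'OC/ORA/ORO/PNC/', 'FDA/OGROP/ORA/SE-FO/FLA-', 'DO/FLA-CB/']
s = "/".join(vals)            # 'OC/ORA/OE/DCP//OC/ORA/ORO/PNC//FDA/OGROP/ORA/SE-FO/FLA-/DO/FLA-CB/'
s = re.sub("(?<=-)/", "", s)  # 'OC/ORA/OE/DCP//OC/ORA/ORO/PNC//FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/'
s = re.sub("//", "/", s)      # 'OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/'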

if __name__ == "__main__":
    df = """      Control      Recd_Date/Due_Date                Action        Signature/Requester
0     2000-1703   2000-01-31 00:00:00           OC/OER/OPA/PMS/                 M WEBB
1           NaN   2000-02-29 00:00:00                       NaN              DATA CORP
2     2000-1776   2000-01-02 00:00:00            OC/ORA/OE/DCP/                  G KAN
3           NaN   2000-01-03 00:00:00           OC/ORA/ORO/PNC/              PALM POST
4           NaN                   NaN  FDA/OGROP/ORA/SE-FO/FLA-                    NaN
5           NaN                   NaN                DO/FLA-CB/                    NaN
6     2000-1983   2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-                 M EGAN
7           NaN   2000-02-03 00:00:00                DO/CHI-CB/   BERNSTEIN LIEBHARD &
8           NaN                   NaN                       NaN             LONDON LLP"""
    df = pd.read_csv(StringIO.StringIO(df), sep=r"\s\s+", engine='python')

    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print groups.apply(groupFunct)

output

     Control                       Recd_Date/Due_Date  \
0  2000-1703  2000-01-31 00:00:00,2000-02-29 00:00:00   
1  2000-1776  2000-01-02 00:00:00,2000-01-03 00:00:00   
2  2000-1983  2000-02-02 00:00:00,2000-02-03 00:00:00   

                                              Action  \
0                                    OC/OER/OPA/PMS/   
1  OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...   
2                 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/   

                      Signature/Requester  
0                        M WEBB,DATA CORP  
1                         G KAN,PALM POST  
2  M EGAN,BERNSTEIN LIEBHARD & LONDON LLP  
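
As an aside, if you would rather not use the static variable, the same grouping key can be built statelessly; a minimal sketch of the idea, assuming 'Control' is still the column whose non-null values mark the start of each group:

# every non-null 'Control' starts a new group, so a cumulative sum of the
# non-null flags produces the same labels as rolling_group (1, 1, 2, 2, 2, 2, 3, 3, 3)
key = df['Control'].notnull().cumsum()
groups = df.groupby(key, as_index=False)
print groups.apply(groupFunct)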
ZJS
  • Oh thanks so much!! In my original dataframe, type(df['Recd_Date/Due_Date'][i]) = datetime.datetime. So when I run your code on my dataframe, I get "TypeError: sequence item 1: expected string or Unicode, datetime.datetime found"... I converted the ['Recd_Date/Due_Date'] column to str as 'df['Recd_Date/Due_Date'] = df['Recd_Date/Due_Date'].apply(str)' but still get the same error. Why? thanks.. – Blue482 Jul 28 '14 at 18:48
  • I believe it is because the string.join(iterable) method only takes iterables with strings in it. I edited the joinFunc method to convert all items to a string in the list comprehension section of it. Your other columns also use non-string/unicode types. One more tip: you can use astype to convert a column's data type – ZJS Jul 28 '14 at 19:27
  • Thanks very much for your help man! Can you please help me add two more conditions, as I failed to add them to your code myself... 1) For column 'Action', if its value ends with the character '-', then joiner = "" else ",". 2) For column 'Signature/Requester', if its value ends with the character '&', then joiner = " " else ",". Like this: – Blue482 Jul 29 '14 at 18:57
  • `if str(df.ix[i-1]['Action'])[-1] == '-': df.ix[i-1]['Action'] = str(df.ix[i-1]['Action'])+str(df.ix[i]['Action']) else: df.ix[i-1]['Action'] = str(df.ix[i-1]['Action'])+','+str(df.ix[i]['Action'])` – Blue482 Jul 29 '14 at 18:58
  • `if pd.isnull(df.ix[i-1]['Signature/Requester']): df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester']) elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&': df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester']) else: df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])` Thanks!! – Blue482 Jul 29 '14 at 18:58
  • I added some additional regex to clean up the joins, do you understand what is going on in the code? – ZJS Jul 30 '14 at 13:37
  • Thanks! For the most part I understand, but I don't really get the way you used `joiner.join()`. If each is notnull then each = each + joiner (I only see you joining one each with a joiner); but how did you manage to get `If next_each is null: then return each (joiner not added); else if next_each is not null: then return each = each + joiner + next_each` to work?? Thanks for your patience man, really appreciated! And btw `joiner.join()` doesn't seem to work when type(Recd_Date/Due_Date) becomes pandas.tslib.Timestamp ('NaN's become 'NaT's), any idea why? Thanks! – Blue482 Jul 30 '14 at 16:07
  • I used the re.sub lines to change something like "-/" into "-" for every condition. The code should work with NaTs. Want to post the creation of the dataframe so I can work with the dtypes you are using? `df.to_dict('records')[:10]` – ZJS Jul 30 '14 at 16:20
  • You can download the sample data from `https://www.dropbox.com/s/0w9f02hi9qoccud/foi_first2page%28from_freepdfconvert%29.xlsx` and `read_excel()`. Or from `https://www.dropbox.com/s/e22qyo3hzcr8ca4/foi_first2page%28from_adobe%29.xlsx`, but after I used read_excel() I found the column names are embedded in rows 36 and 37... Thanks! – Blue482 Jul 30 '14 at 20:04
  • df.to_dict('Recd_Date/Due_Date')[:10] Out[22]: [{u'Action_Office': u'OC/OER/OPA/PMS/', u'Control_#': u'2000-1703', u'Recd_Date/Due_Date': Timestamp('2000-01-31 00:00:00', tz=None), u'Signature/Requester': u'M FRIEDMAN', u'Subject': u'DLECTRONIC DATA DUMP - PROD RECALLS, ALERTS &'}, {u'Action_Office': nan, u'Control_#': nan, u'Recd_Date/Due_Date': Timestamp('2000-02-29 00:00:00', tz=None), u'Signature/Requester': u'DATA CONTROL CORP', u'Subject': u'WARNINGS, ENFORCEMENT RPTS'}, ... ... – Blue482 Jul 30 '14 at 20:09
  • I meant I don't quite understand `s = joiner.join([str(each) for each in col if pd.notnull(each)])`; as far as I understand, you joined 'each' with 'joiner' (e.g. `M EGAN` joined with `,`), but for joining `M EGAN` `,` with next_each aka `BERNSTEIN LIEBHARD &`, it's the `pd.Series([joinFunc(g,col) for col in g.columns],index=g.columns)` that does the job, right? And how did you make sure that if next_each is null (e.g. there's nothing below `M EGAN`) then `M EGAN` is not joined by joiner `,`? Thanks! – Blue482 Jul 31 '14 at 12:20
  • Hi, so I just used your dtypes and all the code that I posted is working. As for how I removed the null values from the list, look at this piece of code: `[str(each) for each in col if pd.notnull(each)]` It is a list comprehension, and at the end there is an if statement removing null values. It returns all non-null values from your column, cast as strings, in another list. – ZJS Jul 31 '14 at 12:53
  • @ZJS very nice. Is there any way to do this with a generator instead of static variable? It would be more pythonic. – michel Mar 24 '17 at 16:59