0

Recently, I am converting from SAS to Python pandas. One question I have is that does pandas have a retain like function in SAS.

My SAS code is:

 data df1;
retain col3 "    ";
set df;
by ID  ;
if first.ID  then col3=col1;
else   col3=col3;

the other condition i have is in SAS code is:

 data df1;
retain col3;
set df;
by ID  ;
if first.ID  then col3=1;
else   col3=col3+1;

Next SAS code which i am trying to convert is shown below:

   proc sql;
   create table t1 as
     select 
      c1, c2, c3, c4, c5, flag, max(flag) as MAX_flag
   from t1
    group by c1, c2, c3,  c5;
  run;

I tried it in pandas as below but seems i am making some silly mistakes. If anyone know how to replicate above sas code block3 into pandas

  t1=t1[['c1','c2','c3','c4','c5','c6']]
  t1.loc[:,'Max_flag']=t1['flag'].max()
  t1.groupby(['c1','c2','c3','c5'])

In my eg. col3 is B and col1 is a. The condition on which it should be done is df.groupby(['ID'],as_index=False).first()

I have 2 columns ID, A. my requirement is Input data df1:

 ID    A
  1    a
  1    b
  2    c
  1    p
  2    q

Output dataframe should have one more column name as B. It will group by ID.first(). and copy the data of col A to col B for all the grouped ID.

Output should be df1

 ID    A    B
  1    a    a
  1    b    a
  2    c    c
  1    p    a
  2    q    c

 **My key requirement is to convert above SAS code to Pnadas**
user07
  • 658
  • 3
  • 13
  • 27
  • Could you please provide an example of input and output data? – Ilya V. Schurov Nov 19 '16 at 20:41
  • Input would be like table of say 10 columns having id column... and i need output data should contain table of 11 columns with values in our new column i.e. col_1 from col_2. Also no. of records needs to be same in new dataframe also. – user07 Nov 19 '16 at 20:54
  • Could you please edit your question and include example data? It will make it much more readable and allow others to understand what do you ask better. See [here](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for advices on how to do it better. – Ilya V. Schurov Nov 19 '16 at 20:57
  • 1
    Thanks for the data, but it doesn't explain what are you going to do. Current data suggests that simple `df['col4']=df['col1']` will work, but I believe you are asking about something more complicated. Could you please give data example that will explain what it is needed to do? – Ilya V. Schurov Nov 19 '16 at 21:22
  • actually i need to convert that sas code which i mentioned into pandas. Yes you are correct its a simple data i need to do complex transformation. what i need is as of now is covert the sas code into pyspark and get the same result. Any help would be appreciated. – user07 Nov 19 '16 at 21:51
  • 2
    Can you provide *simpl*e data example that makes it clear what transformation do you need? You provide SAS code as an explanation but I have no experience with SAS so I cannot understand from that code what is needed. Of course, you can just wait until somebody who knows both SAS and pandas appear and answer your question, but in my opinion it is much better to make your question understandable by those who know only pandas. It will reduce the time you need to wait before you get an answer. – Ilya V. Schurov Nov 19 '16 at 21:55
  • @user07, as Ilya V. Schurov has already said you would better provide [a minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) or you will have to wait for rare experts that know both SAS __and__ Pandas/PySpark. Pleas read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). IMO providing a small reproducible data set (3-5 rows) and desired data set will increase the chance to get an answer dramatically... – MaxU - stand with Ukraine Nov 19 '16 at 22:32
  • @Ilya V. Schurov, I have updated my question do you have any idea how to implement above scinero in pandas – user07 Nov 24 '16 at 17:29

2 Answers2

2

Consider a merge with groupby.first():

df = df.merge(df.groupby('ID').first().reset_index(), on='ID').rename(columns={'A_x':'A', 'A_y':'B'})

#    ID  A  B
# 0   1  a  a
# 1   1  b  a
# 2   1  p  a
# 3   2  c  c
# 4   2  q  c

This can also be generalized with groupby.nth() which is zero based. Below takes the second value in each ID:

df = df.merge(df.groupby('ID').nth(1).reset_index(), on='ID').rename(columns={'A_x':'A', 'A_y':'B'})

#    ID  A  B
# 0   1  a  b
# 1   1  b  b
# 2   1  p  b
# 3   2  c  q
# 4   2  q  q

And as well as groupby.last():

df = df.merge(df.groupby('ID').last().reset_index(), on='ID').rename(columns={'A_x':'A', 'A_y':'B'})

#    ID  A  B
# 0   1  a  p
# 1   1  b  p
# 2   1  p  p
# 3   2  c  q
# 4   2  q  q
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • above answer is correct for if first.ID .. then i want to keep the value same. but if not first.ID and then i need to change the values how i can achieve this ? I mean this if first.ID is condition for me on basis of this i have set values as if and else. How i will set value for else condition ?? – user07 Nov 25 '16 at 16:24
  • Please re-phrase your comment as I cannot understand. The `groupby.first()` here yields the same as your desired `df1` output in above question (just a changed ID ordering). If you are changing your original requirement, please ask a separate StackOverflow question. – Parfait Nov 25 '16 at 16:32
  • i have edited my question. i am not changing my original requirement just adding one else condition for which i need to update records. basically i need to convert above SAS code to pandas or pyspark – user07 Nov 25 '16 at 17:17
  • What is the SAS data input? I am not seeing a col3. – Parfait Nov 25 '16 at 20:42
  • its a new column in dataframe which will get add – user07 Nov 25 '16 at 21:52
  • I am not understanding. I do have SAS and I just ran your first code block which exactly matches the answer here for `groupby.first()` (just different col names). Your second is looking for a running group count: `df['B'] = df.groupby('ID').cumcount()+1` – Parfait Nov 25 '16 at 23:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/129079/discussion-between-user07-and-parfait). – user07 Nov 26 '16 at 01:01
  • above condition is not working. the else part of SAS code block 1 is not getting executed. it would be really helpful if can help me to show the else condition as well. for groupby(id).first() block of sas code equivalent in pandas – user07 Nov 26 '16 at 15:32
  • Parfait.. your code is working but i have many columns and group by will be done one 1 columns only. So while applying your code merge i am getting so many _x and _y columns ... how to avoid them and get only 1 extra column ? – user07 Nov 29 '16 at 10:32
0

Form what you have tried, looks like you want to have 'col4' equal to 'col2' if 'col1' is not blank, otherwise 'col4' should be blank. So you do not need group-by. Here is how you do the case the case I explained here:

In [80]: df = pd.DataFrame([[1,'a','b','c'],[2,'e','r','g'], [3,'BLANK', '', '']], columns=['ID','col1', 'col2','col3'])

In [81]: df
Out[81]: 
   ID   col1 col2 col3
0   1      a    b    c
1   2      e    r    g
2   3  BLANK          

In [82]: df['col4'] = np.where(df['col1'] == 'BLANK', 'BLANK', df['col2'])

In [83]: df
Out[83]: 
   ID   col1 col2 col3   col4
0   1      a    b    c      b
1   2      e    r    g      r
2   3  BLANK            BLANK
Ali
  • 1,605
  • 1
  • 13
  • 19
  • alivar thanks for your answer. but my actually requirement is to convert my sas code into pyspark or panda dataframes. i need to get same result of sas code as in my panda dataframes – user07 Nov 19 '16 at 21:52
  • as in sas it is being used as a id.first if its true then only it should get assign the col1 value to col4 else keep col4 as default. And also i am not sure about how first works in panda. As in sas it will group all the id with first and then copy the records – user07 Nov 19 '16 at 21:54
  • 1
    That would be great if you update your question with a more clear description. For example, it is not clear what should be the value of col4 if col1 is not blank. – Ali Nov 19 '16 at 22:09