
I am starting to dig deeper into Python and am having trouble converting some of my R scripts into Python. I have a function defined in R:

Shft_Rw <- function(x) {
  for (row in 1:nrow(x)) {
    new_row <- x[row, c(which(!is.na(x[row, ])), which(is.na(x[row, ])))]
    colnames(new_row) <- colnames(x)
    x[row, ] <- new_row
  }
  return(x)
}

It essentially takes the leading NAs of each row in a dataframe and puts them at the end of the row, i.e.

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, np.nan, 3], 'b': [3, np.nan, 5], 'c': [3, 4, 5]})

df
Out[156]: 
     a    b  c
0  NaN  3.0  3
1  NaN  NaN  4
2  3.0  5.0  5

turns into:

df2 = pd.DataFrame({'a': [3, 4, 3], 'b': [3, np.nan, 5], 'c': [np.nan, np.nan, 5]})
df2
Out[157]: 
   a    b    c
0  3  3.0  NaN
1  4  NaN  NaN
2  3  5.0  5.0

So far I have:

def Shft_Rw(x):
    for row in np.arange(0,x.shape[0]):
        new_row = x.iloc[row,[np.where(pd.notnull(x.iloc[row])),np.where(pd.isnull(df.iloc[row]))]]

But it's throwing errors. Using the sample df above, I can index a row with iloc and get the column positions where it is null/not null (using np.where()), but I can't put the two together (I've tried numerous variations with more brackets, etc.).

df.iloc[1]
Out[170]: 
a    NaN
b    NaN
c    4.0

np.where(pd.isnull(df.iloc[1]))
Out[167]: (array([0, 1], dtype=int64),)

df.iloc[1,np.where(pd.notnull(df.iloc[1]))]

Is anyone able to help replicate the function AND/OR show a more efficient way to solve the problem?

Thanks!

HowdyDude
  • What should happen with a row such as "2 NaN 3"? Is the expected output "2 NaN 3" or "3 2 NaN"? – Mr. T Jul 08 '18 at 00:16
  • For my specific purpose of analysis I would do either a forward fill with the last actual result OR a simple linear interpolation, i.e. (2, 2, 3) or (2, 2.5, 3) (see the short snippet after these comments). Even further, if the original line was (NA, NA, 2, NA, 3) I would want it transformed to: (2, 2, 3, NA, NA). I haven't seen any instance of that in my dataset yet, but great question - as I am sure that instance could arise. – HowdyDude Jul 08 '18 at 11:48
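
For illustration, the forward-fill and interpolation options mentioned in the last comment might look like this on a single row (a minimal sketch; ffill and interpolate are standard pandas Series methods):

s = pd.Series([2, np.nan, 3])
s.ffill()         # -> 2.0, 2.0, 3.0  (repeat the last actual result)
s.interpolate()   # -> 2.0, 2.5, 3.0  (simple linear interpolation)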

1 Answer


Use apply with dropna:

df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df1.columns = df.columns
print (df1)
     a    b    c
0  3.0  3.0  NaN
1  4.0  NaN  NaN
2  3.0  5.0  5.0
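
If you want a drop-in equivalent of the R Shft_Rw function, here is a small sketch wrapping the same apply/dropna idea (the reindex only guards the case where every row contains at least one NaN, so the output keeps the original column count):

def Shft_Rw(x):
    # move each row's non-NaN values to the front, padding the end with NaN
    out = x.apply(lambda r: pd.Series(r.dropna().values), axis=1)
    out = out.reindex(columns=range(x.shape[1]))  # keep the original number of columns
    out.columns = x.columns
    return out

Shft_Rw(df)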

If performance is important, I suggest using a justify function:
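
A minimal NumPy sketch of such a helper, since it isn't defined above (the idea: sort a boolean validity mask along the axis, then scatter the valid values back to one side; the name and signature simply mirror the call below):

def justify(a, invalid_val=np.nan, axis=1, side='left'):
    # mask of the entries we want to keep
    mask = ~np.isnan(a) if invalid_val is np.nan else (a != invalid_val)
    # sorting booleans pushes True to the end; flipping puts them at the start
    justified_mask = np.sort(mask, axis=axis)
    if side in ('left', 'up'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]    # row-major assignment keeps per-row order
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out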

arr = justify(df.values, invalid_val=np.nan, axis=1, side='left')
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
     a    b    c
0  3.0  3.0  NaN
1  4.0  NaN  NaN
2  3.0  5.0  5.0
jezrael
  • Awesome! That worked - just had to do one interim step. Apparently using groupby changes NaNs to 0, so I just had to do a .replace(0, np.nan) before your solution. Thanks! – HowdyDude Jul 08 '18 at 14:31
  • On second thought, it was probably the .aggregate(np.sum) which converted the NaNs – HowdyDude Jul 08 '18 at 14:44
  • @HowdyDude I think it is possible to use `.sum(min_count=1)` instead of `.aggregate(np.sum)`, check [this](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#arithmetic-operations) – jezrael Jul 08 '18 at 14:49
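
A small illustration of the min_count suggestion (assuming pandas 0.22 or later, where the keyword was introduced), which avoids the interim .replace(0, np.nan) step:

s = pd.Series([np.nan, np.nan])
s.sum()              # 0.0 -> an all-NaN group silently becomes 0
s.sum(min_count=1)   # nan -> stays NaN when there are no valid values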