0

I have two arrays

n1 = pd.Series([1,2,3, np.nan, np.nan, 4, 5], index=[3,4,5,6,7,8,9])
n2 = pd.Series([np.nan, np.nan, 4, 5, 3,], index=[2, 4, 5, 10, 11])

the data format is like following and the last column is the result I want to get:

index  n1  n2  resultexpected(n1<n2)
 2         na   na
 3      1       na
 4      2  na   na    
 5      3  4    True
 6      na      na
 7      na      na
 8      4       na
 9      5       na
 10        5    na
 11        11   na

Here is my solution and it is very inefficient.

n1 = pd.Series([1,2,3, np.nan, np.nan, 4, 5], index=[3,4,5,6,7,8,9])
n2 = pd.Series([np.nan, np.nan, 4, 5, 3,], index=[2, 4, 5, 10, 11])

def GT(n1, n2):
    n1_index = n1.index.values
    n2_index = n2.index.values
    index = np.sort(list(set(list(n1_index) + list(n2_index))))

    new_n1 = pd.Series(np.nan, index=index)
    new_n1.loc[n1_index] = n1.values
    new_n2 = pd.Series(np.nan, index=index)
    new_n2.loc[n2_index] = n2.values

    output = pd.Series(new_n1.values < new_n2.values, index=index)
    output.loc[n1[n1.isnull()].index] = np.nan
    output.loc[n2[n2.isnull()].index] = np.nan
    return output

starttime = datetime.datetime.now()
for i in range(500):
    GT(n1, n2)

endtime = datetime.datetime.now()
print(endtime - starttime)

My rough idea is to rebuild two arrays with identical index list and compare them. But the currently solution is very slow. The for loop is what I use to test the computation cost.

The difficult point to me is how to efficiently compare the two values at the same index, and what's the best way to nullify the output result if there isn't a value in array n1 or n2.

Is there any better solutions please? Especially, time efficient way.

Asuralm
  • 85
  • 9
  • Thanks Ann. I read this post and indeed it is very similar to my question. But my main headache is how to improve the computational and time efficiency. There are quite many way to code this. What I am looking for is the fastest way in time. That's why I use 500 loops to measure the time and why I try to use numpy as much as possible. – Asuralm Nov 10 '20 at 14:44
  • Closed wrongly, so reopened. – jezrael Nov 10 '20 at 16:10

1 Answers1

0

Use concat here for 2 columns DataFrame:

df = pd.concat([n1, n2], axis=1, keys=('n1','n2'))
print (df['n1'] < df['n2'])
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
dtype: bool

If need new column with missing values is possible use here map:

df = pd.concat([n1, n2], axis=1, keys=('n1','n2'))
df['resultexpected'] = (df['n1'] < df['n2']).map({True:True, False: np.nan})
print (df)
     n1   n2 resultexpected
2   NaN  NaN            NaN
3   1.0  NaN            NaN
4   2.0  NaN            NaN
5   3.0  4.0           True
6   NaN  NaN            NaN
7   NaN  NaN            NaN
8   4.0  NaN            NaN
9   5.0  NaN            NaN
10  NaN  5.0            NaN
11  NaN  3.0            NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks. that's a much more beautiful way to do this. However, I compared this solution with my solution. The time cost is very similar. On my computer, my solution is around 3s for 500 loops, and this one is similar. Is there a time efficient way to do this please? – Asuralm Nov 10 '20 at 14:39
  • @Asularm Not sure, unfortunately. – jezrael Nov 10 '20 at 16:11