
Hi, I'm really new to Pandas. I tried to figure out what's going on with the datatypes here, but so far I haven't gotten very far.

What I intend to do is very simple: for each target time in a DataFrame data1, find the index of the row in data2 with the nearest time.

Since data1 and data2 are very similar, differing only by minor time offsets due to slightly different sampling rates, I attach only a sample of data1 here:

(screenshot of a data1 sample omitted)

I did something like this in the search of closest match data by comparing the timestamp in data2 to timestamp in data1:

import numpy as np

idxcollect = []
for tstamploop in tstamp_data1[820990:821000]:
    # Index of the last row of data2 whose timestamp is <= the target timestamp
    idxtemp = data2[data2['timestamp'] == tstamp_data2.asof(tstamploop)].index
    # Distances from the target to this row and to the next row
    delta1 = np.abs(data2.timestamp[idxtemp]     - tstamploop)
    delta2 = np.abs(data2.timestamp[idxtemp + 1] - tstamploop)

    # Keep whichever of the two neighbouring rows is closer
    if delta1.iloc[0] < delta2.iloc[0]:
        idxcollect.append(idxtemp)
    else:
        idxcollect.append(idxtemp + 1)

tstamp_data1 / tstamp_data2 are of dtype('<M8[ns]') (i.e. datetime64[ns]), computed from the epoch times in data1 and data2.
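For context, a conversion like the following produces that dtype (a minimal sketch; the epoch values here are made up, not taken from data1):

```python
import pandas as pd

# Epoch seconds, as they might appear in data1 -- values invented for illustration
epoch = pd.Series([1417718400.00, 1417718400.04, 1417718400.08])

# Convert to nanosecond-resolution timestamps; pandas reports this dtype as '<M8[ns]'
tstamp = pd.to_datetime(epoch, unit='s')
print(tstamp.dtype)  # datetime64[ns]
```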

The output I got is:

[Int64Index([809498], dtype='int64'), Int64Index([809499], dtype='int64'), Int64Index([809500], dtype='int64'), Int64Index([809501], dtype='int64'), Int64Index([809502], dtype='int64'), Int64Index([809503], dtype='int64'), Int64Index([809509], dtype='int64'), Int64Index([809513], dtype='int64'), Int64Index([809521], dtype='int64'), Int64Index([809533], dtype='int64')]

What I would like to do is slice the corresponding rows of data2 using the indices found above, with something as simple as:

data2.ix[ idxcollect[:11] ]

However, since each element of the list is an Int64Index rather than a plain integer, I am unable to do anything as simple as what I wanted. Is there any way out? Thank you for your time and help!
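For reference, a toy reproduction of the structure involved (index labels and column name are made up): each element of the collected list is a one-element Int64Index, so pulling out its single label gives a plain list of integers that label-based indexing accepts:

```python
import pandas as pd

# Toy stand-in for data2, indexed like the Int64Index output above
data2 = pd.DataFrame({'timestamp': pd.to_datetime([1.0, 2.0, 3.0], unit='s')},
                     index=[809498, 809499, 809500])

# Each element is a one-element Int64Index, as in the output shown above
idxcollect = [pd.Index([809498]), pd.Index([809500])]

# Extract the single integer label from each index object
flat = [ix[0] for ix in idxcollect]
print(data2.loc[flat])
```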

Harry MacDowel
  • Include some example data. You should only need ~10 rows to demonstrate what you're trying to accomplish. – Paul H Dec 04 '14 at 18:45

2 Answers


You may store the index of data2 as a list, turn the timestamps of data1 into a list, and create a new DataFrame for storing the matched rows:

data2indx = data2.index.tolist()
data1tm = data1['timestamp'].tolist()
data2sub = pd.DataFrame(columns = data2.columns)

Then, for each data1 timestamp, find the nearest entry and copy the matching row of data2 into data2sub:

for n, i in enumerate(data1tm):
    # Absolute difference between this data1 timestamp and every data2 entry
    c = [abs(i - j) for j in data2indx]
    # Position of the smallest difference, i.e. the nearest match
    index = c.index(min(c))
    data2sub.loc[n] = data2.iloc[index]

Maybe someone can contribute a more efficient approach.
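One candidate for a more efficient approach, sketched on toy data (an assumption on my part: it requires both time columns to be sorted, which sampled data usually is): np.searchsorted finds each insertion point in O(log n), and comparing the left and right neighbours then picks the nearest match without a Python-level scan per row:

```python
import numpy as np
import pandas as pd

# Toy timestamps (seconds since epoch), converted to int64 nanoseconds
t1 = pd.to_datetime([0.00, 1.03, 2.05], unit='s').values.astype('int64')
t2 = pd.to_datetime([0.01, 1.00, 2.00, 3.00], unit='s').values.astype('int64')

# Insertion points of each t1 value into the sorted t2
pos = np.searchsorted(t2, t1)
# Clip so that both candidates pos-1 and pos are valid positions
pos = np.clip(pos, 1, len(t2) - 1)
left, right = t2[pos - 1], t2[pos]
# Choose whichever neighbour is closer to the target
nearest = np.where(t1 - left <= right - t1, pos - 1, pos)
print(nearest)  # [0 1 2]
```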

Robbie Liu
  • Looks kind of promising. Thanks for the prompt reply. Let me try it out and see... – Harry MacDowel Dec 05 '14 at 03:55
  • Your function works perfectly for a small data range. However, it ate up all my 16 GB of RAM just processing 29.6 MB of data1 and 27.7 MB of data2...and then IPython decided to kill the kernel. Is there any way out? Thanks. – Harry MacDowel Dec 05 '14 at 17:18
  • @HarryMacDowel Maybe generated lists are too big so hog system resources. Try the revision. – Robbie Liu Dec 05 '14 at 18:33
  • Thanks Robbie. Although this is not the most efficient version, at least it does not eat up all the RAM. The whole execution takes ~60 hours to process a mere 820,000 rows of data. I hope someday someone can think of a better solution. For now, I am very grateful for your commitment to helping me move forward. – Harry MacDowel Dec 07 '14 at 11:42

I found a way to solve the speed problem. The thing is, searching for the nearest timestamp takes much more time than searching for the nearest float value.

So the trick is this: as you may have noticed in the data, I already have a timesec column.

(screenshot of the timesec column omitted)

What I did was set the first timestamp as 0 and, from then onwards, add the corresponding timedelta (calculated from the timestamps) to that baseline of 0. This yields the timesec column, an easy and fast computation.
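That baseline computation can be sketched like so (the column values are made up; the real data1 would supply its own timestamps):

```python
import pandas as pd

# Toy timestamps at slightly irregular intervals (epoch milliseconds)
ts = pd.Series(pd.to_datetime([1417718400000, 1417718400040, 1417718400090],
                              unit='ms'))

# Seconds elapsed since the first sample: the first entry becomes 0.0
timesec = (ts - ts.iloc[0]).dt.total_seconds()
print(timesec.tolist())
```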

In this question I asked about the "iterable" numbers, and as Robbie pointed out, the .tolist() function solves the problem of the nested list of lists. However, it takes ~60 hours to search for a mere 87,258 timestamps in another dataset. To speed this up, you can use timesec for a cleaner and faster search instead.

I implemented a simple getnearpos function from a previous Stack Overflow answer:

import numpy as np

def getnearpos(array, value):
    # Index of the element of array with the smallest absolute difference to value
    idx = (np.abs(array - value)).argmin()
    return idx

The search for 87,258 timestamps now turns into a search for 87,258 float numbers, and the search finishes in 1 h 1 min 23 s, a huge improvement over the ~60-hour mark.
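A quick usage sketch of getnearpos on toy timesec values (the numbers are made up for illustration):

```python
import numpy as np

def getnearpos(array, value):
    # Index of the element of array with the smallest absolute difference to value
    idx = (np.abs(array - value)).argmin()
    return idx

# Toy timesec values for a data2-like series, sampled every 0.05 s
timesec_data2 = np.array([0.00, 0.05, 0.10, 0.15])

print(getnearpos(timesec_data2, 0.06))  # 1
print(getnearpos(timesec_data2, 0.14))  # 3
```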

If anyone who happens upon this question knows of a faster implementation, do share it with me. I am really eager to learn! Thank you!

Harry MacDowel