
I have a Pandas dataframe with 1M rows and 3 columns (TrackP, TrackPt, NumLongTracks), and I want to find pairs of 'matching' rows: for two rows to 'match', the differences between their values in column 1 (TrackP), column 2 (TrackPt) and column 3 (NumLongTracks) must all be within some bound, i.e. no more than ±1:

   TrackP  TrackPt  NumLongTracks
1    2801      544            102
2    2805      407             65
3    2802      587             70
4    2807      251            145
5    2802      543            101
6    2800      545            111

For this particular case you would retain only the pair of row 1 and row 5, because for this pair

TrackP(row 1) - TrackP(row 5) = -1, 
TrackPt(row 1) - TrackPt(row 5) = +1, 
NumLongTracks(row 1) - NumLongTracks(row 5) = +1

This is trivial when the values are exactly the same between rows, but I'm having trouble figuring out the best way to do this for this particular case.
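
To make the matching criterion concrete, here is a minimal sketch (my own illustration, not part of the original question) that rebuilds the table above and finds the matching pair with a brute-force O(n²) check; it only pins down the definition and would be far too slow for 1M rows:

import pandas as pd

track = pd.DataFrame(
    {"TrackP":        [2801, 2805, 2802, 2807, 2802, 2800],
     "TrackPt":       [544, 407, 587, 251, 543, 545],
     "NumLongTracks": [102, 65, 70, 145, 101, 111]},
    index=[1, 2, 3, 4, 5, 6],
)

# keep index pairs where every column differs by at most 1 in absolute value
pairs = []
for i in track.index:
    for j in track.index:
        if i < j and (track.loc[i] - track.loc[j]).abs().le(1).all():
            pairs.append((i, j))

print(pairs)  # [(1, 5)]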

Pronitron
  • I spent some time yesterday trying (and failing) to find a good solution for this. I found two excellent answers elsewhere on the site which work well if your dataset is closer to ~100k rows rather than ~1M. I'll link them here in case they provide inspiration, but unfortunately this is a very difficult problem to solve (as you can probably tell from the fact that you didn't get an answer yesterday even though your question was clear and upvoted). – sjw Nov 29 '19 at 11:12
  • [Answer using numpy broadcasting](https://stackoverflow.com/a/44601120/6866811). [Answer using SQLite](https://stackoverflow.com/a/42796283/6866811) – sjw Nov 29 '19 at 11:15
  • If it's not clear how to apply the linked answers to your dataset, I'll be happy to post example code as answer, but from my tests I don't think they're suitable for a dataset as large as yours. – sjw Nov 29 '19 at 11:18
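
For reference, here is a rough sketch of how the numpy-broadcasting idea from the first linked answer might be applied (my adaptation, reusing the `track` DataFrame built in the example above; it materialises an n×n boolean matrix, so it is only realistic at roughly the ~100k-row scale mentioned in the comments, not 1M):

import numpy as np

vals = track[["TrackP", "TrackPt", "NumLongTracks"]].to_numpy()

# pairwise |difference| <= 1 in every column: (n, 1, 3) vs (1, n, 3) -> (n, n)
within = (np.abs(vals[:, None, :] - vals[None, :, :]) <= 1).all(axis=2)
np.fill_diagonal(within, False)      # a row trivially matches itself; drop that
i, j = np.where(np.triu(within))     # upper triangle keeps each pair only once

print(list(zip(track.index[i], track.index[j])))  # [(1, 5)]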

1 Answer


I think it is easier to handle the columns as a single value for comparison.

# concatenate the three columns into a single string per row
# (TrackP: 4 digits, TrackPt: 3 digits, NumLongTracks: the rest)
tr = track.TrackP.astype(str) + track.TrackPt.astype(str) + track.NumLongTracks.astype(str)

# finding matching rows
matching = []
for r in tr:
    close   = (int(r[0:4]) - 1, int(r[0:4]) + 1)  # TrackP range 1 up/down
    ptRange = (int(r[4:7]) - 1, int(r[4:7]) + 1)  # TrackPt range 1 up/down
    nLRange = (int(r[7:])  - 1, int(r[7:])  + 1)  # NumLongTracks range 1 up/down

    for r2 in tr:
        if int(r2[0:4]) in close:           # TrackP in range
            if int(r2[4:7]) in ptRange:     # TrackPt in range
                if int(r2[7:]) in nLRange:  # NumLongTracks in range
                    pair = [r, r2]
                    matching.append(pair)


# matching now looks like:
# [['2801544102', '2802543101'], ['2802543101', '2801544102']]
# split each concatenated string back into the three columns
import collections
import pandas as pd

routes = collections.defaultdict(list)

for seq in matching:
    routes['TrackP'].append(int(seq[0][0:4]))
    routes['TrackPt'].append(int(seq[0][4:7]))
    routes['NumLongTracks'].append(int(seq[0][7:]))

Now you can easily convert it back into a dataframe:

df = pd.DataFrame.from_dict(dict(routes))

print(df)
   TrackP  TrackPt  NumLongTracks
0    2801      544            102
1    2802      543            101
powerPixie
  • I don't think this answer is correct I'm afraid. If the range is instead meant to be from -5 to +5, your code isn't checking if the value is in that range, but rather checks if it is exactly -5 or 5. I also don't think we can make the assumption that the numbers used will always have the same number of digits - this code breaks if TrackP has only 3 digits, for example. And finally, the question states that the real dataset is 1M rows. A nested Python loop of 1Mx1M iterations is probably going to be too slow to be practical. – sjw Nov 29 '19 at 11:19
  • @thesilkworm He said the range is +1 and -1 and we are dealing with integers. I made the code based on the data presented, not on assumptions. And about the performance... what do you suggest to replace the loop? – powerPixie Nov 29 '19 at 11:24
  • Only @Pronitron can say for sure regarding the intended definition of range and the possible values the data might take, so I won't comment further on that. Regarding performance, I'm afraid I don't have a good solution at this time (see my comments under the question and the linked answers) - it's a very difficult problem unfortunately. – sjw Nov 29 '19 at 11:33
  • Hi @thesilkworm, unfortunately, I may have specified this question rather too narrowly. In reality I would want the solution to be more general: I only chose a range of -1 to +1 and integer values for the data as an example. The range would change depending on which variables were chosen from the dataset, and those variables are typically floats rather than integers. – Pronitron Dec 01 '19 at 10:54
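
To illustrate the range point raised in the first comment above: membership in a two-element tuple only matches the two endpoints, whereas a full inclusive range (or an absolute-difference check) also covers the values in between. A tiny sketch:

x = 2801
close = (x - 5, x + 5)                 # only the endpoints 2796 and 2806
print(2799 in close)                   # False, although 2799 is within +/-5 of x
print(2799 in range(x - 5, x + 6))     # True: full inclusive integer range
print(abs(2799 - x) <= 5)              # True: equivalent check that also works for floats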