
I have a large distance matrix (about 3GB) that looks like this:

type         street1    street2     street3
coffee       2          1           19
restaurant   3          12          4
restaurant   4          3           2
bar          5          9           7
tram         6          16          1

It can be reproduced with:

street1 <- c(2, 3, 4, 5, 6)
street2 <- c(1, 12, 3, 9, 16)
street3 <- c(19, 4, 2, 7, 1)
type <- c("coffee", "restaurant", "restaurant", "bar", "tram")
df <- data.frame(type, street1, street2, street3)

The actual data is a few thousand columns by a few thousand rows. I want to find the first, second, third, etc. closest 'types' for each column ('street'). Ideally, the output would look something like this:

street    closest.1    closest.2    closest.3   distclosest.1 distclosest.2  etc.
street1   coffee       restaurant   restaurant  2              3
street2   coffee       restaurant   bar         1              3
street3   tram         restaurant   restaurant  1              2

The output should therefore also preserve the distances to the closest types. When two types are at an equal distance, either one of them can be chosen.
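One possible way to produce this layout, shown as a minimal sketch on the toy data rather than a tested solution for the full 3GB matrix, is to take `order()` of each column and keep the first `k` row indices. The names `dist_mat`, `k`, `idx`, `cols` and `result` are assumptions introduced for illustration; the distances are held in a numeric matrix and the types in a separate vector, so duplicate types such as "restaurant" cause no problems.

```r
# Toy data from the question
street1 <- c(2, 3, 4, 5, 6)
street2 <- c(1, 12, 3, 9, 16)
street3 <- c(19, 4, 2, 7, 1)
type <- c("coffee", "restaurant", "restaurant", "bar", "tram")

# Distances as a numeric matrix; the type labels stay in their own vector
dist_mat <- cbind(street1, street2, street3)

k <- 3  # assumption: how many nearest types to keep per street

# For each column, the row indices of the k smallest distances.
# order() breaks ties by position, so the first tied row is chosen.
idx <- apply(dist_mat, 2, function(x) order(x)[seq_len(k)])

# Look up labels and distances for those (row, column) pairs
cols <- rep(seq_len(ncol(dist_mat)), each = k)
closest <- matrix(type[as.vector(idx)], nrow = k)
distclosest <- matrix(dist_mat[cbind(as.vector(idx), cols)], nrow = k)

# One row per street; matrix arguments expand to closest.1, closest.2, ...
result <- data.frame(street = colnames(dist_mat),
                     closest = t(closest),
                     distclosest = t(distclosest))
print(result)
```

This sorts every column in full, which should be acceptable for a few thousand rows per column; only the first `k` positions of each ordering are kept.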

I have managed to select the single closest type using code that includes the following (after setting the first 'type' column as row names):

[apply(df,2,which.min)]

However, I don't know how to extend this to the second, third, etc. closest.

Naturally, I have looked at related questions. For example, I have tried all the answers provided here:

Fastest way to find *the index* of the second (third...) highest/lowest value in vector or column

or

Fastest way to find second (third...) highest/lowest value in vector or column

But they either gave me errors, or I couldn't tweak them into my preferred output (due to my limited R knowledge), or, because of the size of the file, they took too long to run.

I also tried a different route: replacing the minimum value in each column with something like 1000000, so that I could use which.min again (which is, I guess, a rather cumbersome way). For this I tried the code given as an answer in:

Replace maximum value of each column

But it gave me a number of errors, and the variations I tried also replaced values in other columns.
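For what it is worth, the masking idea can be sketched so that only the cell holding each column's minimum is overwritten. This is a rough illustration on the toy data, not the code from the linked answer: the names `dist_mat`, `work`, `pos`, `closest` and `distclosest` are assumptions, and `Inf` is used in place of 1000000 as the masking value. The key step is indexing with a two-column (row, column) matrix, which touches exactly one cell per column.

```r
# Toy data from the question
type <- c("coffee", "restaurant", "restaurant", "bar", "tram")
dist_mat <- cbind(street1 = c(2, 3, 4, 5, 6),
                  street2 = c(1, 12, 3, 9, 16),
                  street3 = c(19, 4, 2, 7, 1))

k <- 3                      # assumption: number of nearest types wanted
work <- dist_mat            # working copy, so the original distances survive
n_col <- ncol(work)

closest <- matrix(NA_character_, nrow = n_col, ncol = k,
                  dimnames = list(colnames(dist_mat), NULL))
distclosest <- matrix(NA_real_, nrow = n_col, ncol = k,
                      dimnames = list(colnames(dist_mat), NULL))

# The loop runs k times (once per rank), not once per cell
for (j in seq_len(k)) {
  rows <- apply(work, 2, which.min)      # row of the current minimum per column
  pos <- cbind(rows, seq_len(n_col))     # (row, column) pairs, one per column
  closest[, j] <- type[rows]
  distclosest[, j] <- work[pos]
  work[pos] <- Inf                       # mask only those cells
}

print(closest)
print(distclosest)
```

Because `work[pos] <- Inf` addresses individual (row, column) pairs, values in other columns are left alone, and `which.min` picks the first row when distances tie.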

Any thoughts on how to tackle this issue? Thanks so much in advance!

  • Perhaps you could "delete" the found min from your "working data.frame", to get the second nearest and iterate over that? – Christoph Jul 14 '16 at 11:05
  • That would be a nice possibility, yet would you have a suggestion for an approach that would work for a large file? (hence, no loops etc) – zoekdestep Jul 14 '16 at 12:36
  • Perhaps somebody has a solution when you supply a reproducible example. – Christoph Jul 14 '16 at 12:45
  • Please tell me what you are missing from above - in what way is it not a reproducible example? – zoekdestep Jul 14 '16 at 12:48
