
I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. So, for example, two dataframes containing the vectors:

person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30))

For this example, each person in the "person" dataframe has a number of tickets from a carnival game, and each prize in the "prize" dataframe has a cost. But I'm not looking for perfect matches; instead of simply buying a prize, they randomly receive any prize that is within a 5-ticket cost tolerance of what they have.

The output I'm looking for is a dataframe of all the possible prizes each person could win. It would be something like:

    person      prize
1   Albert stickyhand
2   Albert   moodring
3   Albert   figurine
4   Albert    whistle
5    Becca   moodring
6    Becca   figurine
       ...        ...

And so on. Right now, I'm doing this with lapply(), but this is really no faster than a for() loop in R.

library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
  # For each person, find all prizes whose cost is within the tolerance
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets - cost) <= tolerance)$type)
  # Flatten the per-person lists into a long two-column data frame
  longlist <- data.frame("person" = rep(names(matchlist),
                                        times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist)))
  return(longlist)
}
matching_Function(person, prize)

My actual datasets are much larger (in the hundreds of thousands), and my matching conditions are more complicated (checking coordinates from B to see whether they are within a set radius of coordinates from A), so this is taking forever (several hours).

Are there any smarter ways than for() and lapply() to solve this?

i.Mik
  • Seems like this would be a pretty easy win for Rcpp, particularly since a naive `for` loop approach would be easy to write and translate to C++. – joran Sep 20 '16 at 14:15
  • Maybe `dplyr` with `tidyr` to give `person %>% rowwise() %>% mutate(prize = paste(prize$type[which(prize$cost >= tickets - 5 & prize$cost <= tickets + 5)], collapse = ',')) %>% separate_rows(prize, sep = ',') %>% select(-tickets)` although I'm not sure about its performance – Sotos Sep 20 '16 at 14:40
  • @I.mik updated my answer with a coordinate selection to cover the second part of your question :) – Tensibai Sep 21 '16 at 15:08
  • BTW, congrats for your Q, nicely introduced and well crafted. Sad it was overly simplified against your real use case; this made the answers miss the goal. – Tensibai Sep 21 '16 at 20:19
  • @Tensibai Appreciate it! I've been consulting StackOverflow for ages, never as an actual user, and was excited to finally have a question that hadn't already been answered in some form or another. – i.Mik Sep 21 '16 at 21:14

2 Answers


An alternative with `foverlaps` from data.table that does what you wish:

require(data.table)

tolerance <- 5

# Turn the data.frames into data.tables
setDT(person)
setDT(prize)
# Add the min and max of the tolerance range around each person's tickets
person[,`:=`(start = tickets - tolerance, end = tickets + tolerance)]
# Add a dummy column so cost can be used as a zero-width range
prize[, dummy := cost]
# Key the person table on start and end
setkey(person, start, end)
# Use foverlaps to find, for each prize, the persons whose range contains its cost,
# drop the non-matches and return only the name and type of prize
r <- foverlaps(prize, person, type = "within",
               by.x = c("cost", "dummy"))[!is.na(name), list(name = name, prize = type)]
# Reorder the result by name instead of by prize cost
setorder(r, name)

Output:

       name      prize
 1:  Albert stickyhand
 2:  Albert   moodring
 3:  Albert   figurine
 4:  Albert    whistle
 5:   Becca   moodring
 6:   Becca   figurine
 7:   Becca    whistle
 8:  Celine   lollipop
 9:  Celine      yo-yo
10:  Celine stickyhand
11:  Celine   moodring
12: Dagwood      yo-yo
13: Dagwood stickyhand
14: Dagwood   moodring

I hope I commented the code enough for it to be self-explanatory.
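
As an aside (not part of the original answer), more recent data.table releases (1.9.8+) also support non-equi joins, which can express the same tolerance match directly, reusing the `start` and `end` columns created above; a minimal sketch:

# Hypothetical non-equi join alternative, assuming data.table >= 1.9.8;
# reuses the start/end columns from above, no dummy column or keying needed
res <- prize[person, .(name, prize = type),
             on = .(cost >= start, cost <= end), nomatch = 0L]
setorder(res, name)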


For the second part of the question, using coordinates and testing within a radius.

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"), 
                         x = c(26, 16, 32, 51), 
                         y = c(92, 51, 25, 4)), 
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"), 
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17), 
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)), 
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")

setDT(person)
setDT(antenas)
r <- 10  # radius

# For each person, compute the squared distance to every antenna in j and
# keep only the antennas within radius r
results <- person[, {
  dx <- x - antenas$x
  dy <- y - antenas$y
  list(antena = antenas$name[dx^2 + dy^2 <= r^2])
}, by = name]

data.table allows expressions in j, so we can do the maths for each person against every antenna and return only the relevant rows with the antenna name.

This should not consume too much memory, as it is done for each person rather than for the whole cross product at once.

Maths inspired by this question

This gives:

> results
     name antena
1:  Becca      L
2: Celine      G
3: Celine      H
Tensibai
  • This does rather slickly solve my example problem. However (through no fault of your own), I don't think it can be generalized to my actual data. I simplified the tolerance in the example to be distance on a one-dimensional scale. But in my actual dataset, it's a two-dimensional distance that's solved by an equation. So the min/max distance can't be represented by one 'start' and 'end' number. Upvoted anyway since it does solve the example and others may find it useful! – i.Mik Sep 20 '16 at 16:28
  • @I.mik you may post a follow-up question. But I'm pretty sure it can be done at worst with 2 foverlaps calls and a merge at the end. – Tensibai Sep 20 '16 at 17:11
  • @I.Mik this could be even simpler according to [this](http://stackoverflow.com/q/481144/3627607). I'll update tomorrow – Tensibai Sep 20 '16 at 17:24
  • Hm. Currently I'm using a custom function that inputs (x1, y1, x2, y2) as _angular (degree)_ coordinates and outputs _linear_ distance. So my logic thus far has been 1: Calculate distance; 2: Check to see if it's under the tolerance. I suppose I could re-write the code to take in (x1, y1) and output an equation of maximum (x2, y2) values... but either way, it's not something where I can just add a column of x-cutoffs and y-cutoffs (since they change with respect to each other). I'm going to need to think about whether I can make that work. – i.Mik Sep 20 '16 at 19:10
  • @I.Mik I've a rough idea of how this could work. If you can add sample datasets to your Q, I think I'll be able to solve it. (Again, according to SO guidelines this should be a new question; I leave it up to you how you wish to manage it. In case of a new question, add the link here so I can have a look at it tomorrow ;)) – Tensibai Sep 20 '16 at 19:16
  • Thanks for the edit! My math equation is actually different (it's a circle drawn on the surface of a sphere), but using the basic structure of your solution and substituting my own math eventually got me exactly where I needed to be. For a small subset of my actual data, `microbenchmark()` finds that this is modestly faster than my old `lapply()` method. Many thanks! – i.Mik Sep 21 '16 at 17:35
  • @I.Mik glad it helped ;) if it's not really faster, maybe there's room for improvement in the selection function – Tensibai Sep 21 '16 at 18:34
  • And what's the ratio of circle radius to sphere radius? (What I used above should fit any cellular localization in any country; the relative ratio would be really low. Even for GPS maths (outside extreme cases near the poles) this should fit, as we're always looking at a specific point on the globe.) I'm really curious about the application (as far as you're allowed to share), if you don't mind ;) – Tensibai Sep 21 '16 at 18:55
  • I can give a little bit of detail: the idea is to find all census tracts in the US within a 5-mile proximity to a list of certain facilities. So the circle is tiny, you're right about that. But my coordinates are lat/lon degrees, so I have to use the Haversine formula to determine linear distance. Going 1 degree West when you're 45 degrees North (e.g. South Dakota) is about a 79 km journey. Doing the same from 30 degrees North (e.g. Texas) is about 96 km. So maybe I misspoke; it's more an issue of my inputs being 3D *angular* data and needing to evaluate 2D *linear* data (see the sketch after this comment thread). – i.Mik Sep 21 '16 at 21:12
  • @I.mik I see how computing a basic difference in lat/long coordinates could be a problem. But IIRC there are packages doing map projection, at least `mapproject` from [here](https://www.google.fr/url?sa=t&source=web&rct=j&url=https://cran.r-project.org/web/packages/mapproj/mapproj.pdf&ved=0ahUKEwijr9-Aw6HPAhVKlxoKHR84AHAQFggbMAA&usg=AFQjCNGJbt-6JmnEh2n2ToOSI8Og-Tm6fA&sig2=-vrkDQjx3I8DYSZGZTwouQ); this would simplify the maths at the end. It is one more step, but as it's done once, I'm pretty sure it would speed up the overall process. – Tensibai Sep 21 '16 at 22:40
  • ATEOTD, if I had to add one more thing: post a new question with dummy data reflecting your actual dataset and a short description of how you currently tackle it. I'm pretty sure there are people better than me who would be interested (and I would be too). If you can show your actual code, we can benchmark against it to see whether we have a better approach or not ;) – Tensibai Sep 21 '16 at 22:43
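
Following up on the lat/lon discussion above, here is a hypothetical sketch (not from either answer) of how the per-group j-expression pattern from the antenna example might be adapted to angular coordinates and a 5-mile radius using the Haversine formula; the `facilities` and `tracts` tables and the `haversine_m()` helper are made-up illustrations:

library(data.table)

# Great-circle distance in metres between points given in decimal degrees
haversine_m <- function(lat1, lon1, lat2, lon2, R = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# Made-up example data standing in for the real facilities and census tracts
facilities <- data.table(facility = c("F1", "F2"),
                         lat = c(44.98, 30.27),
                         lon = c(-93.27, -97.74))
tracts <- data.table(tract = c("T1", "T2", "T3"),
                     lat = c(45.00, 30.30, 40.71),
                     lon = c(-93.30, -97.70, -74.01))

radius_m <- 5 * 1609.34  # 5 miles in metres

# Same pattern as the antenna example: per facility, compute the distance to
# every tract in j and keep only the tracts within the radius
matches <- facilities[, {
  d <- haversine_m(lat, lon, tracts$lat, tracts$lon)
  list(tract = tracts$tract[d <= radius_m])
}, by = facility]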

This is pretty simple to do with your test data and a cross join:

library(data.table)

setDT(person)
setDT(prize)

# Add a constant dummy key so the merge produces every person/prize combination
person[, JA := 1]
prize[, JA := 1]

# Cross join, then keep only the pairs within the 5-ticket tolerance
merge(person, prize, by = "JA", allow.cartesian = TRUE)[abs(tickets - cost) < 6, .(name, type)]

#       name       type
# 1:  Albert stickyhand
# 2:  Albert   moodring
# 3:  Albert   figurine
# 4:  Albert    whistle
# 5:   Becca   moodring
# 6:   Becca   figurine
# 7:   Becca    whistle
# 8:  Celine   lollipop
# 9:  Celine      yo-yo
# 10:  Celine stickyhand
# 11:  Celine   moodring
# 12: Dagwood      yo-yo
# 13: Dagwood stickyhand
# 14: Dagwood   moodring

What we are doing is a cross join of every person with every prize, and then excluding any rows that do not meet the criterion.

However, if this is a cross join of 100,000 rows against 100,000 rows, you may run out of memory with this approach. In that case I would parallelize:

library(data.table)
library(foreach)
library(doParallel)

setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]

# Build the row ranges for each batch, e.g. "1:500", "501:1000", ...
seq_s <- seq(1, nrow(person), by = 500) # change the 500 here based on memory/speed tradeoff
ln_s <- length(seq_s)
str_seq <- paste0(seq_s, ":", c(seq_s[2:ln_s], nrow(person) + 1) - 1)

cl <- makeCluster(4)
registerDoParallel(cl)

# Cross join and filter one batch of persons per iteration on the workers
ls <- foreach(i = 1:ln_s) %dopar% {

  library(data.table)
  person_batch <- person[eval(parse(text = str_seq[i]))]
  Output <- merge(person_batch, prize, by = "JA", allow.cartesian = TRUE)
  Output <- Output[abs(tickets - cost) < 6, .(name, type)]
}

stopCluster(cl)

# Combine the batch results and drop any duplicates
Output <- unique(do.call(rbind, ls))

This is essentially the same process, just split into smaller batches that will not run into memory limits, because we filter each batch on the fly.
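
For reference, the same batching idea can also be run sequentially (no parallel backend), which may already be enough when memory rather than CPU is the constraint; a minimal sketch assuming the `JA` dummy key from above:

library(data.table)

batch_size <- 500  # tune based on the memory/speed tradeoff
starts <- seq(1, nrow(person), by = batch_size)

# Cross join and filter one batch of persons at a time, then combine
batches <- lapply(starts, function(s) {
  idx <- s:min(s + batch_size - 1, nrow(person))
  merge(person[idx], prize, by = "JA", allow.cartesian = TRUE)[
    abs(tickets - cost) < 6, .(name, type)]
})
Output <- unique(rbindlist(batches))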

Chris
  • Hm, I haven't done parallelization before; I'll have to look into incorporating it into my R toolbox. Even before approaching memory limits, though, is this really going to be faster than `lapply()`? Testing it on data about 0.1% the size of my final data, it doesn't seem to be any faster (without parallelizing, at least). Regardless, thanks for the intro to parallelization! – i.Mik Sep 20 '16 at 18:54
  • @i.Mik this should be much faster than your initial function - you have lots of initialization and subsetting, while a merge + filter should be quicker. try it with data 1000 rows long and use `microbenchmark` – Chris Sep 20 '16 at 19:13
  • So, some testing shows that your method is obviously much better than mine for the example data. For some reason, tweaking it to solve my *actual* data, `microbenchmark` shows it running quite a bit *slower*. I'm not sure why that is; some quirk of my distance formula, perhaps. But that's more my fault than anything, since my example wasn't a perfect analogy to my real data. I'll have to dig in and figure out why that is. Thanks for your answer! – i.Mik Sep 21 '16 at 17:39
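
A hypothetical timing harness for the comparison discussed in these comments; `lapply_version()` and `merge_version()` are assumed wrappers around the two approaches, each working on its own copy of the inputs so repeated runs stay comparable:

library(microbenchmark)

# lapply_version() and merge_version() are placeholder wrappers for the
# original lapply() approach and the cross-join approach above
microbenchmark(
  lapply_approach = lapply_version(person, prize),
  merge_approach  = merge_version(person, prize),
  times = 50
)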