Merge 2 dataframes if value within range

Question

I have been struggling with this for some time now and couldn't find any way of doing it, so I would be incredibly grateful if you could help! I am a novice in programming and my code is probably inefficient, but this was the best I could come up with.

Basically, I have 2 .csv files (fixes.csv and zones.csv) which contain different variables and have different numbers of rows and columns. The first file fixes.csv contains eye movement data recorded during an experiment and looks something like this:

Order Participant Sentence Fixation StartPosition
1       1          1         1       -6.89
2       1          1         2       -5.88
3       1          1         3       -5.33
4       1          1         4       -4.09
5       1          1         5       -5.36

This contains eye movement recordings made during sentence reading. What happens is that each of 20 participants reads a set of 40 12-word sentences, making several fixations on different words in each sentence, and sometimes going back to look at previously read words. The StartPosition column contains the position on the screen (in degrees of visual angle) where the fixation started. Values are generally between -8deg and 8deg.

The second file zones.csv contains information about the sentences. Each of the 40 sentences contains 12 words, and each word forms one zone of interest. zones.csv looks something like this:

Sentence     Zone  ZoneStart   ZoneEnd
  1           1     -8.86      -7.49
  1           2     -7.49      -5.89
  1           3     -5.88      -4.51
  1           4     -4.51      -2.90

ZoneStart and ZoneEnd indicate the starting and ending coordinates of each zone on the screen (in deg of visual angle). Because the words in each sentence are different, each zone has a width.

What I would like to do is use both files simultaneously in order to assign zone numbers from zones.csv to fixations from fixes.csv. So for example, if the first fixation starting position in Sentence 1 falls within the range of Zone 1, I want the value 1 to be assigned to it so that the end file looks something like this:

Order Participant Sentence Fixation StartPosition Zone
1       1          1        1        -6.89          2
2       1          1        2        -5.88          2
3       1          1        3        -5.33          3
4       1          1        4        -4.09          3
5       1          1        5        -5.36          3

What I have tried so far is using a loop to automate the process.

zones = read.csv(file.choose(), header = TRUE, sep = ",")
fixes = read.csv(file.choose(), header = TRUE, sep = ",")

fixes$SentNo = as.factor(fixes$SentNo)
zones$Sentence = as.factor(zones$Sentence)
zones$Zone = as.factor(zones$Zone)

nfix = nrow(fixes) ## number of fixations in file fixes.csv
nsent = nlevels(fixes$Sentence) ## number of sentences in data file fixes.csv
nzs = nlevels(zones1$Zone) ## number of zones per sentence from file zones.csv
nsz = nlevels(zones$Sentence) ## number of sentences in data file zones.csv

fixes$Zone = 0

for (i in c(1:nfix)){
  for (j in c(1:nzs)){
    for (k in c(1:nsent){
      for (l in c(1:nsz)){ 
        while(fixes$Sentence[k] == zones$Sentence[l]){
          ifelse(fixes$StartPosition[i] > zones$ZoneStart[j]  
          & fixes$StratPosition[i] < zones1$ZoneEnd[j], 
          fixes$Zone[i] -> zones1$Zone[j], 0)
        return(fixes$Zone)
}
}
}
}

But this just returns loads of zeros, rather than assigning a zone number to each fixation. Is it even possible to use 2 separate .csv files in this way when they have different numbers of rows and columns? I tried merging them by Sentence and working from a large combined file, but that didn't help, as it seemed to mess up the order of fixations in one file and the order of zones in the other.

Any help would be greatly appreciated!

Thank you!

As per this [SO answer](https://stackoverflow.com/questions/66372705/non-equi-join-in-tidyverse), dplyr now can support non-equi joins in the development version. — hnagaty, Jan 18 '23 at 11:42

score 7 · Answer 1 · edited Jun 20 '20 at 09:12

With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins and range joins:

library(data.table)
setDT(fixes)[setDT(zones), 
             on = .(Sentence, StartPosition >= ZoneStart, StartPosition < ZoneEnd), 
             Zone := Zone][]

   Order Participant Sentence Fixation StartPosition Zone
1:     1           1        1        1         -6.89    2
2:     2           1        1        2         -5.88    3
3:     3           1        1        3         -5.33    3
4:     4           1        1        4         -4.09    4
5:     5           1        1        5         -5.36    3

Data

fixes <- readr::read_table(
  "Order Participant Sentence Fixation StartPosition
  1       1          1         1       -6.89
  2       1          1         2       -5.88
  3       1          1         3       -5.33
  4       1          1         4       -4.09
  5       1          1         5       -5.36"
)
zones <- readr::read_table(
  "Sentence     Zone  ZoneStart   ZoneEnd
  1           1     -8.86      -7.49
  1           2     -7.49      -5.89
  1           3     -5.88      -4.51
  1           4     -4.51      -2.90"
)

Michael Lawrence · Answer 2 · 2014-05-29T21:42:58.073

There is a package in Bioconductor called IRanges that does what you want.

First, form an IRanges object for your zones:

zone.ranges <- with(zones, IRanges(ZoneStart, ZoneEnd))

Next, find the overlaps:

zone.ind <- findOverlaps(fixes$StartPosition, zone.ranges, select="arbitrary")

Now you have indices into the rows of the zones data frame, so you can merge:

fixes$Zone <- zones$Zone[zone.ind]

Edit: Just realized you have floating point values, while IRanges is integer-based. So you would need to multiply the coordinates by 100, given your precision.

score 2 · Answer 3 · answered May 29 '14 at 14:10

You can use sqldf package:

library(sqldf)

#dummy data
fixes <- read.table(text="
Order Participant Sentence Fixation StartPosition
1       1          1         1       -6.89
2       1          1         2       -5.88
3       1          1         3       -5.33
4       1          1         4       -4.09
5       1          1         5       -5.36 
",header=TRUE)
zones <- read.table(text="
Sentence     Zone  ZoneStart   ZoneEnd
1           1     -8.86      -7.49
1           2     -7.49      -5.89
1           3     -5.88      -4.51
1           4     -4.51      -2.90
",header=TRUE)

#output merged result
res <- 
  sqldf("SELECT [Order],Participant,f.Sentence,Fixation,StartPosition,Zone
       FROM fixes f,zones z
       WHERE f.Sentence=z.Sentence AND
             f.StartPosition>=z.ZoneStart AND
             f.StartPosition<z.ZoneEnd")

score 2 · Answer 4 · answered Apr 12 '22 at 20:04

Adding a simple dplyr solution in case it's useful to others. The data from this question was like this:

df1<- data.frame(Order=1:5, Participant=rep(1,5), 
      Sentence=rep(1,5), Fixation=1:5,StartPosition=c(-6.89,-5.88, -5.33,-4.09,-5.36))
df2<- data.frame(Sentence=rep(1,4),Zone=1:4,ZoneStart=c(-8.86,-7.49,-5.88,-4.51),
      ZoneEnd=c(-7.49,-5.89,-4.51,-2.90))

With dplyr, you can use full_join() in combination with filter() to join the desired value from the range.

 library(dplyr)
 df1 |> full_join(df2, by="Sentence") |> 
        filter(StartPosition >= ZoneStart, StartPosition <= ZoneEnd ) |>
        select(-c(ZoneStart,ZoneEnd))

Result:

 Order Participant Sentence Fixation StartPosition Zone
1     1           1        1        1         -6.89    2
2     2           1        1        2         -5.88    3
3     3           1        1        3         -5.33    3
4     4           1        1        4         -4.09    4
5     5           1        1        5         -5.36    3

Señor O · Answer 5 · 2014-05-29T14:10:17.037

I think the best approach is to change zones to a more friendly format for what you're doing:

ZoneLookUp = lapply(split(zones, zones$Sentence), function(x) c(x$ZoneStart, x$ZoneEnd[nrow(x)]))

#$`1`
#[1] -8.86 -7.49 -5.88 -4.51 -2.90

Then you can easily look up each zone:

fixes$Zone = NULL
for(i in 1:nrow(fixes))
    fixes$Zone[i] = cut(fixes$StartPosition[i], ZoneLookUp[[fixes$Sentence[i]]], labels=FALSE)

If performance is an issue, you can take a (only) slightly less simple approach using by or data.table with by.

Merge 2 dataframes if value within range

5 Answers5

Data

Linked

Related