1

I have the below code:

options(java.parameters = "-Xmx4000m")
require(xlsx)
library(plyr)

setwd("~/PycharmProjects/CatScrape")
rm(list=ls(all=TRUE))
jgc <- function() .jcall("java/lang/System", method = "gc")

Master <- read.xlsx2("MASTER.xlsx", sheetIndex = 2, startRow = 1, colIndex=4,endRow = 10000, as.data.frame = TRUE, header=TRUE)

Dutch_Stage <- read.xlsx2("languages/Dutch.xlsx", sheetIndex = 1, startRow = 1, colIndex=c(5,8),endRow = 10000, header=TRUE)
Dutch <- unique(Dutch_Stage)
rm(Dutch_Stage)
Dutch <- rename(Dutch, c("Key.s."="Key", "Status"="Dutch"))
jgc()
output <- merge(Master, Dutch, by="Key", all.Master = TRUE)

## OUTPUT RECORD NUMBER MATCHES MASTER

Finnish_Stage <- read.xlsx2("languages/Finnish.xlsx", sheetIndex = 1, startRow = 1, colIndex=c(5,8),endRow = 10000, header=TRUE)
Finnish <- unique(Finnish_Stage)
rm(Finnish_Stage)
Finnish <- rename(Finnish, c("Key.s."="Key", "Status"="Finnish"))
jgc()
output <- merge(output, Finnish, by="Key", all.output = TRUE)


## OUTPUT RECORD NUMBER INCREASES by 6

I have 12 more files to add, and when that happens, I end up with 25 times the number of records.

In this case, all.output = TRUE is set to all of the files, and my goal is to just show the records from Master, and the associations to those records. I don't want the additional records.

This makes me think this is not a true "left join". How do I make it just a "LEFT JOIN"?

Thanks

arcee123
  • 101
  • 9
  • 41
  • 118
  • 1
    See https://www.r-bloggers.com/joining-data-frames-in-r/ . I believe you want `merge(..., all.x=TRUE)` – aichao Oct 07 '16 at 17:55
  • 2
    I think maybe you've just badly misread the documentation for `merge`. When the documentation says that an argument is called `all.x` that means the argument is actually `all.x`, not `all.`. – joran Oct 07 '16 at 17:56
  • ok. tried it. same result. what else you got? – arcee123 Oct 07 '16 at 18:01
  • 4
    If your "right side" data frame has duplicated key values, it will duplicate rows in your "left side" data frame. That's how a left join works. – joran Oct 07 '16 at 18:07
  • 2
    Getting more rows back does not give any indication that '*this is not a true "left join".*' Even inner joins will return more rows if there are multiple matches. – Gregor Thomas Oct 07 '16 at 18:28
  • @dww I was about to propose that duplicate too, but I think the real problem here isn't how to do a left join but OP's understanding of what a left join is. - [LEFT JOIN vs LEFT OUTER JOIN](http://stackoverflow.com/a/4401540/903061) might be a better dupe. – Gregor Thomas Oct 07 '16 at 18:30

2 Answers2

4

A little too long for a comment so I'll give my two cents in an answer:

Overall its hard to answer your question directly without sample data. But, generally if you want a left join you should do:

merge(x,y,all.x=T). 

The above code is saying keep all the observations from your x dataset (your left dataset)

However, I suspect that that this won't solve your issue since you are getting MORE and not LESS observations than expected. I think your issue is likely with unique(Finnish_Stage). Using this unique statement doesn't necessarily mean your Key variable only has 1 observation for each value (if you have other variables mapped to the same key value that could be an issue). For example try this: unique(data.frame(Key=c(1,1,2),value=c(1,2,3)))

Joins (including left joins) will merge everything together. Maybe someone else can explain this in words slightly better, but I think an example is the best way to show what happens:

Data:

d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(1,1,2))

Original Join (inner join)

merge(d1,d2,all.output=TRUE)

  y1 y2 y3
1  1  4  7
2  1  4  7
3  2  5  8

Here you can see that we have a duplicate row. This is because there were two y1=1 values in your d2 dataset so the join will merge these on twice.

The same thing also happens for a left join:

Left join

merge(d1,d2,all.x=TRUE)

  y1 y2 y3
1  1  4  7
2  1  4  7
3  2  5  8
4  3  6  9

So what's likely happening is you have multiple observations for each of your Key values which is then causing these to be merged on multiple times giving you more observations than you expect.

To solve this, I just make sure you should only ever see 1 value of your Key variable in the dataset you want to merge on (and if do expect this, make sure you nodupe based on only the Key variable). If you would expect to have more than 1 Key value, then the merge is actually working the way it should.

tldr: Left joins don't have to return the same number of rows as the original dataset.

Mike H.
  • 13,960
  • 2
  • 29
  • 39
  • that's what did it. My guys who gave me new excel sheets pre-duped the keys before the script got it's hands on it. – arcee123 Oct 07 '16 at 18:41
0

in dplyr: (make sure to import after plyr [as many of the earlier plyr functions need to be deprecated])

library(dplyr)

output <- 
    Master %>% 
    mutate(Key.s = Key) %>% 
    left_join(Dutch) %>% 
    left_join(Finnish) %>%
    distinct
leerssej
  • 14,260
  • 6
  • 48
  • 57