-3

The data is sensitive and can't be published, the answer will be in the form of guesswork, which I will be perfectly happy with.

I'm trying to join a larger data.table with a smaller, and update some values. The smaller is loaded from a Excel spreadsheet via XLConnect.

The join is something like this:

d.tmp2 <- left_join(d.main, d.wb, by = c("Nr" = "Nr"))

The code executes but I don't get any matches (only NA:s), which is weird since these sample rows returns TRUE (as it should):

d.wb[1]$Nr == d.main[39]$Nr

[1] TRUE

Edit:

The problem is solved in itself and has to do with how joins in R are handled. The encoding in the data were displayed as "Unknown" but still UTF-8 while the data table imported from Excel was explicitly set to UTF-8. Apperently it's fine to do logical comparisons with different(not really different) encodings but not joins.

Setting the encoding with iconv solved this. Still, I don't understand why logical comparisons and joins differ this much, it makes little sense to me. There is probably something I don't understand working here.

ErrantBard
  • 1,421
  • 1
  • 21
  • 40
  • If we can't [reproduce](http://stackoverflow.com/questions/5963269), it will be just guesswork. – zx8754 Sep 21 '16 at 06:38
  • @zx8754 Yeah I know, but I don't see how I can reproduce it. 1) The data is sensitive, I can't post it publicly 2) The data and system probably needs to be exact for us to reproduce is it and that's not possibly So I happily settle for guesswork - I'm guessing something with encoding. It usually is when the errors are weird in R. – ErrantBard Sep 21 '16 at 07:06
  • I actually got a solution, below. I had to re-set the encoding to UTF-8 in `d.main` with iconv, and then the join worked. I guess the logical comparison is less picky with encoding, while joins are? `d.wb[1]` does in fact return 1 st row and str of `d.wb` returns the different columns and their values as well as the rest it should return – ErrantBard Sep 21 '16 at 07:43
  • why don't you use `data.table::merge` and `by.x` and `by.y` to join your `data.tables` ? – Cath Sep 21 '16 at 08:16
  • @Cath I rather would but I find the dplyr variant easier to read/understand, more lika a sql-join in written syntax. That might have changed in a dev version? If the data.tables were large I would do it, even so. – ErrantBard Sep 21 '16 at 08:21
  • 2
    well I don't see the point of loading `dplyr` to do a join on `data.tables` when `data.table::merge` does what you need : for a left join : `merge(d.main, d.wb, all.x=TRUE, by.x = "Nr", by.y = "Nr")` though here you can just put `by="Nr"` as both colnames are the same – Cath Sep 21 '16 at 08:24
  • 1
    +1 You'r right of course, in a future commit I might change it since dplyr-join is used quite a lot in that script. – ErrantBard Sep 21 '16 at 09:07

2 Answers2

1

After som testing I found the answer, I had to create a new column in d.main, again convert to UTF-8 using iconv and then join on the new variable. Then it worked.

So the error was with encoding, but I still don't see why the result from the logical comparison and dplyr's join are so different...

ErrantBard
  • 1,421
  • 1
  • 21
  • 40
0

with dplyr

d.tmp2 <- left_join(d.main, d.wb, by = "Nr")

with base R

d.tmp2 <- merge(d.main, d.wb, by = "Nr")

should work.

Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • Yeah I know - it should work but it doesn't. This is weirding me out - is the logical comparison less sensitive than dplyr or base joins, I just tried setting some of them explicitly and that worked as well (d.main[Nr == d.wb[19]$Nr, var_a := new value]) – ErrantBard Sep 21 '16 at 06:17