0

I'm struggling with joining two data sets

 #df1

  id   name1
   1    a
   2    b
   3    c

and

   $df2

  id     name2
  1       c
  2       d

I try to join them by their id

  library(dplyr)


 result <- left_join(df1, df2, by="id")

it gives me the following error

Error: cannot join on columns 'id' x 'id': Can't join on 'id' x 'id' because of incompatible types (factor / integer)

because they have different classes:

 sapply(df1, class)
        id       name1
     "factor"       "factor"    


sapply(df2, class)
        id       name2
     "integer"       "factor"

I tried to change the classes to make them similar

 df1$id <- as.integer (df1$id)

but , it doesn't help to find the common rows in two datasets. ( it can not recognize similar "id"s in df2)

MFR
  • 2,049
  • 3
  • 29
  • 53
  • Please be more detailed, rather than just saying "this doesn't work". Does it return an error? Is the output different than what you expected (if so, how was it different)? – rosscova Oct 18 '16 at 00:05
  • @rosscova I added the error, it does not give ma an output – MFR Oct 18 '16 at 00:10
  • That looks like the error before you changed the class of `df1$id` to integer, how about after you made that change? – rosscova Oct 18 '16 at 00:12
  • 1
    Could you try with `df1$id <- as.integer (as.character(df1$id))` – Silence Dogood Oct 18 '16 at 00:13
  • use dput to save df1 and df2, so that others will be able to reproduce – user31264 Oct 18 '16 at 00:15
  • @Osssan thank you so much it worked. Do you know why? – MFR Oct 18 '16 at 00:24
  • Sorry guys for not being clear. @rosscova when I convert to integer it doesn't give me any error but it does not find similar "id"s in df2. – MFR Oct 18 '16 at 00:25
  • Ahh, of course, good pick-up @Ossan (turn it into an answer below?). When converting from `factor` to `integer`, R will extract the factor reference number, instead of your data. @Ossan's suggestion gets around that by forcing R to convert to character first (which gets your data) before converting to integer. – rosscova Oct 18 '16 at 00:27
  • I see. thanks for clear explanation rosscova. @osssan I am happy to accept your answer if you post it below. – MFR Oct 18 '16 at 00:29

2 Answers2

1

From help page: as.numeric(levels(f))[f] is recommended instead of as.numeric(as.character(f)).

The issue with factor => numeric/integer conversion has been comprehensively answered by @Joshua Ulrich here.

Seek and ye shall find but user needs to know what to look for to reach the answer.

The Warning message in documentation for ?factor

The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

This step could be avoided by ensuring stringsAsFactors=FALSE while reading input data to side-step conversion of character variables to factors unless they are absolutely essential i.e. when levels of factors are required in analysis.

Community
  • 1
  • 1
Silence Dogood
  • 3,587
  • 1
  • 13
  • 17
0

I was running into the same problems with just going from characters to numeric and joining tables. I tried to go to numbers and it didn't work, even with the above method.

I had to go to as.integers(levels(df1$id))[df1$id] to make it work.

I tried using as.numeric(levels(df1$id))[df1$id] and it would take all of my values to NA.

Hope this helps!

Luke Holcomb
  • 165
  • 1
  • 8