r - merge is only working if I do it twice?

Question

Can someone please explain the following output?

```
> "naics_new_code__c" %in% names(df)
[1] TRUE
> names(states)
[1] "application_state__c" "naics_new_code__c"    "loan_id"              "wa_credit_score__c"
> df = merge(df, states, by = "loan_id")
> "naics_new_code__c" %in% names(df)
[1] FALSE
> df = merge(df, states, by = "loan_id")
> "naics_new_code__c" %in% names(df)
[1] TRUE
```

So, as you can see, on the first merge, the field "naics_new_code__c" does not become attached to my df. However, on the second merge, which is completely redundant, it does. Why would this be happening?

NOTE: this is a theoretical question about R. Adding a reproducible example would not only be superfluous in this case, but would make the answer less general and less efficient for someone else with a similar problem to look up and answer for themselves.

Best,

Paul

Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — MrFlick, Jul 01 '15 at 00:02
@Shivam - I think we're all aware of what `names()` does. Knowing the *result* of `names(df)` in the current context would be helpful though. Changing `<-` to `=` for assignment in this case will almost certainly change nothing too. — thelatemail, Jul 01 '15 at 04:10
It turns out that "naics_new_code__c" was already in names(df), as thelatemail and Alex suggested. So the first merge created naics_new_code__c.x and naics_new_code__c.y, and the second merge then worked correctly. I have added the initial result of "naics_new_code__c"%in%names(df) to my example. I am not adding a reproducible example, both because it involves proprietary information and because I firmly believe it is unnecessary in this case to solve the problem, as Alex and thelatemail have proven. I know you need standards, but come on, not all questions are the same. — Paul, Jul 02 '15 at 16:37
A reproducible example **using your proprietary data** is certainly **not necessary**. But demonstrating the problem by providing a reproducible example of built-in or simulated data is expected of a good question, and would probably convert some of your down-votes to up-votes. — Gregor Thomas, Jul 02 '15 at 16:49
The first step of attempting to answer this question is to try to replicate the problem, which takes a little bit of work. This can be done efficiently if the asker provides reproducible code; doing so encourages high quality answers by showing that the asker is willing to put in a little work themself and making things easier for potential answerers. Conversely, a question that *doesn't* provide a reproducible example makes it seem like the asker doesn't care much, can't be bothered to demonstrate the problem, and expects potential answerers to do all the legwork. — Gregor Thomas, Jul 02 '15 at 16:55
Lastly, a reproducible example isolating the problem is a great first step when debugging. Based on your comments and edits, creating a reproducible example from the start and reading the documentation would make the question unnecessary. — Gregor Thomas, Jul 02 '15 at 17:00
@Gregor, while I think your first point can depend on circumstance, and sometimes the "reproduction of the problem" can be left in the abstract, I do see the validity of your second point: if I had made a reproducible example, I likely would have answered my own question. But since not everyone does this legwork every time they run into a bug, and many go to Stack Overflow instead, doesn't the question still have some value to the community? And given that, shouldn't I pose the question in the simplest, most general way that is still clear? — Paul, Jul 02 '15 at 17:47
Sorry, I ran out of time. Just wanted to add: I asked a concise question and got a concise answer. Isn't that OK sometimes? — Paul, Jul 02 '15 at 17:54
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82226/discussion-between-gregor-and-paul). — Gregor Thomas, Jul 02 '15 at 18:15

score 0 · Accepted Answer · answered Jul 02 '15 at 17:02

From the documentation of `merge`:

> If the columns in the data frames not used in merging have any common names, these have suffixes (".x" and ".y" by default) appended to try to make the names of the result unique. If this is not possible, an error is thrown.
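
The suffix behavior can be reproduced with simulated data. This is only a sketch: the column names are borrowed from the question, but the values and the small data frames are made up, and the real `df` and `states` presumably have more columns.

```r
# Simulated stand-ins for the data frames in the question (values are invented).
df <- data.frame(
  loan_id = 1:3,
  naics_new_code__c = c("11", "22", "33"),
  stringsAsFactors = FALSE
)
states <- data.frame(
  loan_id = 1:3,
  application_state__c = c("WA", "OR", "CA"),
  naics_new_code__c = c("11", "22", "33"),
  stringsAsFactors = FALSE
)

# First merge: naics_new_code__c is a non-key column on both sides,
# so merge() renames the two copies with the .x and .y suffixes.
once <- merge(df, states, by = "loan_id")
"naics_new_code__c" %in% names(once)   # FALSE

# Second merge: the left side now only has the suffixed copies,
# so the unsuffixed column from states no longer clashes and is kept as-is.
twice <- merge(once, states, by = "loan_id")
"naics_new_code__c" %in% names(twice)  # TRUE
```

Note that the second merge does not undo the first: `twice` still carries the `.x` and `.y` copies, plus a third copy under the original name.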

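Based on your `names()` results before and after, this seems to be the case: `naics_new_code__c` was in both data frames, so the first merge renamed the two copies to `naics_new_code__c.x` and `naics_new_code__c.y`. On the second merge, the name no longer clashed, so the column from `states` was added under its original name.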
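
One way to catch this before merging is to check which non-key names the two data frames share. The sketch below again uses invented data with the question's column names; dropping the shared columns from `states` is just one possible way to handle the clash, and it assumes the copy in `df` is the one worth keeping.

```r
# Simulated stand-ins for the data frames in the question (values are invented).
df <- data.frame(
  loan_id = 1:3,
  naics_new_code__c = c("11", "22", "33"),
  stringsAsFactors = FALSE
)
states <- data.frame(
  loan_id = 1:3,
  application_state__c = c("WA", "OR", "CA"),
  naics_new_code__c = c("11", "22", "33"),
  stringsAsFactors = FALSE
)

# Names present in both data frames, excluding the merge key.
shared <- setdiff(intersect(names(df), names(states)), "loan_id")
shared  # "naics_new_code__c"

# Drop the shared columns from one side so nothing needs a suffix.
states_trimmed <- states[, setdiff(names(states), shared), drop = FALSE]
merged <- merge(df, states_trimmed, by = "loan_id")
names(merged)
```

If both copies are needed, the `suffixes` argument of `merge` can replace the default `.x` and `.y` with more descriptive labels.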
