0

I like to merge two datasets (Generaldata, monthlyresults) but at the same time there schould be another condition fulfilled (Startdate=colnames_of_monthlyresults).

Example (please see codes at the end of the question):

Generaldata columns: Code; Startdate (e.g.201601)

Code: 1,2,3
Startdate: 201601, 201511, 201512

Monthlyresults columns: Code; Result201201 (e.g 5%) ; Result201202; ... ; Result2011604

Code: 1,2,3
Result201511: 0, 20 , 0 
Result201512: 6, 30, 5
Result201601: 8, 40, 10

Desired result: Code; Result_at_startdate

Code:  1,2,3
Result_startdate: 8, 20, 5

I tried to do this in two steps: First I merged both dataset with the following Code:

Testresults<- merge(x=Generaldata, y=Monthlyresults, by.y = "Code", by.x= "Code", 
                    all.x = TRUE, incomparables = NA)

As a result I received a dataframe with too much columns (results for every month over 4 years), but I am only interested in the value with the result of every Person at the startdate (and enddate later). How could I achieve this?

I tried it to include both conditions in the by= element of the merge function, but the problem is that I need to refer only to the columnames and not the rowentries in the result dataset. I also tried it with a ifelse function, but then I have the problem that the format is not identical. The entry of startdate column ("201601") is not equal to the colname ("R201601", because a columnname cannot start with a number). In Excel I would probably combine the if and vlookup function to achieve it but my dataset is to large to work with Excel.

I'm an R beginner and would be very happy if someone can help me.

Example R Codes:

Generaldata<- data.frame("Code"=c(1,2,3), "Startdate"= c(201511, 201512, 201601))

Monthlyresults<- data.frame ("Code"=c(1,2,3), "R201511"=c(0,20,0), "R201512"=c(6,30,5), "R201601"=c(8,40,10) )

Testresult <-merge(x=Generaldata, y=Monthlyresults, by.y = "Code", by.x= "Code", all.x = TRUE, incomparables = NA)
#Testresult have all columns of Monthlyresults, but I like to get only the result of every person at the startdate

Desired_result<- data.frame ("Code"=c(1,2,3), "Result_startdate"= c(8,20,5))
show(Desired_result)
Hugo
  • 25
  • 1
  • 1
  • 5
  • 2
    Please show some reproducible example and expected output – akrun Jun 06 '16 at 13:35
  • @akrun: Thanks for the hint. I add an example with numbers. Person 1 started at 2016-01 with a value of 8. – Hugo Jun 06 '16 at 13:47
  • 2
    Please have a look [at this link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as to how to create a reproducible example. – Sotos Jun 06 '16 at 13:51
  • 1
    @Sotos: Thanks for providing the helpful link (I'm new on stackoverflow). I add R codes for a reproducible example at the end of my question. – Hugo Jun 06 '16 at 14:29

1 Answers1

0

Here is a dplyr/tidyr answer

library(dplyr)
library(tidyr)
library(stringi)

Monthlyresults %>%
  gather(Startdate, value, -Code) %>%
  mutate(Startdate = 
           Startdate %>%
           stri_sub(2) %>%
           as.numeric) %>%
  right_join(Generaldata)
bramtayl
  • 4,004
  • 2
  • 11
  • 18
  • Great! It works fine for my testdata. Could you please explain what does the "stri_sub(2)" ? I did not really understand how you refer to the results columns and therefore could not apply it to my real data. Thank you very much for your answer – Hugo Jun 07 '16 at 09:07
  • I tried to apply your code to my real data, but I only get NAs. Do you have an idea why? Joining by: c("Code", "Startdate") Warning message: In function_list[[k]](value) : NAs introduced by coercion – Hugo Jun 07 '16 at 11:06
  • Sorry, stri_sub is a function from stringi. I added it as a library. It will subset the string starting with the second character. In this case, it will remove the R's – bramtayl Jun 07 '16 at 16:36
  • I'm wondering if the NA's are introducted when trying to convert something non-numeric to numeric. I converted to numeric because the startdates in your generaldata were numeric. – bramtayl Jun 07 '16 at 16:37
  • Thanks for your support :) I solved the NA issue and everything works fine now. The column names of my real data were not in an equal format and after adjusting that I had no NA values. Sorry that I did not discovered it earlier. I am very greatful for your help. – Hugo Jun 07 '16 at 23:56