0

I have two dataframes:

info
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T

And

students
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P

I want to do a JOIN and get all the names from info and find its relevant metadata from students. But only pick the one with the highest age (when Fname and LName are equal). My output should look like:

Final
Fname Lname Age Height  Subject Result
Henry     H  12     15 Math;Sci      P
 Rose     R  15     18 Math;Sci      P
Jacob     T  11     15 Math;Sci      P
 John     O  13     16 Math;Sci      F
 Fred     Y  12     16 Math;Sci      P
Simon     S  12     10 Eng;Math      P
  Gay     T  12     11 Math;Sci      F

I have tried sqldf but with no luck yet. I'm just not able to get the identifiers correctly. Is there any other way I can get my output?

CuriousBeing
  • 1,592
  • 14
  • 34
  • Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – WoodChopper Oct 19 '15 at 06:48

3 Answers3

2

Here is a perhaps less than elegant way, using base R.

Now, merge the frames on the names (although there is little point in doing so here in this example; it is really just a list of names already in the students frame).

merged_df <- merge(students,info,by=c("Fname","Lname"))

Finally, aggregate, here just on names. You could add any categorical or factor variables.

merged_df_max <-aggregate(
                merged_df[c('Age','Height')], 
                by=list(Fname = merged_df$Fname,
                        Lname = merged_df$Lname), 
                FUN=max, na.rm=TRUE)

## add back details to the merged df
result <- merge(merged_df_max,students,by=c("Fname","Lname","Age","Height"))

To create data.frame from file,

## load data
lines <-"
Fname,Lname,Age,Height,Subject,Result
Henry,H,12,15,Math;Sci,P
Rose,R,11,18,Math;Sci,P
Jacob,T,11,15,Math;Sci,P
Henry,H,11,14,Math;Sci,P
John,O,12,13,Math;Sci,P
John,O,13,16,Math;Sci,F
Fred,Y,11,16,Sci,P
Simon,S,12,10,Eng;Math,P
Gay,T,12,11,Math;Sci,F
Rose,R,15,18,Math;Sci,P
Fred,Y,12,16,Math;Sci,P
"

lines2 <-"
Fname,Lname
Henry,H
Rose,R
Jacob,T
John,O
Fred,Y
Simon,S
Gay,T
"

con <- textConnection(lines)
students <- read.csv(con,sep=',')
con2 <- textConnection(lines2)
info <- read.csv(con2,sep=',')
close(con)
close(con2)
WoodChopper
  • 4,265
  • 6
  • 31
  • 55
ako
  • 3,569
  • 4
  • 27
  • 38
1

Using dplyr:

library(dplyr)

info %>% left_join(students) %>%
    group_by(Fname, Lname) %>%
    filter(Age == max(Age))
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
1

Try this:

library(sqldf)
sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from info left join students using (Fname, Lname)
       group by Fname, Lname")

We used a left join in case there are students in info with no data in students. In the question the students in info and students are the same so we could have omitted the word left in the query and still have gotten the same result. Also note that because the exact same set of students appears in both info and students we would not need to use info at all. This is the same as the last query except for the from line but gives the same answer with the data provided:

sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from students
       group by Fname, Lname")

Note: For reproducibility the following constructs the info and student data frames. Please provide this yourself in the future when asking questions on SO.

Lines_info <- "
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T
"
Lines_students <- "
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P
"

info <- read.table(text = Lines_info, header = TRUE)
students <- read.table(text = Lines_students, header = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341