0

I have two data sets. One has duplicate location names, another one has location name and its lat long. I want to add lat long to first data set where the location name matches in second data set.

df1

 Code   Station Name
 BNC    BANGALORE CANT 
 BNC    BANGALORE CANT 
 BNC    BANGALORE CANT    
 BBS    BHUBANESWAR    
 BBS    BHUBANESWAR    

df2

Station Name     lat       lon
BANGALORE CANT  12.993690  77.59814
BHUBANESWAR     20.296059  85.82454

expected result df1

 Code   Station Name        lat        lon
 BNC    BANGALORE CANT     12.993690  77.59814
 BNC    BANGALORE CANT     12.993690  77.59814
 BNC    BANGALORE CANT     12.993690  77.59814  
 BBS    BHUBANESWAR        20.296059  85.82454
 BBS    BHUBANESWAR        20.296059  85.82454

It is obvious that object length differs. And I need the df1 duplicate entry. cbind won't work as length differs. same for ifelse and if with sapply. Getting lat long for df1 using geocode fails as the list is long (query max exceeded). Any thoughts?

Sanju
  • 5
  • 1
  • 1
  • 5
  • 1
    You can just merge on station name. Something like `df3 <- merge(df1, df2, by="station name")`. – lmo Jul 06 '17 at 15:39

2 Answers2

0

It looks like a SQL problem. So in R the library dplyr is the best choice. Here a working example, the name of the column has to be equal for the join

library(dplyr)
iris
iris2<-data.frame("Species"=unique(iris$Species),"n"=c(1,2,3))
iris %>% inner_join(iris2)

PS you can share your db use dput(dataframename) command, It is really easier to help with relevant examples

Federico Manigrasso
  • 1,130
  • 1
  • 7
  • 11
0

You can use dplyr::left_join for exactly this kind of problem. dplyr has a useful vignette that covers joins or you can check out this general backgrounder on different kinds of joins.

# Load dplyr and other useful packages
library(tidyverse)

# Generate demo dataset
df1 <-
  tibble(
    Code = c("BNC", "BNC", "BNC", "BBS", "BBS"),
    "Station Name" = c(
      "BANGALORE CANT",
      "BANGALORE CANT",
      "BANGALORE CANT",
      "BHUBANESWAR",
      "BHUBANESWAR"
    )
  )

df2 <- tibble("Station Name" = c("BANGALORE CANT", "BHUBANESWAR"),
              lat = c(12.993690, 20.296059),
              lon = c(77.59814, 85.82454))

# Join df1 and df2
left_join(df1, df2)

This assumes that both datasets share a column with exactly the same name. You can use the by= argument if you want to exercise more control over how the two datasets are combined with each other.

Andrew Brēza
  • 7,705
  • 3
  • 34
  • 40
  • 1
    it works with by. I need to read about dplyr more. thank you – Sanju Jul 06 '17 at 15:01
  • You're welcome! Learning about `dplyr` is one of the best ways to improve your data science skills. – Andrew Brēza Jul 06 '17 at 15:29
  • 2
    @Sanju There is absolutely no need "*to read about dplyr*" in this case. This is a very very basic operation that R can easily perform without loading any dependencies (especially such a huge dependency such as tidyverse which loads tons of stuff that you probably won'e need. See the duplicate target of your question was closed with. In R, you will simply do `merge(df1, df2)` or `merge(df1, df2, all.x = TRUE)` (depending on the type of join you need ( see `?merge`) and you all set. – David Arenburg Jul 07 '17 at 03:57
  • @DavidArenburg You're right that `merge` can easily handle this specific situation. I suggest `dplyr` to newer R users because its tools are more consistent than base R and in my opinion easier to learn. – Andrew Brēza Jul 07 '17 at 11:58
  • "*More consistent*"? In what sense? That they change completely in each build? If something you can tell about dplyr for sure is that it is inconsistent with even it's own previous build. – David Arenburg Jul 07 '17 at 13:45
  • To each his own, one of the best and worst things about R is that there are a thousand ways to do everything. I make my interns learn `dplyr` because I think it is the simplest to understand, both for beginning coders and for people who have to read their work. Imagine the OP wanted to see stations that appeared at least three times arranged from most common to least. Base R can do that easily but the `dplyr` version is easier to learn and debug: http://www.r-fiddle.org/#/fiddle?id=V5kMVTkQ – Andrew Brēza Jul 07 '17 at 18:30