0

I think the solution to my problem is quite basic, but given that I am an R starter, I still find it difficult to correctly combine the different steps (in Excel, I could easily do all this in seconds). However, let's look at two simplified datasets:

#Set up df1
a <- c(1000, 1000, 1001, 1001)
b <- c(9000, 9001, 9010, 9011)
c <- c("Apple", "Apple", "BP", "BP")
df1 <- data.frame(a,b,c)
names(df1) <- c("FirmNo", "StockNo", "FirmName")

and

#Set up df2
one <- c(9000, 9000, 9001, 9001, 9010, 9010, 9011, 9011, 9011)
two <- c(5, 15, 100, 200, 20, 80, NA, 850, 800)
df2 <- data.frame(one,two)
names(df2) <- c("StockNo", "Volume")

Which result into

  FirmNo StockNo FirmName
1   1000    9000    Apple
2   1000    9001    Apple
3   1001    9010       BP
4   1001    9011       BP

and

  StockNo Volume
1    9000      5
2    9000     15
3    9001    100
4    9001    200
5    9010     20
6    9010     80
7    9011     NA
8    9011    850
9    9011    800

respectively. What I would like to achieve is the following: Enrich df1 with the average volume based on the volume figures in df2, so that the result looks like this:

  FirmNo StockNo FirmName AvgVol
1   1000    9000    Apple     10
2   1000    9001    Apple    150
3   1001    9010       BP     50
4   1001    9011       BP    825

Please mind that the actual tables are much longer, but the structure is essentially the same. I could assume that a few lines - maybe even with dplyr - could easily do the job. I already found the following solution, but think that there is something much more elegant:

df2 <- df2 %>%
  group_by(StockNo) %>%
  mutate(AvgVol = mean(Volume, na.rm=TRUE)) #Add AvgVol column with respective data

df2 <- distinct(df2, StockNo, avgVol) #Reduce dataset to individual lines

df1 <- merge(df1, df2, by="StockNo") #Merge df1 and df2

Thanks a lot for your help in advance and kindly come back to me in case I did not explain myself well.

chuesker
  • 25
  • 2

2 Answers2

2

I would suggest next code using dplyr. You can avoid mutate() and duplicated rows using summarise():

library(dplyr)
#Data
#Set up df1
a <- c(1000, 1000, 1001, 1001)
b <- c(9000, 9001, 9010, 9011)
c <- c("Apple", "Apple", "BP", "BP")
df1 <- data.frame(a,b,c)
names(df1) <- c("FirmNo", "StockNo", "FirmName")
#Set up df2
one <- c(9000, 9000, 9001, 9001, 9010, 9010, 9011, 9011, 9011)
two <- c(5, 15, 100, 200, 20, 80, NA, 850, 800)
df2 <- data.frame(one,two)
names(df2) <- c("StockNo", "Volume")
#Compute
dfnew <- df1 %>% left_join(df2 %>%
                    group_by(StockNo) %>%
                    summarise(AvgVol = mean(Volume, na.rm=TRUE)))

Output:

  FirmNo StockNo FirmName AvgVol
1   1000    9000    Apple     10
2   1000    9001    Apple    150
3   1001    9010       BP     50
4   1001    9011       BP    825
Duck
  • 39,058
  • 13
  • 42
  • 84
0

Same summarise-by-group then join logic, in data.table

library(data.table)
setDT(df1)
setDT(df2)

df1[df2[, .(AvgVol = mean(Volume, na.rm = TRUE)), by = StockNo], 
    on = .(StockNo)]
#    FirmNo StockNo FirmName AvgVol
# 1:   1000    9000    Apple     10
# 2:   1000    9001    Apple    150
# 3:   1001    9010       BP     50
# 4:   1001    9011       BP    825
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38