
I'm a trying-to-be R user. I never learned to code properly and have just been doing it by finding stuff online.

I've run into a problem that I need some of you experts' help with.

I have two data files.

  1. Particulate matter (PM) concentrations (~20000 observations)
  2. Coefficient combinations to use with the particulate matter concentrations to calculate final concentrations.

For example:

Data set 1.

ID      PM 
 1       5 
 2      10 
...    ... 
1500    25 

Data set 2.

alpha    beta 
    5       6 
    1       2 
  ...     ... 

I ultimately have to use all the coefficient combinations (alpha and beta) for each of the IDs from data set 1. For example, if I have 10 observations in data set 1, and 10 coefficient combinations in data set 2, my output table should have 100 different output values (10*10=100).

for (i in cmaq$FID) {
  mean=cmaq$PM*IER$alpha*IER$beta 
}

I used the above code to do what I'm trying to do, but it only gave me 10 output values rather than 100. I think using the split function first and then somehow combining that with the second dataset would work, but I have not figured out how...

It may be a very, very simple problem, but after spending hours trying to figure it out, I thought it might be a better strategy to get some help from R experts.

Thank you in advance!!!

Sharon

3 Answers


Look for the term 'cross join' or 'cartesian join' (e.g., How to do cross join in R?).

If that doesn't address the issue, please see https://stackoverflow.com/help/mcve. I think there is a mistake inside the loop: beta is free-floating and not connected to the IER data.frame.
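
For illustration, here is a minimal cross-join sketch in base R (the cmaq and IER names are taken from the loop in the question; treat this as a sketch, not a tested solution):

# Cross (cartesian) join: pair every PM row with every coefficient row
crossed <- merge(cmaq, IER, by = NULL)
crossed$result <- crossed$PM * crossed$alpha * crossed$beta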

wibeasley
  • Thanks for the response! Cross join seems to work... I was just wondering if there is a way to do this without creating a gigantic dataset, because I will end up with over 1.5 million observations... But cross joining can definitely solve my problem! – Sharon Jul 05 '17 at 04:27
  • If you need to cross and retain all possibilities (without summarizing/aggregating), I think it's unavoidable. And 1 or 2 million rows shouldn't be a problem: `object.size(tidyr::crossing(a=seq_len(1e5), b=seq_len(20)))` is only 1.6 MB. – wibeasley Jul 05 '17 at 04:36

You can do:

df1 = data.frame(
    ID = c(1, 2, 1500),
    PM = c(5, 10, 25)
)
df2 = data.frame(
    alpha = c(5, 6),
    beta = c(1, 2)
)

library(tidyverse)
library(dplyr)


df1 %>%
    group_by(ID) %>%  # one group per ID
    do(data.frame(result = .$PM * df2$alpha * df2$beta,  # this ID's PM times each coefficient pair
                  alpha = df2$alpha,
                  beta = df2$beta))
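
For comparison, an explicit cross-join version (my own sketch, not part of the original answer) that builds every ID/coefficient combination first and then computes the result; it assumes the tidyverse packages loaded above:

# Cross every row of df1 with every row of df2, then compute the result
tidyr::crossing(df1, df2) %>%
    mutate(result = PM * alpha * beta)

With the sample data this gives six rows, e.g. results 25 and 60 for ID 1.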
Marius
  • .$q2d worked, but so did df1$q2d. Any specific reason to use .$q2d over df1$q2d? – Sharon Jul 05 '17 at 06:11
  • `.$` refers to the current group in the `group_by`, so just the rows of `df1` with the current ID. `df1$` refers to the whole dataframe, so you might be somehow repeating the same calculation on the whole dataset over and over for each ID. – Marius Jul 05 '17 at 06:16
  • Tidyverse should also load dplyr, so I don't think there is any point calling that library separately. – JAD Jul 05 '17 at 06:38
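
To illustrate the `.$` versus `df1$` distinction from the comments above, a small sketch (my own, using the df1 from this answer, where each ID has a single row):

# Inside do(), `.` is only the current group's rows; df1 is the whole data frame
df1 %>%
    group_by(ID) %>%
    do(data.frame(rows_in_group = nrow(.),     # 1 for each ID here
                  rows_in_whole = nrow(df1)))  # always 3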

We can do this with `outer`:

# outer() builds a PM-by-(alpha*beta) matrix; t() plus c() flattens it row by row,
# so the results line up with each ID repeated nrow(df2) times
data.frame(ID = rep(df1$ID, each = nrow(df2)), alpha = df2$alpha, 
     beta = df2$beta, result = c(t(outer(df1$PM, df2$alpha*df2$beta))))
#    ID alpha beta result
#1    1     5    1     25
#2    1     6    2     60
#3    2     5    1     50
#4    2     6    2    120
#5 1500     5    1    125
#6 1500     6    2    300

data

df1 <- structure(list(ID = c(1, 2, 1500), PM = c(5, 10, 25)), .Names = c("ID", 
"PM"), row.names = c(NA, -3L), class = "data.frame")

df2 <- structure(list(alpha = c(5, 6), beta = c(1, 2)), .Names = c("alpha", 
 "beta"), row.names = c(NA, -2L), class = "data.frame")
akrun