
I have a dataframe df, and its first row row1:

df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
> df
  x y z
1 1 0 1
2 1 0 0
3 1 0 0
4 1 0 0
5 0 1 0
6 0 1 1
7 1 0 1
> row1 <- df[1,]
> row1
  x y z
1 1 0 1

For row 1, I want to extract the mean of column z within the rows that have the same values of the other columns (x and y) as row1, and store this value in a new column called prob_z.

In this example, there are 5 rows with x = 1 and y = 0; of these, 2 rows have z = 1 and 3 rows have z = 0, so I would affix a column prob_z to row 1 with the value 2/(2+3) = 0.4. I would similarly compute prob_x, prob_y and prob_z for each row and corresponding column, and store them as columns in df. After all these computations, by each row and each column, I want to end up with:

  x y z prob_x prob_y  prob_z
1 1 0 1  1.0     0.0     0.4
2 1 0 0  1.0     0.0     0.4  
3 1 0 0  1.0     0.0     0.4
4 1 0 0  1.0     0.0     0.4 
5 0 1 0  0.0     1.0     0.5
6 0 1 1  0.0     1.0     0.5
7 1 0 1  1.0     0.0     0.4
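
For concreteness, here is the row-1 computation for prob_z done by hand in base R (just to illustrate the logic for one cell, not the general solution I am after):

# rows whose x and y agree with row1
idx <- df$x == row1$x & df$y == row1$y
mean(df$z[idx])
# [1] 0.4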

I also need to be able to reference the columns by name, as in -"z", since in my real example I have a large number of columns.

I have tried different approaches using base R and dplyr, but am having a hard time getting the result as expected.

Anand
  • not 100% sure what you are asking when the prob_x, prob_y issue gets involved but we could do `df <- df %>% group_by(x) %>% mutate(prob_z = sum(z)/n())` – Nate Nov 15 '16 at 13:54
  • @NathanDay I need to specify something like `group_by(-"z")` - since there are many columns, and I want to leave out each column and group by the others. If I do this, I get `Error: invalid argument to unary operator`. – Anand Nov 15 '16 at 14:02
  • would you be able to type out your ideal output and edit it into your question? – Nate Nov 15 '16 at 14:06
  • Please suggest sample output to clarify – Megatron Nov 15 '16 at 14:07
  • Just added final output with prob_ columns (let me know if the logic for computing probs seems wrong or unclear) – Anand Nov 15 '16 at 14:08
  • By the way, this is an attempt at a recommender algorithm inspired by knn - instead of computing k nearest neighbors, we look, for each item, at all otherwise identical records, and compute the probability of that item being 1. Using this, we can then look at the probabilities of the items in each row, and use it to determine order of recommendation. – Anand Nov 15 '16 at 16:04

3 Answers


Alright, I think I have it. This was a fun set of challenges, though a fair bit more challenging than I expected. I believe it is robust to any number of columns and any type of column name, but I could certainly be wrong.

First, I define a function that checks all columns except the column of interest to see if they match row1. Note that it refers to df and row1 explicitly instead of taking them as arguments. Probably not ideal, but it works for this case:

myFunction <-
  function(thisCol){
    # for each row of df, check whether all columns *except* thisCol
    # equal the corresponding values in row1
    apply(select_(df, paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(row1, paste0("-`", thisCol,"`")))
            })
  }
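
For example, calling it on "z" flags the rows whose remaining columns (x and y) match row1:

myFunction("z")
# TRUE for rows 1-4 and 7 (x = 1, y = 0, as in row1); FALSE for rows 5 and 6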

Then, I used interp from lazyeval to generate a column telling whether there is a match. Recall that a "match" means all the other columns match row one, regardless of whether this one does.

library(lazyeval)

forMatchID <-
  lapply(names(df), function(thisColName){
    interp(~myFunction(colName), colName = thisColName)
  }) %>%
  setNames(paste("Match", names(df)))

Which returns:

$`Match x`
~myFunction("x")
<environment: 0x110feb20>

$`Match y`
~myFunction("y")
<environment: 0x11103da8>

$`Match z`
~myFunction("z")
<environment: 0x111080c8>

Then, I used that to generate a data.frame recording, for each column, whether the row matches row 1 on all the other columns:

dfWithMatchCols <-
  df %>%
  mutate_(.dots = forMatchID)

Which returns:

  x y z Match x Match y Match z
1 1 0 1    TRUE    TRUE    TRUE
2 1 0 0   FALSE   FALSE    TRUE
3 1 0 0   FALSE   FALSE    TRUE
4 1 0 0   FALSE   FALSE    TRUE
5 0 1 0   FALSE   FALSE   FALSE
6 0 1 1   FALSE   FALSE   FALSE
7 1 0 1    TRUE    TRUE    TRUE

Then, I generate the set of new columns to compute: for each column, the proportion of 1s among the rows that match row one (on the other columns) -- or, for rows that don't match, the proportion among the non-matching rows:

forProb <-
  paste0("ifelse(`Match ", names(df), "`"
         , ", mean(`", names(df), "`[`Match ", names(df), "`])"
         , ", mean(`", names(df), "`[!`Match ", names(df), "`]) )") %>%
  setNames(paste0("prob_", names(df)))

Which returns

                                                           prob_x 
"ifelse(`Match x`, mean(`x`[`Match x`]), mean(`x`[!`Match x`]) )" 
                                                           prob_y 
"ifelse(`Match y`, mean(`y`[`Match y`]), mean(`y`[!`Match y`]) )" 
                                                           prob_z 
"ifelse(`Match z`, mean(`z`[`Match z`]), mean(`z`[!`Match z`]) )" 

Finally, I pass this into mutate_ and remove the "Match" columns (note: this could accidentally remove pre-existing columns whose names start with "Match", but such names would likely also have caused collisions above):

dfWithProb <-
  dfWithMatchCols %>%
  mutate_(.dots = forProb) %>%
  select(-starts_with("Match"))

Returns:

  x y z prob_x prob_y prob_z
1 1 0 1    1.0    0.0    0.4
2 1 0 0    0.6    0.4    0.4
3 1 0 0    0.6    0.4    0.4
4 1 0 0    0.6    0.4    0.4
5 0 1 0    0.6    0.4    0.5
6 0 1 1    0.6    0.4    0.5
7 1 0 1    1.0    0.0    0.4

In practice, you would probably not generate the intermediate step, instead running it like this:

df %>%
  mutate_(.dots = forMatchID) %>%
  mutate_(.dots = forProb) %>%
  select(-starts_with("Match"))
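
(Aside: select_, mutate_, and summarise_ come from 2016-era dplyr and have since been deprecated. For comparison, a rough sketch of the same computation in plain base R -- my own paraphrase, not part of the approach above:)

# For each column: flag the rows matching row1 on all other columns, then
# take the column mean within the matching (or non-matching) rows.
probs <- sapply(names(df), function(col) {
  others  <- setdiff(names(df), col)
  matches <- apply(df[others], 1, function(r) all(r == unlist(row1[others])))
  ifelse(matches, mean(df[[col]][matches]), mean(df[[col]][!matches]))
})
colnames(probs) <- paste0("prob_", names(df))
cbind(df, probs)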

Now, if I misunderstood your "matching" argument, a simple change to the myFunction definition should then propagate to all other steps in the process.

Summarise for a single value

Based on a comment, it appears that you actually want just a single output, predicting the probability only for the rows that match row 1 on all other elements. I think you may be better served by using summarise_, like so:

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))

df %>%
  mutate_(.dots = forMatchID) %>%
  summarise_(.dots = forSingProb)

Which returns:

  prob_x prob_y prob_z
1      1      0    0.4

Getting values for all rows

Alright, following several updates in the comments, I think this should work. I am using the summarise_ approach from above, looping through each row separately with lapply and removing the row under investigation from the analysis (its observed value should not play a role in its own prediction), then binding everything together with bind_rows and joining back to the original data:

myFunction_updated <-
  function(thisCol, rowIndex){
    # as myFunction, but drops the row under investigation before matching
    apply(select_(df[-rowIndex, ], paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
          })
  }

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))

lapply(1:nrow(df), function(thisRowIndex){
  forMatchID <-
    lapply(names(df), function(thisColName){
      interp(~myFunction_updated(colName, rowIndex)
             , colName = thisColName
             , rowIndex = thisRowIndex)
    }) %>%
    setNames(paste("Match", names(df)))

  df[-thisRowIndex, ] %>%
    mutate_(.dots = forMatchID) %>%
    summarise_(.dots = forSingProb)
}) %>%
  bind_rows(.id = "rowIndex") %>%
  left_join(mutate(df, rowIndex = as.character(1:n()))
            , .)

returns:

  x y z rowIndex prob_x prob_y prob_z
1 1 0 1        1      1      0   0.25
2 1 0 0        2      1      0   0.50
3 1 0 0        3      1      0   0.50
4 1 0 0        4      1      0   0.50
5 0 1 0        5    NaN    NaN   1.00
6 0 1 1        6    NaN    NaN   0.00
7 1 0 1        7      1      0   0.25

Note that the NaN values are correct: once rows 5 and 6 are excluded, no remaining rows match them. Likewise, prob_z for row 1 is now 0.25 because, with row 1 held out, the rows matching it on x and y are 2, 3, 4, and 7, whose z values are 0, 0, 0, and 1.

If you insist on including the observed value in your prediction (which I want to stress is almost certainly a bad idea), you can adjust it like so:

myFunction_updated <-
  function(thisCol, rowIndex){
    # as above, but keeps the row under investigation in the data
    apply(select_(df, paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
          })
  }

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))


lapply(1:nrow(df), function(thisRowIndex){
  forMatchID <-
    lapply(names(df), function(thisColName){
      interp(~myFunction_updated(colName, rowIndex)
             , colName = thisColName
             , rowIndex = thisRowIndex)
    }) %>%
    setNames(paste("Match", names(df)))

  df %>%
    mutate_(.dots = forMatchID) %>%
    summarise_(.dots = forSingProb)
}) %>%
  bind_rows(.id = "rowIndex") %>%
  left_join(mutate(df, rowIndex = as.character(1:n()))
            , .)

Which gives:

  x y z rowIndex prob_x prob_y prob_z
1 1 0 1        1      1      0    0.4
2 1 0 0        2      1      0    0.4
3 1 0 0        3      1      0    0.4
4 1 0 0        4      1      0    0.4
5 0 1 0        5      0      1    0.5
6 0 1 1        6      0      1    0.5
7 1 0 1        7      1      0    0.4

For use in production

Why can't I leave this alone? Anyway, here is an extension to a logical next step. My guess is that this will be used to predict missing variables from a given set. Here is some example data, as might be generated in your actual application: for each record, it samples 1 or 2 of the variables and randomly sets each to 0 or 1.

productionData <-
  lapply(1:10, function(idx){
    nToSample <- sample(1:2, 1)

    sample(c(0,1), nToSample, replace = TRUE) %>%
      setNames(sample(c("x","y","z"), nToSample))
  })

Then, we can loop through each record, filter the known data to the rows that match on all observed points, and calculate a probability for each of the remaining variables. The bind_rows and select at the end are just for pretty printing; depending on usage, they may or may not be useful. In particular, if any variable is either always or never present, this step will fail (because there are no prob_* values, or no entries, for that variable's column).

lapply(productionData, function(thisRowIn){
  filtering <-
    lapply(names(thisRowIn), function(thisCol){
      paste0("`", thisCol, "` == ", thisRowIn[thisCol])
    })

  whichMissing <-
    names(df)[!(names(df) %in% names(thisRowIn))]

  df %>%
    filter_(.dots = filtering) %>%
    summarise_at(whichMissing, mean) %>%
    setNames(paste0("prob_", names(.))) %>%
    mutate_(.dots = as.list(thisRowIn)) 
}) %>%
  bind_rows() %>%
  select_(.dots = c(names(df), paste0("prob_", names(df))))

returns:

    x  y  z prob_x prob_y prob_z
1  NA  1 NA   0.00     NA    0.5
2  NA  0 NA   1.00     NA    0.4
3   0  1 NA     NA     NA    0.5
4  NA  1  1   0.00     NA     NA
5   1 NA  1     NA   0.00     NA
6   1  0 NA     NA     NA    0.4
7  NA NA  0   0.75   0.25     NA
8   1  0 NA     NA     NA    0.4
9  NA  0 NA   1.00     NA    0.4
10  1 NA  1     NA   0.00     NA

With NA where values are missing and where predictions are not needed.
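
One possible guard against that failure mode (a sketch of a variation; safeSelect is a hypothetical helper, not part of the pipeline above): select only the desired columns that actually exist after bind_rows:

# Defensive final step: keep only the columns that are present, so a
# variable that is always (or never) present does not break the select
safeSelect <- function(results, df) {
  wanted <- c(names(df), paste0("prob_", names(df)))
  select_(results, .dots = intersect(wanted, names(results)))
}
# usage: replace the final select_ with `%>% safeSelect(df)`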

Mark Peterson
  • thanks for the answer - it gives wrong values for the probs. For example, in row 2, pivoting on x, we see that y and z are 0. This happens in rows 2, 3 and 4. In all those rows, x is 1, so we would expect prob_x in row 2, 3, 4 to be 1.0 (from (1 + 1 + 1)/3). Your solution sets the value as 0.6. – Anand Nov 15 '16 at 15:59
  • You are omitting rows 5 and 6 (which have `y == 1` and have `x == 0`). Which group do you think rows 5 and 6 should be included in? Am I misunderstanding? – Mark Peterson Nov 15 '16 at 16:34
  • While considering row 2, pivoting on x, we are only interested in the ratio within the rows where the y-z signature is the same as in the current row (row 2), which is 0-0. Rows 5 and 6 have a different signature for y-z (1-0 and 1-1 respectively) and hence do not factor into the computation of prob_x for row 2. Does that clarify? – Anand Nov 15 '16 at 16:38
  • Your question did not ask to pivot separately on each row -- it asked to match (or not) on the first row. See my most recent edit for a version that gives just a simple summary for that row. I will see if I can generate an output that gives those results for each row. – Mark Peterson Nov 15 '16 at 16:41
  • Sorry, I think you missed my immediate update to the question, showing the desired output. From my question: "For row 1, I want to extract the mean of column z, within the rows that have the same value of other columns (x and y) as in row1. I want to store this value in a new column called prob_z." and (in the update right after, showing output, which you might have missed, sorry.): "So, after all these computations by each row and each column, I want to end up with: ..." – Anand Nov 15 '16 at 16:44
  • Also, see my own answer to the question - I don't quite understand how to combine lazyeval::interp with mutate_ and setNames to simplify that into a loop, am trying to understand that part of your answer to integrate it. – Anand Nov 15 '16 at 16:50
  • See update that wraps the `summarise_` in an `lapply` loop. – Mark Peterson Nov 15 '16 at 16:52
  • Thanks, there should be no NaNs since there is at least one row (the row itself) which has the pattern. The computation for each row/column is done including the current row being considered. I suspect the other values are also different because you are considering the ratio excluding the current row. – Anand Nov 15 '16 at 16:53
  • See update, though I stress that if you are using this for any sort of model building that sort of data leakage is going to make you over confident. – Mark Peterson Nov 15 '16 at 16:58
  • Thanks, this final update works. I see your point about excluding the current row - will try both ways, and compare. – Anand Nov 16 '16 at 01:55

Would it be acceptable if we did this?

df %>% group_by(x) %>% mutate(prob_x = sum(x) / n(),
                                prob_y = sum(y) / n(),
                                prob_z = sum(z) / n())
Source: local data frame [7 x 6]
Groups: x [2]

      x     y     z prob_x prob_y prob_z
  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     0     1      1      0    0.4
2     1     0     0      1      0    0.4
3     1     0     0      1      0    0.4
4     1     0     0      1      0    0.4
5     0     1     0      0      1    0.5
6     0     1     1      0      1    0.5
7     1     0     1      1      0    0.4
Nate
  • This does work for the given data. But I don't think it will generalize to arbitrary data. Specifically, your approach picks x arbitrarily to group by, and if we pick y instead, we still get the same results. But if we group by z, the results are completely different. This seems to indicate that the solution would not generalize. – Anand Nov 15 '16 at 15:55

This solution combines Nathan Day's approach, Mark Peterson's use of mutate_, and this SO question; it should generalize well.

df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
cols <- names(df)  # capture the original names before prob_ columns are added
for(i in seq_along(cols)) {
    dots <- paste0("mean(`", cols[i], "`)")
    df <- df %>% 
        group_by_(.dots = lapply(cols[-i], as.symbol)) %>% 
        mutate_(.dots = setNames(dots, paste0("prob_", cols[i]))) %>% 
        ungroup()
}
df
      x     y     z prob_x prob_y prob_z
  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     0     1      1      0    0.4
2     1     0     0      1      0    0.4
3     1     0     0      1      0    0.4
4     1     0     0      1      0    0.4
5     0     1     0      0      1    0.5
6     0     1     1      0      1    0.5
7     1     0     1      1      0    0.4
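
(For readers on current dplyr, where the underscore verbs above are deprecated, the same loop can be written with tidy evaluation; a sketch assuming dplyr >= 1.0:)

library(dplyr)

cols <- c("x", "y", "z")  # the original data columns
for (col in cols) {
  others <- setdiff(cols, col)
  df <- df %>%
    group_by(across(all_of(others))) %>%            # group by every other column
    mutate("prob_{col}" := mean(.data[[col]])) %>%  # group mean of this column
    ungroup()
}
df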
Anand