Is it possible to filter data based on its plot/predicted curve?

Question

I have a question about excluding/filtering data points. I have coded a logistic regression that generates a decision boundary, wrapped in a function that I can run over subsets of my data frame.

I was wondering, if I were to plot all of the predicted curves based on these outcomes, whether it is possible to filter these decision boundaries further based on their generated plot/curve, or to set requirements that a curve must meet to “qualify” and then track the corresponding data in the data frame.

## glm that generates a midpoint/decision boundary, wrapped into a function

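get_midpoint = function(data){
  # fit a logistic regression of response on stimulus for this subset
  glm.1 = glm(coderesponse ~ stimulus, family = binomial(link = "logit"), data = data, na.action = na.exclude)
  # midpoint: the stimulus value where the predicted probability is 0.5
  rtn = -glm.1$coefficients[1] / glm.1$coefficients[2]
  rtn
}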

## a mini dummy dataframe 

subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
stimulus = c(1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37, 1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37)
stim <- c('bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm')
block <- c('mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose')
coderesponse <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0)

df = data.frame(subject, stimulus, stim, block, coderesponse)

## running the function over defined subgroups of ~80 rows each [for the real data]
## but for the dummy dataframe, only ~5 rows

library(dplyr)
library(tidyr)
library(purrr)

df = df %>%
  nest(data = -c(subject, stim, block)) %>%
  mutate(midpoint = map_dbl(data, get_midpoint)) %>%
  unnest(cols = data)

## basic code that plots and creates a curve based on a single glm result
## QUESTION: want to be able to run this over the same subgroups as above to create curves for every midpoint generated and then possibly filter based on the curve?
plot(df$stimulus,df$coderesponse,xlab="stimulus",ylab="Probability of d responses")
curve(predict(glm.1,data.frame(stimulus=x),type="response"),add=TRUE)

I’m quite new and confused with this part of R, so thanks for any help or insight!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It's a bit unclear what you are describing. — MrFlick, Dec 01 '20 at 18:43
When running this, I get a `object 'glm.1' not found` error in the last line. I see the model embedded in the `get_midpoint()` fx used in your `mutate`, but you don't return the `glm.1` model anywhere. — Steven, Dec 01 '20 at 22:51
@Steven ah sorry, The last piece of code (e.g. lines regarding `plot` & `curve` prediction) is only applicable to a single output from the `glm.1` model. I'm trying to figure out how to modify it to output the plot and curves in correspondence to values generated from the `get_midpoint()` function, which I'm still having trouble with. edit: Is it possible to plot all of the glm.1 values from the subgroups of data using ggplot or does it require some sort of function? — LizJu, Dec 02 '20 at 00:13
@LizJu I'm still unsure if I understand exactly what you're looking for. It reads to me like you want to model `coderesponse~stimulus` as a `glm`, grouped by `subject`, then plot the data and each model on the same figure. If that's the case, easy. `ggplot()` can plot models for you. If it's something else, I'm missing a key component for my understanding. — Steven, Dec 02 '20 at 16:52

score 0 · Answer 1 · answered Dec 02 '20 at 17:00

I think what you're trying to do is the following:

library(ggplot2)
library(dplyr)

df %>%
  ggplot() +
  aes(x = stimulus, y = coderesponse, colour = factor(subject)) +
  geom_point() +
  geom_smooth(method = 'glm', method.args = list(family = binomial(link = 'logit')), se = FALSE) +
  scale_colour_discrete(name = "Subject") +
  theme(legend.position = "bottom")

This takes your original `df`, plots the data colored by subject, and has `geom_smooth()` fit a separate logistic glm for each subject (one per colour group). If you need the models themselves, for example to predict from them, fit each glm outside of `geom_smooth()`. There may be a way to reuse the ggplot-fitted models instead of spending extra computation refitting them.
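
If you do fit the models outside of `geom_smooth()`, one possible way to "qualify" curves is to keep each fitted model in a list-column, pull out numbers that describe the curve (here the slope and the midpoint), and filter on those. The following is only a rough sketch under stated assumptions: the data are simulated stand-ins rather than your data, the helper names `fit_glm()` and `predict_curve()` are made up for illustration, and the thresholds in the `filter()` call are arbitrary placeholders that you would replace with your own criteria.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Simulated stand-in data (an assumption, not the question's data):
# subject 1 follows a sigmoid in stimulus, subject 2 ignores the stimulus.
set.seed(42)
stimulus <- rep(1:50, times = 2)
sim <- bind_rows(
  tibble(subject = 1, stimulus = stimulus,
         coderesponse = rbinom(100, 1, plogis(0.3 * (stimulus - 25)))),
  tibble(subject = 2, stimulus = stimulus,
         coderesponse = rep(c(0, 1), times = 50))
)

# Hypothetical helper: return the fitted model itself, not just the midpoint.
fit_glm <- function(data) {
  glm(coderesponse ~ stimulus, family = binomial(link = "logit"), data = data)
}

# Hypothetical helper: predicted probabilities over a grid of stimulus values.
predict_curve <- function(model, x = seq(1, 50, length.out = 200)) {
  p <- predict(model, newdata = data.frame(stimulus = x), type = "response")
  tibble(stimulus = x, p = unname(p))
}

# One row per group, with the model and its curve descriptors alongside the data.
fits <- sim %>%
  nest(data = -subject) %>%
  mutate(
    model    = map(data, fit_glm),
    slope    = map_dbl(model, ~ unname(coef(.x)[2])),
    midpoint = map_dbl(model, ~ unname(-coef(.x)[1] / coef(.x)[2]))
  )

# Placeholder qualification rule: a rising curve whose midpoint lies
# inside the tested stimulus range.
qualified <- fits %>%
  filter(slope > 0.05, midpoint >= 1, midpoint <= 50)

# Predicted curves for the qualifying groups only.
curves <- qualified %>%
  mutate(curve = map(model, predict_curve)) %>%
  select(subject, curve) %>%
  unnest(cols = curve)

# Original rows that belong to the qualifying groups.
kept_rows <- qualified %>%
  select(subject, midpoint, slope, data) %>%
  unnest(cols = data)
```

`kept_rows` is the part that tracks the corresponding data: it holds only the rows from groups whose curve passed the rule, with the midpoint and slope attached. `curves` is in long format, so something like `ggplot(curves, aes(stimulus, p, colour = factor(subject))) + geom_line()` should draw just the qualifying curves. For your real data you would nest by `subject`, `stim` and `block` instead of `subject` alone.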
