
I don't know if it is a good question or not.

Here is the case: say I have a scale (continuous) dependent variable and a bunch of independent variables. My ultimate goal is to build a model that predicts/estimates the dependent variable from these independent variables. I believe this is a common setting.

The point is that I know the physical meaning of all the variables, but I don't know how they are related in detail (or even whether they are related at all). I want to build the model from an analysis/explanation point of view, so that I can get real-world insights from it instead of a black box.

My approach is to use a CHAID-like algorithm to build a decision-tree model. At every branch, I want to statistically test each independent variable to see whether it is related to the dependent variable. Then, based on the test results, I want to pick the most powerful one to split on.

The problem is that, unlike in the CHAID algorithm, where most variables are categorical, in my case the dependent variable is scale and the independent variables are either categorical or scale. This means I might need different statistical tests for different variables, e.g. a t-test or ANOVA for categorical ones and regression for continuous ones. How should I fairly compare these results to pick the most powerful one (like the correction step in CHAID)?
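To make the plan concrete, here is a minimal sketch of one branch step. It assumes the p-value is used as the common scale across test types: a one-way ANOVA F-test for categorical predictors and a zero-slope test from simple linear regression for continuous ones. The function names, variable names, and synthetic data are all hypothetical. The Bonferroni factor here is simply the number of candidate predictors, which is a simplification and not CHAID's own category-merging correction; it does not change the ranking and only matters for a stopping threshold.

```python
import numpy as np
from scipy import stats


def split_p_value(x, y, is_categorical):
    """Unadjusted p-value for the hypothesis that x is unrelated to y."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    if is_categorical:
        # One-way ANOVA across the levels of x
        groups = [y[x == level] for level in np.unique(x)]
        return stats.f_oneway(*groups).pvalue
    # Simple linear regression: test that the slope is zero
    return stats.linregress(x.astype(float), y).pvalue


def pick_split_variable(predictors, y, categorical):
    """Return the predictor with the smallest adjusted p-value.

    predictors: dict mapping name -> 1-D array (hypothetical layout)
    categorical: set of names to treat as categorical
    """
    m = len(predictors)
    adjusted = {}
    for name, x in predictors.items():
        p = split_p_value(x, y, name in categorical)
        adjusted[name] = min(1.0, p * m)  # plain Bonferroni over m candidates
    best = min(adjusted, key=adjusted.get)
    return best, adjusted


# Synthetic data, for illustration only: y depends on temperature alone
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(size=n)
machine = rng.choice(["a", "b", "c"], size=n)
humidity = rng.normal(size=n)
y = 3.0 * temperature + rng.normal(scale=0.5, size=n)

best, adjusted = pick_split_variable(
    {"temperature": temperature, "machine": machine, "humidity": humidity},
    y,
    categorical={"machine"},
)
print(best, adjusted)
```

Note that the regression test only detects a linear trend, so a strongly non-linear relationship could be ranked too low under this scheme.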

Any ideas on any part of my plan would be greatly appreciated. Thanks!

G. Yu
  • I suggest that you visually inspect scatterplots of the dependent variable vs. each independent variable to see whether there is any obvious relationship, such as a log or exponential shape. This is usually easy to do and sometimes yields helpful results. – James Phillips Aug 24 '18 at 21:18
  • Good advice! I've already done that, and I observe certain patterns for certain independent variables. But the independent variables are highly correlated, so I guess I need to slice the data to get more insights, which brings me back to my question: how to find the most significant variable. – G. Yu Aug 24 '18 at 22:21
  • This is an interesting question, but it's beyond the scope of SO; I think stats.stackexchange.com is more suitable. That said, I suspect that using significance tests isn't meaningful in this context, since with sufficient data, almost all variables are related enough to pass a significance test. My advice is to take a Bayesian model-averaging approach. A web search for that should find some resources. I can say more about it if you like, perhaps after you open a question on stats.stackexchange.com. – Robert Dodier Aug 25 '18 at 00:34
  • Thanks Robert! I've posted the question there. I tried it and found that results such as simple linear regression don't make much sense because of the small slope. Right now, I'm thinking of binning the continuous independent variables so that all the tests will be the same ANOVA/chi-square type of test. I will read about the Bayesian model-averaging approach. – G. Yu Aug 27 '18 at 14:13
  • It seems Bayesian model averaging needs multiple models before the combining step, while I'm still trying to build my first model. – G. Yu Aug 27 '18 at 14:28
  • By the way, you mentioned that the results of significance tests on large data sets are not meaningful. My understanding is that this is because even a small correlation, caused by noise or something similarly minor, will have a very small p-value. But I think it is still meaningful to compare the results of different tests to find the most significant one. – G. Yu Aug 27 '18 at 17:41
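
The binning idea raised in the comments could be sketched as follows, assuming quantile-based bins so that each bin holds roughly the same number of observations. The function name, the bin count, and the synthetic data are hypothetical; the number of bins is a tuning choice that affects the resulting p-value.

```python
import numpy as np
from scipy import stats


def binned_anova_p_value(x, y, n_bins=4):
    """Bin a continuous predictor by quantiles, then run a one-way ANOVA on y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior cut points at equally spaced quantiles of x
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    labels = np.digitize(x, edges)  # bin index in 0 .. n_bins - 1
    groups = [y[labels == k] for k in np.unique(labels)]
    return stats.f_oneway(*groups).pvalue


# Synthetic data, for illustration only: a U-shaped relationship,
# where a straight-line fit has essentially no slope to detect
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 200)
y = x**2 + rng.normal(scale=0.3, size=x.size)

binned_p = binned_anova_p_value(x, y)
slope_p = stats.linregress(x, y).pvalue
print(binned_p, slope_p)
```

With every predictor reduced to categories, all candidates go through the same F-test, at the cost of discarding the ordering information within each bin.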

0 Answers