I am using the data found here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system. In RStudio, I have imported the CSV file as BRFSS2015. Below is the code I am trying to run. I remove outliers from PA1MIN_, then turn MARITAL into a factor, and finally try to create a boxplot. The result looks weird to me. Is there something wrong with my graph syntax?

PA1MIN_ <- BRFSS2015$PA1MIN_
upper_PA1MIN_ <- quantile(PA1MIN_, 0.997, na.rm=TRUE)
lower_PA1MIN_ <- quantile(PA1MIN_, 0.003, na.rm=TRUE)
out_PA1MIN_ <- which(PA1MIN_ > upper_PA1MIN_ | PA1MIN_ < lower_PA1MIN_)
BRFSS2015_noout <- subset(BRFSS2015, PA1MIN_ > lower_PA1MIN_ & 
                 PA1MIN_ < upper_PA1MIN_)

MARITAL <- c('MARITAL')
BRFSS2015[MARITAL] <- lapply(BRFSS2015[MARITAL], factor)

ggplot(BRFSS2015_noout) +
  geom_boxplot(aes(PA1MIN_, MARITAL), na.rm=T)

Here is the graph I am getting: [picture of graph]

jakdar
  • What looks "weird"? – Basti Oct 10 '22 at 13:33
  • Code looks fine, though hard to read because you don't use spaces after commas and such. You're gonna have to give us more to go on – Gregor Thomas Oct 10 '22 at 14:55
  • The link to the data is a start, but most people don't like to download and import data to answer a quick question. If you instead shared a few rows of data in the question, e.g. the output of `dput(BRFSS2015[1:20, c("relevant", "column", "names")])`, it would be much easier for us to test your code. It would also help if you could be more specific about the problem than "I feel like the result looks weird". Could you post a picture of the boxplot and say what about it looks weird? – Gregor Thomas Oct 10 '22 at 14:57
  • It also can be a little confusing when you create vectors outside of data frames that have the same name as columns from the data frame. You have a column named `PA1MIN_`. You also have a vector separate from the data frame named `PA1MIN_`. When you do `subset(BRFSS2015, PA1MIN_ > lower_PA1MIN_ & PA1MIN_ < upper_PA1MIN_)`, it's not clear which one is being used. They **should** be identical at that point, so it shouldn't matter, but it's a bad habit to introduce ambiguity like that. – Gregor Thomas Oct 10 '22 at 15:00
  • You've asked a lot of questions in the last 2 days. At a glance, not a single one of them includes a small, reproducible example in the question. If you'd like to use Stack Overflow effectively, I'd strongly suggest reading both the general site guidance [on creating minimal reproducible examples](https://stackoverflow.com/help/minimal-reproducible-example) and R-specific FAQ [How to make a great reproducible example in R](https://stackoverflow.com/q/5963269/903061). Your questions will get answered very quickly if you follow those guides. – Gregor Thomas Oct 10 '22 at 15:06
  • Sorry about the lack of spacing. I didn't realize that was making it so difficult for people to read. I have edited my code above to include spaces now. Thank you for pointing me to the site rules about the best way to ask questions. I was not aware the way I was describing the question was confusing. I have included a picture of the graph I am talking about. I think it looks weird because of the long string of data points that appears to be following every box. I am not really sure a boxplot is supposed to look like that, so that makes me wonder if something is wrong in my syntax. – jakdar Oct 11 '22 at 02:24
  • Okay, I think I found the way to use the `dput()` function you are talking about here: https://stackoverflow.com/questions/49994249/example-of-using-dput. I had never heard of that before. I am new to R and the world of programming. I apologize for the infraction. Can you please answer my question? I found all of the tips you provided helpful. – jakdar Oct 11 '22 at 02:42
  • Your boxplots look typical of a distribution with a lot of values and a substantial right skew. I don't think you've done anything wrong. I don't know what `PA1MIN_` is, but you might consider putting it on a log scale, either by adding `+ scale_x_log10()` to your plot or by using `x = log(PA1MIN_)` in the aesthetic. (Use `log1p` if the data has zeros.) – Gregor Thomas Oct 11 '22 at 03:27
  • Glad you took the comments to heart - if you start asking reproducible questions, you'll start getting helpful answers much more quickly. – Gregor Thomas Oct 11 '22 at 03:28
  • I agree with Gregor that this looks fairly typical of data with tons of observations per group. If you run `diamonds %>% ggplot(aes(x=cut, y=price))+ geom_boxplot()+ coord_flip()` you will find something very similar. – Shawn Hemelstrand Oct 11 '22 at 03:39
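
To illustrate the point made in the comments above, here is a minimal sketch on simulated data. The log-normal `minutes` variable and the `group` labels are made up for illustration and are not taken from BRFSS. It counts how many values per group fall beyond the default 1.5 × IQR whiskers (the values `geom_boxplot()` draws as individual points), then builds the same boxplot on a log scale as suggested.

```r
library(ggplot2)

set.seed(42)

# Simulated stand-in for a right-skewed variable such as PA1MIN_ (assumption)
sim <- data.frame(
  group   = factor(rep(c("A", "B", "C"), each = 20000)),
  minutes = rlnorm(60000, meanlog = 4, sdlog = 1)
)

# Count the values lying beyond 1.5 * IQR from the hinges in each group;
# with many skewed observations there are a lot of them
n_out <- sapply(
  split(sim$minutes, sim$group),
  function(v) length(boxplot.stats(v)$out)
)
print(n_out)

# Horizontal boxplots on the raw scale, then the same plot on a log10 scale
p_raw <- ggplot(sim, aes(x = minutes, y = group)) + geom_boxplot()
p_log <- p_raw + scale_x_log10()
```

Printing `p_raw` should show a long trail of points to the right of each box, much like the plot in the question, while `p_log` should look far more symmetric. If the real data contain zeros, `scale_x_log10()` will drop them with a warning, which is why `log1p` was suggested for that case.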

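
On the naming point raised in the comments, one possible way to restructure the preprocessing is sketched below. It runs on a small made-up data frame called `toy` rather than the real `BRFSS2015`, and the names `cutoffs`, `keep`, and `toy_noout` are hypothetical. It also assumes the factor conversion is meant to apply to the filtered data, so it converts `MARITAL` before subsetting; in the question, the conversion is applied to `BRFSS2015` after `BRFSS2015_noout` has already been created.

```r
set.seed(1)

# Toy stand-in for BRFSS2015; all values are made up for illustration
toy <- data.frame(
  PA1MIN_ = c(rexp(995, rate = 1 / 200), NA, NA, NA, 0, 50000),
  MARITAL = sample(1:6, 1000, replace = TRUE)
)

# Convert to a factor first so the filtered copy inherits the factor column
toy$MARITAL <- factor(toy$MARITAL)

# Store the cutoffs under a name that cannot be confused with the column
cutoffs <- quantile(toy$PA1MIN_, c(0.003, 0.997), na.rm = TRUE)

# Keep rows strictly inside the cutoffs; rows with NA are dropped,
# which matches what subset() does with NA conditions
keep <- !is.na(toy$PA1MIN_) &
  toy$PA1MIN_ > cutoffs[[1]] &
  toy$PA1MIN_ < cutoffs[[2]]
toy_noout <- toy[keep, ]
```

Referring to the column explicitly as `toy$PA1MIN_` means there is never a separate vector with the same name in the workspace, so it is always clear which values are being compared.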