Here is my data. It's a table of baseball statistics.

moneyball_training_data <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

Here is some of what I'm hoping to accomplish:

  • Figure out which variables have missing (NA) values
  • Create a histogram for each variable that has NA values, to look at its distribution
  • Do mean or median imputation as appropriate

For the first part, I looked at a summary of the data to see which variables have NA values. As it turns out, four of the 16 variables do.

summary(moneyball_training_data)
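
The NA counts could presumably also be computed directly instead of being read off the summary output. A minimal sketch in base R, using a few columns copied from the sample above; `sample_df`, `na_counts`, and `na_columns` are illustrative names, not part of my actual script:

```r
# Small stand-in for moneyball_training_data (values copied from the sample above)
sample_df <- data.frame(
  TARGET_WINS      = c(39L, 70L, 86L, 70L, 82L, 75L),
  TEAM_BASERUN_SB  = c(NA, 37L, 46L, 43L, 49L, 107L),
  TEAM_BATTING_HBP = rep(NA_integer_, 6),
  TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 168L, 149L)
)

# Number of NA values in each column
na_counts <- colSums(is.na(sample_df))

# Names of the columns that contain at least one NA
na_columns <- names(na_counts[na_counts > 0])

print(na_counts)
print(na_columns)
```

On the full data frame, the same two lines would be applied to `moneyball_training_data` instead of `sample_df`.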

I then gather the data into long format so I can use facet_wrap in ggplot.

library(tidyr)
gathered_moneyball_training_data <- gather(moneyball_training_data, key = "type", value = "number", 2:17)

I filter the long data frame down to the variables that have NA values. I picked these variables by eyeballing the output of the summary call.

library(dplyr)
filtered_gathered_moneyball_training_data <- gathered_moneyball_training_data %>%
  filter(type %in% c("TEAM_BATTING_SO", "TEAM_BATTING_HBP", "TEAM_PITCHING_SO", "TEAM_FIELDING_DP"))
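
The gather and filter steps might be combined so that the NA columns are selected programmatically rather than typed out by hand. This is only a sketch, assuming dplyr 1.0 or later (for `where()`) and tidyr 1.0 or later (for `pivot_longer()`); `sample_df` and `long_na_df` are illustrative names, and the values are copied from the sample above:

```r
library(dplyr)
library(tidyr)

# Small stand-in for moneyball_training_data (values copied from the sample above)
sample_df <- data.frame(
  INDEX            = 1:6,
  TARGET_WINS      = c(39L, 70L, 86L, 70L, 82L, 75L),
  TEAM_BASERUN_SB  = c(NA, 37L, 46L, 43L, 49L, 107L),
  TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 168L, 149L)
)

# Keep only the columns that contain at least one NA,
# then reshape those columns to long format
long_na_df <- sample_df %>%
  select(where(~ any(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "type", values_to = "number")

print(long_na_df)
```

The resulting `type` and `number` columns match the names used in the gather call, so the plotting code should work on `long_na_df` unchanged.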
  

I then plot these four variables with ggplot, using facet_wrap.

library(ggplot2)
ggplot(filtered_gathered_moneyball_training_data, aes(x = number)) +
  geom_histogram() +
  facet_wrap(~ type)

[Image: histograms of the four variables, faceted by type]

My problem is that this graph isn't very informative. If a variable's distribution were roughly normal, I would impute its NA values with the mean; if it were skewed, I would use the median.
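
For the imputation step itself, something like the following is what I have in mind. It is a minimal base R sketch; `impute_na` is a hypothetical helper I made up for illustration, and the choice between mean and median still has to be made by looking at the distribution:

```r
# Replace NA values in a numeric vector with the mean or median
# of the observed (non-NA) values
impute_na <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  fill <- if (method == "mean") mean(x, na.rm = TRUE) else median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

# TEAM_FIELDING_DP values from the sample above
dp <- c(NA, 155, 153, 156, 168, 149)

dp_mean   <- impute_na(dp, "mean")    # NA replaced by the mean of the observed values
dp_median <- impute_na(dp, "median")  # NA replaced by the median of the observed values

print(dp_mean)
print(dp_median)
```

A column that is entirely NA, like TEAM_BATTING_HBP in the sample, has no observed values to impute from, so this helper would not produce anything useful for it.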

Is there a way to give each panel of the facet_wrap its own scale? If I were to plot each variable by itself, ggplot would find good x-axis limits to make the graph readable, but with all of them together on a shared axis it doesn't work well.
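
One option that appears to fit here is the `scales` argument of `facet_wrap()`, which accepts "fixed" (the default), "free", "free_x", or "free_y". A minimal sketch, not tested on the full data set; `long_df` is an illustrative name and its values are copied from rows 2 to 6 of the sample above:

```r
library(ggplot2)

# Long-format data with two variables on very different scales
long_df <- data.frame(
  type   = rep(c("TEAM_PITCHING_SO", "TEAM_FIELDING_DP"), each = 5),
  number = c(1082, 917, 928, 920, 973, 155, 153, 156, 168, 149)
)

# scales = "free_x" lets each panel choose its own x-axis limits;
# "free" would free both axes, and "fixed" shares them across panels
p <- ggplot(long_df, aes(x = number)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ type, scales = "free_x")

print(p)
```

Using "free" instead of "free_x" would also let the count axis differ between panels, which may help when the variables have different numbers of non-missing values.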

hachiko