I am trying to fit a regression on NHL stats using the variables goals, assists, and points. However, the output differs from what we want. Instead of one coefficient for each predictor we specified (goals, assists, and points), we get a separate coefficient for every distinct value of each predictor. See below:

library(rvest)  # provides read_html(), html_nodes(), html_table()
urlname <- "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
scraped_data <- read_html(urlname)
table.nhl <- html_nodes(scraped_data, "table")

scraped.nhl.data <- as.data.frame(html_table(table.nhl, header = TRUE))
colnames(scraped.nhl.data) = scraped.nhl.data[1, ] # the first row will be the header
scraped.nhl.data = scraped.nhl.data[-1, ]          # removing the first row.
# drop the repeated header rows embedded in the table body
scraped.nhl.data <- scraped.nhl.data[scraped.nhl.data[, 1] != "Rk", ]

pittsburgh <- scraped.nhl.data[scraped.nhl.data$Tm == "PIT", ]
pittsburgmodel <- pittsburgh[, c( "G", "A", "PTS")]
pittsburgmodel <- pittsburgmodel[complete.cases(pittsburgmodel), ]
View(pittsburgmodel)
names(pittsburgmodel) <- c("goals", "assists", "points")
attach(pittsburgmodel)
fit = lm(goals ~ ., data = pittsburgmodel)
summary(fit)

Output

Coefficients: (18 not defined because of singularities)
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -3.719e-15  2.835e-15 -1.312e+00    0.247    
assists1     2.000e+00  6.945e-15  2.880e+14   <2e-16 ***
assists10    4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists12    1.800e+01  6.945e-15  2.592e+15   <2e-16 ***
assists13    5.000e+00  6.945e-15  7.199e+14   <2e-16 ***
assists2     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists20    2.900e+01  6.945e-15  4.175e+15   <2e-16 ***
assists21    1.100e+01  6.945e-15  1.584e+15   <2e-16 ***
assists22    7.000e+00  6.945e-15  1.008e+15   <2e-16 ***
assists23    4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists25    1.300e+01  6.945e-15  1.872e+15   <2e-16 ***
assists26    2.200e+01  6.945e-15  3.168e+15   <2e-16 ***
assists3     2.000e+00  5.305e-15  3.770e+14   <2e-16 ***
assists4     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists42    9.000e+00  6.945e-15  1.296e+15   <2e-16 ***
assists5     3.000e+00  6.945e-15  4.319e+14   <2e-16 ***
assists56    4.200e+01  6.945e-15  6.047e+15   <2e-16 ***
assists58    3.400e+01  6.945e-15  4.895e+15   <2e-16 ***
assists6     2.000e+00  6.945e-15  2.880e+14   <2e-16 ***
assists60    2.900e+01  6.945e-15  4.175e+15   <2e-16 ***
assists8     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
points1      1.000e+00  6.945e-15  1.440e+14   <2e-16 ***
points10     2.000e+00  8.967e-15  2.231e+14   <2e-16 ***
points12            NA         NA         NA       NA    
points13    -1.000e+00  8.967e-15 -1.115e+14   <2e-16 ***
points14            NA         NA         NA       NA    
points18            NA         NA         NA       NA    
points27            NA         NA         NA       NA    
points29            NA         NA         NA       NA    
points3             NA         NA         NA       NA    
points30            NA         NA         NA       NA    
points31    -1.000e+00  8.967e-15 -1.115e+14   <2e-16 ***
points32            NA         NA         NA       NA    
points38            NA         NA         NA       NA    
points4     -2.000e+00  8.967e-15 -2.231e+14   <2e-16 ***
points48            NA         NA         NA       NA    
points49            NA         NA         NA       NA    
points5             NA         NA         NA       NA    
points51            NA         NA         NA       NA    
points6             NA         NA         NA       NA    
points8             NA         NA         NA       NA    
points89            NA         NA         NA       NA    
points92            NA         NA         NA       NA    
points98            NA         NA         NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.34e-15 on 5 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 3.72e+30 on 25 and 5 DF,  p-value: < 2.2e-16

Desired output

                 Estimate     Std. Error      t value   Pr(>|t|)
(Intercept)        value         value          value     value
Goals              value         value          value     value 
Assists            value         value          value     value
Xin
    use `str(pittsburgmodel)` to look at the data types for each of your columns. It looks like the values that look numeric aren't actually coded as numeric values. – MrFlick Nov 12 '18 at 19:54
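
A minimal sketch of the check suggested in the comment above, run on a small made-up data frame (`toy` and its values are hypothetical, not the scraped data). Columns that were read in as text are reported as `chr` (or `Factor`) rather than `num` or `int`:

```r
# Hypothetical stand-in for the scraped table: numbers stored as text
toy <- data.frame(
  goals   = c("1", "10", "12"),
  assists = c("2", "4", "18"),
  stringsAsFactors = FALSE
)

str(toy)                            # each column is reported as chr, not num
col_classes <- sapply(toy, class)   # named vector of column classes
print(col_classes)
```

If the real `pittsburgmodel` shows the same thing, `lm` will treat every column as categorical.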

2 Answers

Before calling lm, convert the columns to numeric:

pittsburgmodel$points <- as.numeric(as.character(pittsburgmodel$points))
pittsburgmodel$assists <- as.numeric(as.character(pittsburgmodel$assists))

Furthermore, don't use the attach command, and choose clearer names: avoid calling a dataset a "model".
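
To illustrate why the conversion matters, here is a small sketch on made-up numbers (the data frame `toy` and the response `gp` are assumptions for illustration only). When a predictor is text or a factor, `lm` fits one dummy coefficient per distinct value; once it is numeric, it gets a single slope. Note the `as.character()` step: calling `as.numeric()` directly on a factor returns the internal level codes rather than the printed values.

```r
# Hypothetical data: predictors stored as factors, as after a messy import
toy <- data.frame(
  gp      = c(10, 25, 40, 60, 82, 70),
  goals   = factor(c("1", "3", "8", "12", "29", "20")),
  assists = factor(c("2", "4", "9", "18", "60", "26"))
)

# Factor predictor: one coefficient per level beyond the first
fit_factor <- lm(gp ~ goals, data = toy)
n_factor <- length(coef(fit_factor))   # intercept plus one dummy per extra level

# as.numeric() on a factor gives level codes, not the values
codes  <- as.numeric(toy$goals)
values <- as.numeric(as.character(toy$goals))

# After conversion, each predictor gets a single coefficient
toy$goals   <- values
toy$assists <- as.numeric(as.character(toy$assists))
fit_numeric <- lm(gp ~ goals + assists, data = toy)
print(names(coef(fit_numeric)))
```

The second fit has the shape of the desired output in the question: an intercept and one row per predictor.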

paoloeusebi
    This is probably the solution, but can you please explain in more detail? (I.e., you're converting factors (categorical variables) to numeric; there are [previous questions](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information) you can link, as well as [R FAQs](https://cran.r-project.org/doc/FAQ/R-FAQ.html#How-do-I-convert-factors-to-numeric_003f).) – Ben Bolker Nov 12 '18 at 23:25
It's best to spend a little bit more time going upstream and fixing the information in the table. This example uses the XML package, because as pointed out by this blog post, the XML::readHTMLTable function has a skip parameter, which html_table apparently doesn't ...

Read raw HTML:

urlname <- "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
rr <- readLines(urlname)

First try at reading: header + skipping row 1

library(XML)
h1 <- readHTMLTable(rr, header=TRUE,skip=1)$stats

There are bad (non-numeric) rows interspersed in the data, which are apparently extra, internal 'header' rows. Define a function to find them:

br  <- function(i,x=h1) { 
    suppressWarnings(which(is.na(as.numeric(as.character(x[[i]])))))
}
badrows <- br(1)
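
As a quick illustration of what `br` picks up, here is the same idea on a tiny made-up table (`toy` is hypothetical) in which a repeated header row sits in the middle of the data. The helper is restated under a different name (`find_bad_rows`, also hypothetical) with an explicit data argument so the snippet runs on its own:

```r
# Hypothetical table with a stray internal header in row 3
toy <- data.frame(
  Rk = c("1", "2", "Rk", "3"),
  G  = c("5", "0", "G", "12"),
  stringsAsFactors = FALSE
)

# Rows whose value in column i cannot be parsed as a number
find_bad_rows <- function(i, x) {
  suppressWarnings(which(is.na(as.numeric(as.character(x[[i]])))))
}

bad <- find_bad_rows(1, toy)

# Keep every row that is not flagged; safe even when nothing is flagged
clean <- toy[setdiff(seq_len(nrow(toy)), bad), ]
print(clean)
```

Using `setdiff()` on the row indices rather than `toy[-bad, ]` avoids dropping every row in the case where `bad` is empty.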

Try again, skipping 'bad' rows:

h2 <- readHTMLTable(rr, header=TRUE,skip=c(1,badrows+1))$stats

Define numeric columns as all but these 4:

numcols <- setdiff(names(h2),c("Player", "Tm", "Pos", "ATOI"))

Convert columns that should be numeric:

for (i in numcols) {
    h2[[i]] <- as.numeric(as.character(h2[[i]]))
}
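
The same conversion can also be written without an explicit loop. A sketch on a made-up data frame (`toy` is hypothetical; `numcols` is built the same way as above):

```r
# Hypothetical data: stat columns stored as factors
toy <- data.frame(
  Player = c("A", "B", "C"),
  Tm     = c("PIT", "PIT", "NYR"),
  G      = factor(c("10", "2", "31")),
  A      = factor(c("20", "5", "40"))
)

# All columns except the known text columns should be numeric
numcols <- setdiff(names(toy), c("Player", "Tm", "Pos", "ATOI"))

# Convert via character so factor level codes are not used by mistake
toy[numcols] <- lapply(toy[numcols], function(col) as.numeric(as.character(col)))
str(toy)
```

Assigning into `toy[numcols]` replaces only those columns and leaves `Player` and `Tm` untouched.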
Ben Bolker