I'm trying to scrape the webpage of Fangraphs with alphabetical player indices to get a single column dataframe of each letter reference.
I have been able to get the code below to successfully work on a Windows version of R 3.4.1, but cannot get it to work on the Linux side at all, and I can't figure out what exactly is going wrong/different.
library(XML)
# Scrape to get the webpage
url <- paste0("http://www.fangraphs.com/players.aspx?")
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.
# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)
# Replacing patterns like "AzB Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)
# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)
From what I could find, the only real difference between Windows/Linux regex would be the linebreak implementation, which I went back and tried to see if that was making the difference... but still got no change.
I also tried to substitute the R-specific "[[:space:]]" and "[[:upper:]]" style notation with the more standardized "\s" to see if that would fix anything.
As for fixes, I know there are a handful of other packages that I can look into to simply get the result I'm looking for, but more generally, are there just simply differences in how Windows and Linux implement regex that I'm unaware of and am oblivious to? And if so, how would I implement them into gsub to get the same result I get on Windows?
Thanks.