R Regex seemingly not working properly in Linux

Question

I'm trying to scrape the Fangraphs page of alphabetical player indices to get a single-column data frame with one row per letter reference.

The code below works on Windows with R 3.4.1, but I cannot get it to work on Linux at all, and I can't figure out what is going wrong or what is different.

library(XML)

# Scrape to get the webpage
url <- "http://www.fangraphs.com/players.aspx?"
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters") 
letterz$letters <- as.character(letterz$letters)

# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.

# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)

# Replacing patterns like "AzB   Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)

# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters") 
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)

From what I could find, the only real difference between Windows and Linux regex handling would be the line-break convention. I went back and tested whether that was making the difference, but still got no change.

I also tried to substitute the R-specific "[[:space:]]" and "[[:upper:]]" style notation with the more standardized "\s" to see if that would fix anything.

As for fixes, I know there are a handful of other packages I could use to get the result I'm looking for. More generally, though, are there differences in how Windows and Linux implement regex that I'm unaware of? And if so, how would I account for them in gsub to get the same result I get on Windows?

Thanks.

You have non-breaking spaces instead of regular white spaces. See here: https://stackoverflow.com/questions/43734293/remove-non-breaking-space-character-in-string-in-r-on-linux — nicola, Sep 16 '17 at 19:58
Oh wow, awesome thanks! I ended up getting it to work by mirroring the page you linked to with: `letterz$letters <- gsub("(*UCP)[A-Z]+?\\s+?\\s+?\\s+", ",", letterz$letters, perl = TRUE)` Does enabling Unicode explain the difference between Windows and Linux here? I'd definitely like to understand why this operates differently between the two. — ImTerribleWithComputers, Sep 16 '17 at 20:08
The version you have just does not match Unicode strings with default TRE engine POSIX character classes. Why is it like that - I have no idea, I am not the author of those R builds. — Wiktor Stribiżew, Sep 18 '17 at 21:07
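
A minimal sketch of what the comments above describe, assuming the scraped strings contain non-breaking spaces (U+00A0) where regular spaces were expected. The sample string `x` is made up for illustration and is not taken from the Fangraphs page. It shows how to confirm which characters are present, and two ways to handle them: normalizing to regular spaces first, or using PCRE with `(*UCP)` as in the comment.

```r
# Hypothetical sample: "Az" and "Ba" separated by three non-breaking
# spaces (U+00A0), standing in for the scraped text.
x <- "Az\u00a0\u00a0\u00a0Ba"

# Inspect the code points: 160 is a non-breaking space, 32 a regular space.
code_points <- utf8ToInt(x)
print(code_points)

# Option 1: replace each non-breaking space with a regular space up front,
# so that patterns written for ordinary whitespace apply afterwards.
normalized <- gsub("\u00a0", " ", x, fixed = TRUE)
cleaned_posix <- gsub("[[:space:]]+", ",", normalized)

# Option 2: use the PCRE engine with (*UCP), so that \s follows
# Unicode properties and also covers U+00A0.
cleaned_pcre <- gsub("(*UCP)\\s+", ",", x, perl = TRUE)

print(cleaned_posix)
print(cleaned_pcre)
```

Normalizing first keeps the rest of the cleanup code unchanged and does not depend on how the default regex engine treats `[[:space:]]` on a given platform.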
