How to match a specific string using regular expressions in R

Question

I am trying to extract some financial data using regular expressions in R.

I have used a RegEx tester, http://regexr.com/, to make a regular expression that SHOULD capture the information I need - the problem is just that it doesn't...

I have extracted data from this URL: http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components

I want to match the company names (DANSKE.CO, DSV.CO, etc.), and I have created the following regular expression, which matches them on regexr.com:

.q\?s=(\S*\\)

But it doesn't work in R. Can someone help me figure out how to go about this?

Use double backslashes in R strings when defining shorthand character classes like `\s` -> `"\\s"`. — Wiktor Stribiżew, Mar 29 '16 at 18:56
You will probably need to start by escaping special characters, such as \ with another \. — Roman Luštrik, Mar 29 '16 at 18:56
Obligatory response to somebody posting about regex-ing HTML... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — cory, Mar 29 '16 at 20:08
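
To illustrate the two comments about escaping, here is a minimal sketch in base R. The HTML fragment is made up for illustration (the actual Yahoo markup may differ), and the pattern is one possible rewrite rather than the asker's original expression.

```r
# Hypothetical HTML fragment; the real page markup is an assumption here.
html <- '<b><a href="/q?s=DANSKE.CO">DANSKE.CO</a></b> <b><a href="/q?s=DSV.CO">DSV.CO</a></b>'

# Inside an R string literal every regex backslash is doubled:
# the regex \? (a literal question mark) is written "\\?".
# The lookbehind (?<=...) keeps "q?s=" out of the matched text.
pattern <- "(?<=q\\?s=)[A-Z0-9.-]+"

# gregexpr() locates all matches; perl = TRUE is needed for the lookbehind.
tickers <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
print(tickers)
```

A pattern copied straight from a tester such as regexr.com usually needs this backslash doubling before R will accept it as a string.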

score 2 · Answer 1 · answered Mar 29 '16 at 19:09

Instead of messing around with regular expressions I would use XPath for something like fetching HTML content:

library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
#  [1] "CARL-B.CO"   "CHR.CO"      "COLO-B.CO"   "DANSKE.CO"   "DSV.CO"     
#  [6] "FLS.CO"      "GEN.CO"      "GN.CO"       "ISS.CO"      "JYSK.CO"    
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO"  "NOVO-B.CO"   "NZYM-B.CO"  
# [16] "PNDORA.CO"   "TDC.CO"      "TRYG.CO"     "VWS.CO"      "WDH.CO"

score 0 · Answer 2 · answered May 11 '16 at 03:20

Does this help? If not, post back, and I'll provide another suggestion.

library(XML)

stocks <- c("AXP","BA","CAT","CSCO")

for (s in stocks) {
      url <- paste0("http://finviz.com/quote.ashx?t=", s)
      webpage <- readLines(url)
      html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
      tableNodes <- getNodeSet(html, "//table")

      # ASSIGN TO STOCK NAMED DFS
      assign(s, readHTMLTable(tableNodes[[9]], 
                header= c("data1", "data2", "data3", "data4", "data5", "data6",
                          "data7", "data8", "data9", "data10", "data11", "data12")))

      # ADD COLUMN TO IDENTIFY STOCK 
      df <- get(s)
      df['stock'] <- s
      assign(s, df)
}

# COMBINE ALL STOCK DATA 
stockdatalist <- mget(stocks)
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]

# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep=",", 
            row.names=FALSE, col.names=FALSE)

# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
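
The assign()/get() bookkeeping in the loop above can also be avoided by collecting the tables in a list. This is a minimal sketch, assuming a hypothetical helper fetch_table() that stands in for the readLines/readHTMLTable step and returns made-up placeholder data so that it runs offline:

```r
# Hypothetical stand-in for the download-and-parse step in the loop above.
# It returns a small placeholder table (not real data) so the sketch runs
# without network access.
fetch_table <- function(s) {
  data.frame(data1 = c("field_a", "field_b"),
             data2 = c("value_a", "value_b"),
             stringsAsFactors = FALSE)
}

stocks <- c("AXP", "BA", "CAT", "CSCO")

# Build one data frame per ticker, with the ticker as the first column.
tables <- lapply(stocks, function(s) {
  data.frame(stock = s, fetch_table(s), stringsAsFactors = FALSE)
})

# Stack the per-ticker tables into a single data frame.
stockdata <- do.call(rbind, tables)
print(stockdata)
```

Because the stock column is created first, no column reordering is needed afterwards, and no temporary objects are left in the workspace.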
