
I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns an object in the GMT timezone rather than the user's current timezone, and it fails if passed a number greater than 299 (probably because Google only returns 300 stories per stock). This is somewhat in response to my own question on Stack Overflow, and relies heavily on this blog post.

tl;dr: how can I improve this function?

 getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);
    require(xts); require(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_York')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
    return(mydf) 

}
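The GMT issue mentioned at the top can be addressed by converting to whatever timezone the machine running the function reports, rather than a hard-coded one. A minimal sketch using lubridate (the sample date string is made up for illustration, since real data requires the feed):

```r
# Sketch: convert a parsed GMT timestamp to the local timezone of
# whoever runs the function, instead of hard-coding 'America/New_York'.
library(lubridate)

pubDate <- strptime("Sat, 23 Apr 2011 14:05:00",
                    format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")

# Sys.timezone() returns the local Olson name (it can be NA on some
# systems, in which case a fallback would be needed). with_tz changes
# only the printed representation; the underlying instant is unchanged.
localDate <- with_tz(pubDate, tzone = Sys.timezone())
```

Inside `getNews`, the same `with_tz(pubDate, tzone = Sys.timezone())` call would replace the hard-coded `'America/New_York'` line.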
Zach

1 Answer


Here is a shorter (and probably more efficient) version of your `getNews` function:

  getNews2 <- function(symbol, number){

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);  

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    mydf$pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL

    return(mydf)    
}

Moreover, there might be a bug in your code: when I tried it with symbol = 'WMT' it returned an error, while getNews2 seems to work fine for WMT. Check it out and let me know if it works for you.

PS. The description column still contains HTML code, but it should be easy to extract the text from it. I will post an update when I find time.
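Until that update lands, a naive regex-based strip (my own sketch, not part of the answer, and not robust to all HTML) gets most of the way:

```r
# Hypothetical helper: strip HTML tags and entities from the description
# field with regexes, then collapse runs of whitespace.
stripHtml <- function(x) {
  out <- gsub("<[^>]+>", " ", x)              # drop anything tag-shaped
  out <- gsub("&nbsp;", " ", out, fixed = TRUE)
  out <- gsub("\\s+", " ", out)               # collapse whitespace
  trimws(out)
}

stripHtml("<div><b>WMT</b> rallies&nbsp;today</div>")
# → "WMT rallies today"
```

A regex is fine for this feed's simple markup; a proper HTML parser (as the question's `html2text` call does) would be safer in general.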

Ramnath
  • That's great, thanks! When you update your code to parse the description column, do you think you could have it return an xts object? – Zach Apr 23 '11 at 14:05
  • I'm also trying to write a regular expression to extract the site name from the link. – Zach Apr 23 '11 at 14:12
  • @Zach To convert to xts object just write `myxts = as.xts(mydf[,names(mydf) != "pubDate"], order.by = mydf$pubDate)` – Ramnath Apr 23 '11 at 15:18
  • Any suggestions for parsing the "description" field? I'd like to remove all the extra stuff, and just be left with the first line of the article? – Zach Jul 01 '11 at 21:15
  • this code works `getNodeSet(doc, "//item/description")` but this code fails `getNodeSet(doc, "//item/description/div")`. What's going on here? (3rd section of your code) – Zach Jul 01 '11 at 21:46
  • @Zach. the reason `//item/description/div` does not work is because there is no `div` node under the `description` nodes. What makes it hard to parse the description is that there is no clear indication where the text of the article starts. – Ramnath Jul 02 '11 at 11:51
  • I wrote an updated version of your function, which depends on the "R Data science toolkit (RDSTK)." Parsing the html takes some time, but it works. Let me know what you think. – Zach Jul 06 '11 at 14:11
  • Hi @Ramnath, I get the following error when I run the above function. `getNews2("WMT", 30) Error: 1: StartTag: invalid element name 2: Extra content at the end of the document 4. stop(e) 3. (function (msg, ...) { if (length(grep("\\\n$", msg)) == 0) paste(msg, "\n", sep = "") ... 2. xmlTreeParse(url, useInternalNodes = T) 1. getNews2("WMT", 30)` – SoakingHummer Mar 17 '18 at 17:57
  • 1
    I believe the API for Google Finance has changed. – Ramnath Mar 18 '18 at 18:37
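The `as.xts` conversion Ramnath suggests in the comments can be sketched on a toy data frame standing in for `mydf` (necessary here, since the Google Finance feed no longer responds):

```r
# Sketch of the comment-thread suggestion: move pubDate into the xts
# index and keep the remaining columns as (character) coredata.
library(xts)

mydf <- data.frame(title   = c("story A", "story B"),
                   pubDate = as.POSIXct(c("2011-04-23 10:00:00",
                                          "2011-04-23 11:00:00"),
                                        tz = "GMT"),
                   stringsAsFactors = FALSE)

# drop = FALSE keeps a data frame even when one column remains
myxts <- as.xts(mydf[, names(mydf) != "pubDate", drop = FALSE],
                order.by = mydf$pubDate)
```

Note that an xts object is matrix-backed, so mixed columns are coerced to character, which is why the question's version parses fields like `description` before converting.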