E-Mail Text Parsing and Extracting via Regular Expression in R

Question

After over a year of struggling to no avail, I'm turning to the SO community for help. I've used various regex-builder sites, standalone regex-builder software, and manual editing, all in a futile attempt to create a pattern that parses and extracts dynamic data from the e-mail samples below (sanitized to protect the innocent):

Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself. ...

Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75. ...

***Action to Take*** Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51. ...

Both forms of the "Action to Take" section need to be parsed, and the extracted data must include the direction (buy or sell, though I'm only concerned with buys here), the ticker, the limit price (if applicable), and the stop value as either a percentage or a number (if applicable). Sometimes a single e-mail also contains multiple "Action to Take" sections.
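
One way to avoid a single monolithic pattern is to split the text on the "Action to Take" header and then pull each field out with its own small regex. The following is a minimal sketch under that assumption, not a tested solution for real e-mails. The helper names `first_group`, `extract_action`, and `parse_actions` are hypothetical, and the phrasings it looks for ("at $X or lower", "N% trailing stop", "stop at $X") are taken only from the three samples above.

```r
# Hypothetical helper: return the first capture group of `pattern` in `x`,
# or NA when the pattern does not match.
first_group <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

# Extract the fields from the text that follows one "Action to Take" header.
# Assumption: the phrasings match those in the sample e-mails.
extract_action <- function(section) {
  stop_pct   <- first_group(section, "(?i)(\\d+(?:\\.\\d+)?%)\\s+trailing\\s+stop")
  stop_price <- first_group(section, "(?i)\\bstop\\s+at\\s+\\$(\\d+(?:\\.\\d+)?)")
  data.frame(
    direction = first_group(section, "(?i)\\b(buy|sell)\\b"),
    ticker    = first_group(section, "\\((?:NYSE|Nasdaq):\\s*([A-Z.]+)\\)"),
    limit     = first_group(section, "(?i)\\bat\\s+\\$(\\d+(?:\\.\\d+)?)\\s+or\\s+lower"),
    stop      = if (!is.na(stop_pct)) stop_pct else stop_price,
    stringsAsFactors = FALSE
  )
}

# Split on the header (dropping everything before the first one) so that
# several "Action to Take" sections in one e-mail each yield a row.
parse_actions <- function(text) {
  sections <- strsplit(text, "Action to Take", fixed = TRUE)[[1]][-1]
  do.call(rbind, lapply(sections, extract_action))
}

sample_text <- paste(
  "Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself.",
  "Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75.",
  "***Action to Take*** Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51."
)
actions <- parse_actions(sample_text)
print(actions)
```

Each section runs to the next header or to the end of the text, so in a full e-mail a later, unrelated "stop at $..." could be picked up. Truncating each section to its first few sentences would be one way to limit that.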

Here are examples of what the pattern should not match (or should ideally be flexible enough to deal with):

Action to Take: Sell half of your Apple (NYSE: AAPL) April $46 calls for $15.25 or higher. If the spread between the bid and the ask is $0.20 or more, place your order between the bid and the ask - even if the bid is higher than $15.25.

Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15.

Action to Take: Sell one-quarter of your Facebook (Nasdaq: FB) position at market. ...
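
A cheap pre-filter for cases like these is to look only at the first word of the instruction and keep it when that word is "Buy" and an exchange/ticker pair follows. This is a sketch, assuming every instruction begins immediately after the header as in the samples; `is_share_buy` is a hypothetical name. It would not by itself reject a buy of options, which none of the samples show.

```r
# Hypothetical predicate: TRUE when the instruction that follows the
# "Action to Take" header starts with "Buy" and names an exchange/ticker pair.
is_share_buy <- function(action) {
  # Drop the header plus any surrounding punctuation (":" or "***").
  body <- sub("^\\W*Action to Take\\W*", "", action, perl = TRUE)
  grepl("^Buy\\b", body, perl = TRUE) &
    grepl("\\((?:NYSE|Nasdaq):\\s*[A-Z.]+\\)", body, perl = TRUE)
}

samples <- c(
  "Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself.",
  "Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. Place a stop at $35.75.",
  "***Action to Take*** Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51.",
  "Action to Take: Sell half of your Apple (NYSE: AAPL) April $46 calls for $15.25 or higher.",
  "Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15.",
  "Action to Take: Sell one-quarter of your Facebook (Nasdaq: FB) position at market."
)
keep <- is_share_buy(samples)
print(keep)
```

Anchoring on the first word is deliberate: a pattern that searches for "Buy" or "Sell" anywhere after the header can latch onto a verb much later in the e-mail.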

Here's my R code with the latest Perl-style pattern (so that I can use lookaround in R). It sort of works, but not consistently and not across multiple saved e-mails:

library(httr)
library("stringr")
filenames <- list.files("R:/TBIRD", pattern="*.eml", full.names=TRUE)

parse <- function(input)
{
  text <- readLines(input, warn = FALSE)
  text <- paste(text, collapse = "")
  trim <- regmatches(text, regexpr("Content-Type: text/plain.*Content-Type: text/html", text, perl=TRUE))

  pattern <- "(?is-)(?<=Action to Take).*(?i-s)(Buy|Sell).*(?:\\((?:NYSE|Nasdaq)\\:\\s(\\w+)\\)).*(?:for|at)\\s(\\$\\d*\\.\\d* or|market)\\s"

  df <- str_match(text,pattern)
  return(df)
}

list <- lapply(filenames, function(x){ parse(x) })
table <- do.call(rbind,list)
table <- data.frame(table)
table <- table[rowSums(is.na(table)) < 1, ]
table <- subset(table, select=c("X2","X3","X4"))

The parsing has to operate on the plain-text copy because the HTML copy appears far too complicated to parse, given the lack of standardization from e-mail to e-mail. Unfortunately, the text copy also commonly has line endings that differ from what the regex expects, which greatly aggravates things.
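
One way to take line endings out of the picture is to normalize them before any matching. In the code above, `paste(text, collapse = "")` joins lines with no separator, so words on either side of a line break run together. The sketch below instead treats every line break as a single space. `normalize_text` is a hypothetical helper, and it assumes the only damage is mixed CRLF/CR/LF endings and hard-wrapped lines, not any further transfer encoding.

```r
# Hypothetical helper: turn a vector of lines (as returned by readLines)
# into one string in which every line break has become a single space.
normalize_text <- function(lines) {
  text <- paste(lines, collapse = "\n")
  text <- gsub("\\r\\n?", "\n", text, perl = TRUE)          # CRLF or lone CR -> LF
  text <- gsub("[ \\t]*\\n[ \\t]*", " ", text, perl = TRUE) # line break -> one space
  text <- gsub(" {2,}", " ", text)                          # squeeze repeated spaces
  trimws(text)
}

# Lines wrapped mid-sentence, some with a stray carriage return left over.
wrapped <- c(
  "Action to Take: Buy Google (Nasdaq: GOOG) at\r",
  "$42.34 or lower.\r",
  "Place a stop",
  "at $35.75."
)
clean <- normalize_text(wrapped)
print(clean)
```

After this step a pattern can use a plain space or `\\s+` between words without caring where the original message happened to wrap.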

Have you seen this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - IRTFM, Dec 22 '16 at 19:33
Yes, which explains why you can't use regex on HTML, and I'm not attempting to do so here for all of those [funny] reasons. It's my working belief that the core problem is with the regex pattern. To say regex is cryptic and obtuse is an understatement! I'm open to approaches other than regex to get the job done. - MachineGhost, Dec 23 '16 at 19:39
