How can I look for specific sentences inside a text in R?

Question

I have a dataset full of comments from people offering themselves for jobs. I want to retrieve from each comment some very specific sentences that I have in a .txt file. So far I haven't managed to do it properly.

score.sentiment <- function(sentences, pos.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words){
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    pos.matches <- match(words, pos.words)
    score <- pos.matches
    return(score)
  }, pos.words, .progress=.progress)
  scores.df <- data.frame(text=sentences)
  return(scores.df)
}
results <- score.sentiment(sentences = serv$service_description, pos.words)

The text file is called pos.words and it contains sentences in Spanish such as:

 tengo 25 años
 tengo 47 años
 tengo 34 años

The other file contains a variable called services which contains a comment per person explaining their abilities, their education and so on. And what I'd like to do is to get their age from the text they have written.

Example from services file:

"Me llamo Adrián y tengo 24 años. He estudiado Data Science y me gusta trabajar en el sector tecnológico"

So from this sample I'd like to get my age. My idea so far has been to create a pos.words.txt with all the possible sentences in Spanish stating the age and match them against the comments file.

The main problems that have arisen so far are that I can't write a correct function to do it, and I don't know how to make R identify whole sentences from pos.words.txt, because for the moment it treats every single word as a separate string. In addition to this, the function I have posted here doesn't work (thug life...)

I'd really appreciate some help to tackle this issue!!

Thank you very much for your help!!

Adrian

It would be helpful if you could provide some reproducible examples of what your input txt file and the txt files you're searching through look like once they are imported into R. — AOGSTA, Apr 27 '16 at 00:32
Read this to help guide your reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example -- it also helps if you have your code consistently formatted. — Brandon Loudermilk, Apr 27 '16 at 02:19

score 1 · Answer 1 · answered Apr 27 '16 at 04:38

This splits into sentences (one per line here) and captures the digits from the last instance of "tengo <digits> años" in each one:

inp <- "blah blah blah tengo 25 años more blah.
  Even more blha then tengo 47 años.
  Me llamo Adrián y tengo 34 años."
rl <- readLines(textConnection(inp))  # might need to split at periods
# Then use a capture group to get the digits flanked by "tengo" and "años"
gsub("^.+tengo[ ](\\d+)[ ]años.+$", "\\1", rl)
# [1] "25" "47" "34"
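One thing to watch with the gsub() approach: when a line has no "tengo ... años" phrase, gsub() returns that line unchanged instead of a missing value. A minimal sketch of an alternative that returns NA for such comments is below. It uses base R's regexec() and regmatches(); the function name extract_age and the sample comments are made up for illustration (the first comment is adapted from the question), and it takes the first match in each comment rather than the last.

```r
# Hypothetical helper, not part of the original answer: pull out the number
# between "tengo" and "años" in each comment, or NA when the phrase is absent.
extract_age <- function(comments) {
  pattern <- "tengo[[:space:]]+([0-9]+)[[:space:]]+años"
  # regexec() records the position of the full match and of each capture group
  matches <- regmatches(comments, regexec(pattern, comments))
  # Each element is c(full_match, digits), or character(0) when nothing matched
  ages <- vapply(
    matches,
    function(m) if (length(m) == 2) m[2] else NA_character_,
    character(1)
  )
  as.integer(ages)
}

# Illustrative input (assumed data, not the asker's real file)
comments <- c(
  "Me llamo Adrián y tengo 24 años. He estudiado Data Science.",
  "Busco trabajo en el sector tecnológico.",
  "Soy Marta, tengo 31 años y soy ingeniera."
)

extract_age(comments)
```

Applied to the question's data, this would be called on the comment column, for example extract_age(serv$service_description), assuming that column is a character vector. It also removes the need for a pos.words.txt listing every possible age.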
