
UPDATE

Here is what I have done so far.

library(tm)
library(NLP)
library(SnowballC)
# set directory
setwd("C:\\Users\\...\\Data pretest all TXT")

# create corpus with tm package
pretest <- Corpus(DirSource("\\Users\\...\\Data pretest all TXT"), readerControl = list(language = "en"))

pretest is a large SimpleCorpus with 36 elements. My folder contains 36 txt files.

# check what went in
summary(pretest)

# create TDM
pretest.tdm <- TermDocumentMatrix(pretest, control = list(stopwords = TRUE, 
tolower = TRUE, stemming = TRUE))

# convert corpus to data frame
dataframePT <- data.frame(text = unlist(sapply(pretest, `[`, "content")), 
stringsAsFactors = FALSE)

dataframePT has 36 observations, so I think everything is fine up to this point.

# load stringr library
library(stringr)

# define sentences
v = strsplit(dataframePT[,1], "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)

lapply(v, function(x) (stringr::str_count(x, "gain")))

My output looks like this:

...
[[35]]
[1] NA

[[36]]
[1] NA

The list has 36 elements, one per file, which is what I expect. But I don't know why it returns NA.

Thank you in advance for any suggestions.

vewees
  • Please don't post pictures of code. See [how to create a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). That will make it easier to help you. – MrFlick Oct 02 '17 at 15:17
  • Thank you for your answer. I updated my post. – vewees Oct 02 '17 at 15:38

2 Answers


I recommend using the filter function from the dplyr package together with grepl to search for a pattern inside a column:

library(dplyr)

pattern <- "word1|word2"

    df <- df %>%
      filter(grepl(pattern, column_name))

df is then limited to the rows matching that condition, so you can use nrow to count how many rows remain :)

Example:

a1<-1:10
a2<-11:20
(data<-data.frame(a1,a2,stringsAsFactors = F))
   a1 a2
1   1 11
2   2 12
3   3 13
4   4 14
5   5 15
6   6 16
7   7 17
8   8 18
9   9 19
10 10 20

(data <- data %>% filter(grepl("5|7", a2)))
  a1 a2
1  5 15
2  7 17

(nrow(data))
[1] 2
library(NLP) 
library(tm)
library(SnowballC)

Load data:

data("crude")
crude.tdm <- TermDocumentMatrix(crude, control = list(stopwords = TRUE, tolower = TRUE, stemming= TRUE))

First, convert the corpus to a data frame:

dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

You can also inspect the content of a single document with crude[[2]]$content.

Now we need to define a sentence. Here I treat a "." as the end of a sentence when it is directly preceded by 10 characters drawn from A-Z, a-z, space and ",". I split the documents on that "." using a look-behind:

z = strsplit(dataframe[,1], "(?<=[A-Za-z ,]{10})\\.", perl = T)
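To see what this rule does, here is a minimal sketch on a made-up string (the text and the names x and parts are illustrative only, not taken from the crude corpus). The "." in "15.8" is preceded by digits, which are outside the character class, so it should not be treated as a sentence end:

```r
# Toy input, invented for illustration only.
x <- "The quick brown fox jumps. It costs 15.8 mln bpd today, he said."

# Split on a "." that is preceded by 10 characters from [A-Za-z ,].
parts <- strsplit(x, "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)[[1]]

print(parts)
```

Abbreviations can still be split when enough letters and spaces precede the period, which is what happens to "David T." in the z[[2]] output.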

This is not strictly needed for the crude corpus, since most sentences there end with .\n, so one could instead do:

z = strsplit(dataframe[,1], "\\.\n", perl = TRUE)

I will stick with my previous definition of a sentence, since it should work for more than just the crude corpus. The definition is not perfect, so I am keen to hear your thoughts.

Let's check the output:

z[[2]]
 [1] "OPEC may be forced to meet before a\nscheduled June session to readdress its production cutting\nagreement if the organization wants to halt the current slide\nin oil prices, oil industry analysts said"                                                                  
 [2] "\n    \"The movement to higher oil prices was never to be as easy\nas OPEC thought"                                                                                                                                                                                         
 [3] " They may need an emergency meeting to sort out\nthe problems,\" said Daniel Yergin, director of Cambridge Energy\nResearch Associates, CERA"                                                                                                                               
 [4] "\n    Analysts and oil industry sources said the problem OPEC\nfaces is excess oil supply in world oil markets"                                                                                                                                                             
 [5] "\n    \"OPEC's problem is not a price problem but a production\nissue and must be addressed in that way,\" said Paul Mlotok, oil\nanalyst with Salomon Brothers Inc"                                                                                                        
 [6] "\n    He said the market's earlier optimism about OPEC and its\nability to keep production under control have given way to a\npessimistic outlook that the organization must address soon if\nit wishes to regain the initiative in oil prices"                             
 [7] "\n    But some other analysts were uncertain that even an\nemergency meeting would address the problem of OPEC production\nabove the 15.8 mln bpd quota set last December"                                                                                                  
 [8] "\n    \"OPEC has to learn that in a buyers market you cannot have\ndeemed quotas, fixed prices and set differentials,\" said the\nregional manager for one of the major oil companies who spoke\non condition that he not be named"                                         
 [9] " \"The market is now trying to\nteach them that lesson again,\" he added.\n    David T"                                                                                                                                                                                     
[10] " Mizrahi, editor of Mideast reports, expects OPEC\nto meet before June, although not immediately"                                                                                                                                                                           
[11] " However, he is\nnot optimistic that OPEC can address its principal problems"                                                                                                                                                                                               
[12] "\n    \"They will not meet now as they try to take advantage of the\nwinter demand to sell their oil, but in late March and April\nwhen demand slackens,\" Mizrahi said"                                                                                                    
[13] "\n    But Mizrahi said that OPEC is unlikely to do anything more\nthan reiterate its agreement to keep output at 15.8 mln bpd.\"\n    Analysts said that the next two months will be critical for\nOPEC's ability to hold together prices and output"                       
[14] "\n    \"OPEC must hold to its pact for the next six to eight weeks\nsince buyers will come back into the market then,\" said Dillard\nSpriggs of Petroleum Analysis Ltd in New York"                                                                                        
[15] "\n    But Bijan Moussavar-Rahmani of Harvard University's Energy\nand Environment Policy Center said that the demand for OPEC oil\nhas been rising through the first quarter and this may have\nprompted excesses in its production"                                        
[16] "\n    \"Demand for their (OPEC) oil is clearly above 15.8 mln bpd\nand is probably closer to 17 mln bpd or higher now so what we\nare seeing characterized as cheating is OPEC meeting this\ndemand through current production,\" he told Reuters in a\ntelephone interview"
[17] "\n Reuter" 

and the original:

cat(crude[[2]]$content)
OPEC may be forced to meet before a
scheduled June session to readdress its production cutting
agreement if the organization wants to halt the current slide
in oil prices, oil industry analysts said.
    "The movement to higher oil prices was never to be as easy
as OPEC thought. They may need an emergency meeting to sort out
the problems," said Daniel Yergin, director of Cambridge Energy
Research Associates, CERA.
    Analysts and oil industry sources said the problem OPEC
faces is excess oil supply in world oil markets.
    "OPEC's problem is not a price problem but a production
issue and must be addressed in that way," said Paul Mlotok, oil
analyst with Salomon Brothers Inc.
    He said the market's earlier optimism about OPEC and its
ability to keep production under control have given way to a
pessimistic outlook that the organization must address soon if
it wishes to regain the initiative in oil prices.
    But some other analysts were uncertain that even an
emergency meeting would address the problem of OPEC production
above the 15.8 mln bpd quota set last December.
    "OPEC has to learn that in a buyers market you cannot have
deemed quotas, fixed prices and set differentials," said the
regional manager for one of the major oil companies who spoke
on condition that he not be named. "The market is now trying to
teach them that lesson again," he added.
    David T. Mizrahi, editor of Mideast reports, expects OPEC
to meet before June, although not immediately. However, he is
not optimistic that OPEC can address its principal problems.
    "They will not meet now as they try to take advantage of the
winter demand to sell their oil, but in late March and April
when demand slackens," Mizrahi said.
    But Mizrahi said that OPEC is unlikely to do anything more
than reiterate its agreement to keep output at 15.8 mln bpd."
    Analysts said that the next two months will be critical for
OPEC's ability to hold together prices and output.
    "OPEC must hold to its pact for the next six to eight weeks
since buyers will come back into the market then," said Dillard
Spriggs of Petroleum Analysis Ltd in New York.
    But Bijan Moussavar-Rahmani of Harvard University's Energy
and Environment Policy Center said that the demand for OPEC oil
has been rising through the first quarter and this may have
prompted excesses in its production.
    "Demand for their (OPEC) oil is clearly above 15.8 mln bpd
and is probably closer to 17 mln bpd or higher now so what we
are seeing characterized as cheating is OPEC meeting this
demand through current production," he told Reuters in a
telephone interview.
 Reuter

You can clean it up a bit if you wish by removing the leftover \n characters and indentation, but that is not needed for your request.
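If you do want to tidy the pieces, a minimal base R sketch could look like this. The helper name clean_sentences is hypothetical, and the sample vector only imitates the shape of the z[[2]] output (shortened for illustration):

```r
# Hypothetical helper: collapse line breaks and indentation, then trim the ends.
clean_sentences <- function(x) {
  x <- gsub("\\s+", " ", x)  # any run of whitespace (including \n) becomes one space
  trimws(x)                  # drop leading and trailing spaces
}

# Sample pieces shaped like the split output, shortened for illustration.
pieces <- c("OPEC may be forced to meet before a\nscheduled June session",
            "\n    Analysts and oil industry sources said")

cleaned <- clean_sentences(pieces)
print(cleaned)
```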

Now you can do all sorts of things, for example find which sentences contain the word "gain":

lapply(z, function(x) (grepl("gain", x)))

or count how often the word "gain" occurs in each sentence:

lapply(z, function(x) (stringr::str_count(x, "gain")))
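Note that the pattern "gain" also matches inside longer words such as "regain", which occurs in crude[[2]]. If only the whole word should count, one option is a word-boundary pattern. The sketch below uses base R only and also sums the matching sentences per document; toy_z is made-up data standing in for z:

```r
# Made-up stand-in for z: one character vector of sentences per document.
toy_z <- list(
  c("Prices may gain further", "OPEC wishes to regain the initiative"),
  c("No keyword here", "A gain today, another gain tomorrow")
)

# Plain substring match: "regain" is counted as a hit too.
plain_hits <- sapply(toy_z, function(x) sum(grepl("gain", x)))

# \\b marks a word boundary, so "regain" is no longer counted.
whole_word_hits <- sapply(toy_z, function(x) sum(grepl("\\bgain\\b", x, perl = TRUE)))

print(plain_hits)
print(whole_word_hits)
```

Each result has one entry per document, giving the number of sentences that contain the keyword.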
missuse
  • Thank you very much for your answer. It looks promising! I will try it out. – vewees Oct 03 '17 at 07:10
  • It works when I replicate the crude example. If I apply the code to my data, the output of the last command returns NA for each document. – vewees Oct 03 '17 at 08:58
  • In that case the problem is not in the solution but in the fact that the example data posted does not resemble the data in question. Perhaps if you uploaded one of the PDFs and showed how you build the corpus I might be able to help. – missuse Oct 03 '17 at 09:21
  • I posted my code. Maybe there is a problem with my corpus. Or with the lapply command. Thank you for your help! – vewees Oct 03 '17 at 09:59
  • I solved the problem. My data is the problem actually. I converted pdf files into txt files. The txt files are a bit 'messy'. They have empty lines etc. In Notepad++, I did 'remove empty lines' and 'join lines'. The whole text is in one line now in Notepad++. With this, it works. I have to manually check now if R returns the correct number of sentences containing my keyword. Let's hope so! Thank you very much for your help:) – vewees Oct 03 '17 at 16:13
  • post an example of the .txt files and I will try to show you how to clean it in R. – missuse Oct 03 '17 at 16:26
  • @vewees I am sorry I am not keen on giving my e-mail just so I can download the file. If you can upload on dropbox or similar I would take a look. – missuse Oct 04 '17 at 07:00
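
The Notepad++ steps described in the comments (remove empty lines, join lines) can also be approximated in R before splitting into sentences. This is only a sketch under assumptions: the file is written to a temporary location with invented content, and the helper name read_joined is hypothetical:

```r
# Hypothetical helper: read a text file and join its non-empty lines.
read_joined <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]   # remove empty lines
  paste(lines, collapse = " ")    # join the remaining lines into one string
}

# Invented sample file imitating a messy PDF-to-text conversion.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Profits are expected to gain", "", "   over the next year.",
             "", "Analysts did not comment further."), tmp)

txt <- read_joined(tmp)

# Same sentence rule as in the answer above.
sentences <- strsplit(txt, "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)[[1]]

# Number of sentences containing the keyword.
n_hits <- sum(grepl("gain", sentences))
print(n_hits)
```

For a whole folder, the same helper could be applied to each path returned by list.files, which keeps the cleanup reproducible instead of manual.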