
I have been working on some text scraping/analysis. One thing I did was pull out the top words from documents to compare and learn about different metrics. This was fast and easy. An issue arose around defining which separators to use, though, and pulling out individual words rather than phrases removed information from the analysis. For example, ".Net Developer" becomes "net" and "developer" after the transformation. I already had a list of set phrases/words from an old project someone else gave up on. The next step was pulling specific keywords out of multiple rows across multiple documents.

I have been looking into several techniques, including vectorization, parallel processing, and using C++ code within R. Moving forward I will experiment with all of these and try to speed up my process, as well as pick up these tools for future projects. In the meantime (without experimentation) I'm wondering which adjustments are obviously going to significantly decrease the time taken, e.g. moving parts of the code outside the loop, using better packages, etc. I also have a progress bar, but I can remove it if it's slowing down my loop significantly.

Here is my code:

library(stringi)   # stri_extract()
library(pbapply)   # pblapply() with its own progress bar

words <- read.csv("keyphrases.csv")
df <- data.frame(x = list.files("sec/new/"))
total <- length(df$x)
pb <- txtProgressBar(title = "Progress Bar", min = 0, max = total, width = 300, style = 3)

for (i in df$x) {
    s <- read.csv(paste0("sec/new/", i))
    # for each keyword/phrase, pull its matches out of column 3 and keep the reference from column 2
    u <- do.call(rbind, pblapply(words$words, function(x) {
        t <- data.frame(ref = s[, 2], words = stri_extract(s[, 3], coll = x))
        na.omit(t)
    }))
    write.csv(u, paste0("sec/new_results/new/", i), row.names = FALSE)
    setTxtProgressBar(pb, which(df$x == i),
                      title = paste(round(which(df$x == i) / total * 100, 2), "% done"))
}

So words has 60,000 rows of words/short phrases, no more than 30 characters each. There are around 4,000 files, each with between 100 and 5,000 rows, and each row has between 1 and 5,000 characters. Any random characters/strings can be used if my question needs to be reproducible.

I only used lapply because combining it with rbind and do.call worked really well; having a loop within a loop may be slowing down the process significantly too.

So off the bat there are some things I can do, right? Swapping data.frame for data.table, or using vectors instead. Doing the reading and writing outside the loop somehow? Perhaps writing it so that one of the loops isn't nested?

Thanks in advance

EDIT

The key element that needs speeding up is the extraction. Whether I use lapply above or cut it down to:

for (x in words$words) { t <- data.table(words = stri_extract(s[, 3], coll = x)) }

this still takes by far the most time. skills and t are data.tables in this case.

EDIT2

Attempting to create reproducible data:

set.seed(42)
words <- data.frame(words = rnorm(60000))
words$words <- as.character(words$words)

set.seed(42)
file1 <- data.frame(x = rnorm(5000))
file1$x <- as.character(file1$x)

pblapply(words$words, function(x) {
    data.frame(words = stri_extract(file1$x, coll = x))
})
Oli
  • It would be helpful if you would actually provide some toy data to be able to execute your code, see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – majom Aug 05 '16 at 12:47
  • I can't provide my data; I put the sizes of the files in my question, and someone who can answer this question will be far better at creating random strings/files than me. Maybe it's as easy as making random vectors of numbers; I've not attempted it before. – Oli Aug 05 '16 at 12:49
  • I was talking about preparing TOY DATA. Experience shows that people are more likely to jump on a question if they can run your code first - without the need to simulate the data on their own. – majom Aug 05 '16 at 12:52
  • Attempted this in the second edit. Wouldn't know how to give people folders/files, though. Just a basic example with one file being read. – Oli Aug 05 '16 at 13:39
  • @OliPaul please use `set.seed` when generating data via such functions as `rnorm` for example. – Sotos Aug 05 '16 at 13:41
  • The values aren't relevant though? – Oli Aug 05 '16 at 13:41
  • This is so we all get the same ones so we can potentially compare answers if needed – Sotos Aug 05 '16 at 13:43
  • To me the folder thing overly complicates your question; why don't you assume, for the Stack Overflow example, that every file is an element in a list? That way you can provide toy code that everyone can run and optimize, and it is trivial for you to change the code afterwards to include the read/write.csv statements (once you have an optimized version of the really important part of your code). – majom Aug 05 '16 at 13:44
  • That's why edit 2 is only using one file. The writing and reading part of the loop is insignificant, to be honest. Really I need to find a better way of scanning through one data frame looking for strings from another. stri_extract works but is slow. Each file with 5000 rows can take up to 15 hours! – Oli Aug 05 '16 at 13:50

2 Answers


First things first. Yes, I would definitely switch from data.frame to data.table. Not only is it faster and easier to use, but when you start merging data sets, data.table will do reasonable things where data.frame will give you unexpected and unintended results.
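For instance, here is a minimal sketch of what the read/write side of your loop might look like after the swap (fread() and fwrite() are data.table's fast CSV functions, fwrite() needs a recent data.table, and the extraction step is left as a placeholder):

library(data.table)

files <- list.files("sec/new/")
for (f in files) {
  s <- fread(file.path("sec/new", f))              # a data.table instead of a data.frame
  # ... keyword extraction on s goes here ...
  fwrite(s, file.path("sec/new_results/new", f))   # fast CSV writer
}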

Secondly, is there an advantage to using R to take care of your separators? You mentioned a number of different techniques you are considering using. If separators are just noise for the purposes of your analysis, why not split the work into two tools and use a tool that is much better at handling separators and continuation lines and so on? For me, Python is a natural choice to do things like parsing a bunch of text into keywords, including stripping off separators and other "noise" words you do not care about in your analysis. Feed the results of the Python parsing into R, and use R for its strengths.

There are a few different ways to get the output of Python into R. I would suggest starting off with something simple: CSV files. They are what you are starting with, they are easy to read and write in Python and easy to read in R. Later you can deal with a direct pipe between Python and R, but it does not give you much advantage until you have a working prototype and it is a lot more work at first. Make Python read in your raw data and turn out a CSV file that R can drop straight into a data.table without further processing.
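A minimal sketch of that handoff, assuming Python has written a tidy CSV called parsed_keywords.csv with a keyword column (both names are just placeholders for illustration):

library(data.table)

keywords <- fread("parsed_keywords.csv")   # hypothetical output of the Python parsing step
setkey(keywords, keyword)                  # assumes a 'keyword' column; keyed and ready to merge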

As for stri_extract, it is really not the tool you need this time. You certainly can match on a bunch of different words, but it is not really what it is optimized for. I agree with @Chris that using merge() on data.tables is a much more efficient, and faster, way to search for a number of key words.

  • Can't agree - `stringi` will be no worse than base Python for text cleaning/processing. When you have properly tokenized text, feed it to the text2vec package - it will easily do the rest of the job. – Dmitriy Selivanov Aug 10 '16 at 10:57
  • Well, isn't the Rcpp package used to run C++ within R? Just not sure if my code can be translated into C++; never used it. – Oli Aug 12 '16 at 10:57

Single Word Version

When you have single words in each lookup, this is easily accomplished with merging:

library(data.table)

#Word List
set.seed(42)
WordList <- data.table(ID = 1:60000, words = sapply(1:60000, function(x) paste(sample(letters, 5), collapse = '')))

#A list of dictionaries
set.seed(42)
Dicts <- list(
  Dict1 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  }),
  Dict2 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  }),
  Dict3 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  })
)

#Create Dictionary Data.table and add Identifier
Dicts <- rbindlist(lapply(Dicts, function(x){data.table(ref = x)}), use.names = T, idcol = T)

# set key for joining
setkey(WordList, "words")
setkey(Dicts, "ref")

Now we have a data.table with all dictionary words, and a data.table with all words in our word list. Now we can just merge:

merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
       words    ID   .id
    1: abcli 30174 Dict3
    2: abcrg 26210 Dict2
    3: abcsj  8487 Dict1
    4: abczg 24311 Dict2
    5: abdgl  1326 Dict1
   ---                  
60260: zyxeb 52194    NA
60261: zyxfg 57359    NA
60262: zyxjw 19337 Dict2
60263: zyxoq  5771 Dict1
60264: zyxqa 24544 Dict2

So we can see abcli appears in Dict3, while zyxeb does not appear in any of the dictionaries. There appear to be 264 duplicates (words that appear in more than one dictionary), as the resultant data.table is larger than our word list (60264 > 60000). This is shown as follows:

merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)[words == "ahlpk"]
   words    ID   .id
1: ahlpk  7344 Dict1
2: ahlpk  7344 Dict2
3: ahlpk 28487 Dict1
4: ahlpk 28487 Dict2

We also see here that duplicated words in our word list are going to create multiple resultant rows.
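If you want to see which words those are, one way is to count repeats on either side of the merge:

# words that appear in more than one dictionary
Dicts[, .N, by = ref][N > 1]

# duplicated words in the word list itself
WordList[, .N, by = words][N > 1]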

This is very, very quick to run.
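If you want to confirm that on the toy data, a rough check with base R's system.time (timings will vary by machine):

# time the keyed merge of 60,000 words against 45,000 dictionary entries
system.time(
  res <- merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
)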

Phrases + Sentences

In the case where you are searching for phrases within sentences, you will need to perform a string match instead. However, you will still need to make n(Phrases) * n(Sentences) comparisons, which will quickly hit memory limits in most R data structures. Fortunately, this is an embarrassingly parallel operation:

Same setup:

library(data.table)
library(foreach)
library(doParallel)


# Sentence List
set.seed(42)
Sentences <- data.table(ID = 1:60000, Sentence = sapply(1:60000, function(x) paste(sample(letters, 10), collapse = '')))

# A list of phrases
set.seed(42)
Phrases <- list(
  Phrases1 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  }),
  Phrases2 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  }),
  Phrases3 = sapply(1:15000, function(x) {
    paste(sample(letters, 5), collapse = '')
  })
)

# Create Dictionary Data.table and add Identifier
Phrases <- rbindlist(lapply(Phrases, function(x){data.table(Phrase = x)}), use.names = T, idcol = T)

# Full Outer Join
Sentences[, JA := 1]
Phrases[, JA := 1]

# set key for joining
setkey(Sentences, "JA")
setkey(Phrases, "JA")

We now want to break up our Phrases table into manageable batches

cl<-makeCluster(4)
registerDoParallel(cl)

nPhrases <- as.numeric(nrow(Phrases))
nSentences <- as.numeric(nrow(Sentences))

batch_size <- ceiling(nPhrases*nSentences / 2^30) #Max data.table allocation is 2^31. Lower this if you are hitting memory allocation limits
seq_s <- seq(1,nrow(Phrases), by = floor(nrow(Phrases)/batch_size))
ln_s <- length(seq_s)
if(ln_s > 1){
  str_seq <- paste0(seq_s,":",c(seq_s[2:ln_s],nrow(Phrases) + 1) - 1)
} else {
  str_seq <- paste0(seq_s,":",nrow(Phrases))
}
  

We are now ready to send our job out. The grepl line below is doing the work - testing which phrases match each sentence. We then filter out any non-matches.

ls <- foreach(i = 1:ln_s) %dopar% {

  library(data.table)
  # expand batch i of phrases against all sentences, then keep only the rows where the phrase matches
  TEMP_DT <- merge(Sentences, Phrases[eval(parse(text = str_seq[i]))], by = "JA", allow.cartesian = T)
  TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase, Sentence), by = .(Phrase, Sentence)][match_test == 1]
  return(TEMP_DT)

}

stopCluster(cl)


DT_OUT <- unique(do.call(rbind,ls))

DT_OUT now summarizes the sentences that match, along with the Phrase and the phrase list (.id) it was found in.
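For example, one way to count the matches, using the columns created above (.id is the phrase-list identifier added by rbindlist):

DT_OUT[, .N, by = .id]                # matches per phrase list
DT_OUT[, .N, by = .(ID, Sentence)]    # phrases matched per sentence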

This will still take some time (as there is a lot of processing necessary), but nowhere near a year.

Chris
  • The main issue is that @oil-paul doesn't know the exact method for splitting the text into words/phrases. – Dmitriy Selivanov Aug 06 '16 at 07:25
  • One of these lists is short phrases and the other can be a sentence or even paragraphs. This method only works if the strings match, not if one is contained in the other. – Oli Aug 08 '16 at 07:08
  • @OliPaul A few questions: does only one word need to match across the two lists, or does the phrase in list 1 need to be fully contained within the sentence in list 2? In my example, is list 1 (the one with phrases) `WordList` or `Dict 1-3`? – Chris Aug 08 '16 at 12:57
  • For example I would want the phrase "hard working" to be pulled (or counted) from a row with the sentence "Are you hard working?" – Oli Aug 08 '16 at 13:02
  • I will test this! Quick mention of something I may have missed out: when I say sentence, I mean it's free space that others have written into. There may be new lines, tabs, spaces and all types of special characters. From the work I did cleaning the data and pulling out the top words, I can split the data such that it's always a line per row - or will your technique work on any block of available text? – Oli Aug 09 '16 at 09:02
  • @OliPaul It should work for any contiguous string - so if you want to match a "sentence" that has a line break in it, the phrase needs to have a line break as well. If this is not what you want, you need to remove those characters from your text string prior to running the code – Chris Aug 09 '16 at 13:47
  • Regex error, as the phrase list contains at least one string that's read as regex and not as a plain character string. First error: Error in { : task 1 failed - "invalid regular expression 'c++ ', reason 'Invalid use of repetition operators'" – Oli Aug 10 '16 at 14:00
  • @OliPaul look at `?grepl` - you probably want `fixed = TRUE` and `ignore.case = TRUE` - so would be `grepl(Phrase,Sentence, fixed = TRUE, ignore.case = TRUE)` – Chris Aug 10 '16 at 14:05
  • Error in summary.connection(connection) : invalid connection – Oli Aug 10 '16 at 14:26
  • Although that seems like a cluster issue. It does run that way when using these two lines without ls: TEMP_DT <- merge(Sentences,Phrases[eval(parse(text = str_seq[1]))], by = "JA", allow.cartesian = T) + TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase,Sentence, fixed=TRUE, ignore.case=TRUE), by = .(Phrase,Sentence)][match_test == 1] – Oli Aug 10 '16 at 14:29
  • @OliPaul yep, that is a separate issue. Make sure you run `stopCluster(cl)` and then `cl<-makeCluster(4)` `registerDoParallel(cl)` before re-running the parallel code – Chris Aug 10 '16 at 14:41
  • I restarted just using your sample data. Error in { : task 1 failed - "cannot allocate vector of size 3.4 Gb". However, I noticed you were using larger sizes than I had, so I did it again using Phrases as a list of 60000 but Sentences at only 5000. However, I got "cannot allocate vector of size 1.1 Gb". – Oli Aug 10 '16 at 14:45
  • Strange because memory.size(max=NA) is 16272 – Oli Aug 10 '16 at 14:50
  • @OliPaul because this is parallelized, you need enough memory for each of your workers. You can change the `batch_size <- ceiling(nPhrases*nSentences / 2^30)` line (alter the 30 to 29, 28 etc) if you are running into memory issues – Chris Aug 10 '16 at 14:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120670/discussion-between-oli-paul-and-chris). – Oli Aug 11 '16 at 08:10