
I am dealing with a database with around 5 lakh (500,000+) records. I want to count the words in the data. This is my code:

library(tm)
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,user="postgres",password="root", dbname="pharma",host="localhost",port=5432)

query<-"select data->'PubmedArticleSet'->'PubmedArticle'->'MedlineCitation'->'Article'->'Journal'->>'Title' from searchresult where id BETWEEN 1 AND (select max(id) from searchresult)"
der<-dbGetQuery(con,query)
der<- VectorSource(der)
der<- Corpus(der)
der<-tolower(der) 
wordlist<-strsplit(der, "\\W+", perl=TRUE)
wordvector<-unlist(wordlist)
freqlist<-table(wordvector)
sortedfreqlist<-sort(freqlist, decreasing=TRUE)
sortedtable<-paste(names(sortedfreqlist),sortedfreqlist, sep="\t")
cat("Word\tFrequency", sortedtable, file=choose.files(), sep="\n")

But the code hangs and stops at `wordlist<-strsplit(der, "\\W+", perl=TRUE)`. Can someone please help me with this? Is it because of the huge data?

Roman Luštrik
SHRUTAYU Kale
  • can you give a little sample of what `der` looks like? – Dominic Comtois Mar 12 '15 at 10:36
  • 3
    Please note that "lac" is used only in India. You might want to change that for the international audience. – Roland Mar 12 '15 at 10:50
  • this is what der looks like: "Gynecologic and obstetric investigation" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" "Canadian journal of physiology and pharmacology" "Social psychiatry and psychiatric epidemiology" "Gynecologic and obstetric investigation" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" "The lancet. Diabetes & endocrinology" the code runs perfectly for about 1.5lakhs records but hangs for 5lakhs – SHRUTAYU Kale Mar 12 '15 at 10:50
  • And how big is it? R pulls all data in memory so simply reading a big file will take a lot of time. Besides, if you are pulling the data from a database, why do you convert all rows to one big string then try to split it again? It's better to process each row separately or even write a SQL statement that returns exactly what you want. – Panagiotis Kanavos Mar 12 '15 at 10:51
  • @panagiotis the data is in json and has sentences and I need to count the occurrences of each word hence the split and join – SHRUTAYU Kale Mar 12 '15 at 10:54
  • Unless `lac` means million, this isn't a big data problem, it's a relatively small dataset. Million would make it an average-sized problem. If you are trying to count the number of words in database records, use the database's functions to split and count words. PostgreSQL even has regular expressions that could help – Panagiotis Kanavos Mar 12 '15 at 10:55
  • Your code shows a database query. This means that the data is in a database. Unless the database itself contains json instead of text, there's no need to use one big string. Even if it does, you can still process the records one at a time. The only thing that "forces" the split is that you call `VectorSource` and `Corpus` before counting frequencies. – Panagiotis Kanavos Mar 12 '15 at 10:59
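Following the suggestion in the comments, the counting can also be pushed into the database so only the aggregated frequencies cross the wire. This is a sketch only, assuming PostgreSQL 9.3+ (for the implicit lateral set-returning call) and the same table, connection settings, and JSON path as in the question:

```r
library(RPostgreSQL)

# Same connection settings as the question (assumption: they still apply).
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user = "postgres", password = "root",
                 dbname = "pharma", host = "localhost", port = 5432)

# regexp_split_to_table() splits each title on runs of non-word characters,
# so GROUP BY returns one row per distinct (lowercased) word with its count.
query <- "
  SELECT lower(word) AS word, count(*) AS freq
  FROM searchresult,
       regexp_split_to_table(
         data->'PubmedArticleSet'->'PubmedArticle'->'MedlineCitation'
             ->'Article'->'Journal'->>'Title',
         '\\W+') AS word
  WHERE word <> ''
  GROUP BY lower(word)
  ORDER BY freq DESC"
word_freqs <- dbGetQuery(con, query)
```

This way R never holds 500,000 raw titles in memory, only the final word/frequency table.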

1 Answer


Try replacing

wordlist<-strsplit(der, "\\W+", perl=TRUE)

with

word_vector = scan(text = as.character(der[1]),
    what = "character", quote = "", quiet = TRUE)

sorted_word_table = sort(table(word_vector))

There are a few funny things going on in your code (i.e. you make a Corpus and then call tolower() on the whole thing, which turns it back into a character vector), but this should get you going.

The first bit splits your text up into words. You might also want to remove punctuation first, using der = removePunctuation(der[1]). The second bit makes a table of the word frequencies.
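Put together on a small in-memory example (a standalone sketch; `titles` below is a toy stand-in for the question's `der`), the scan-and-table approach looks like this:

```r
# Toy stand-in for the journal titles pulled from the database.
titles <- c("The lancet. Diabetes & endocrinology",
            "Gynecologic and obstetric investigation",
            "The lancet. Diabetes & endocrinology")

text <- tolower(paste(titles, collapse = " "))
text <- gsub("[[:punct:]]+", " ", text)   # crude punctuation removal

# scan() splits on whitespace, giving one word per element.
word_vector <- scan(text = text, what = "character",
                    quote = "", quiet = TRUE)

sorted_word_table <- sort(table(word_vector), decreasing = TRUE)
```

Here "lancet" ends up with a count of 2, once for each matching title.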

If the second bit is slow, you could use the data.table package and the following function (based on this answer) instead of calling table():

t_dt <- function(x, key = TRUE){ 
    #creates a 1-d frequency table for x
    library(data.table)
    dt <- data.table(x)
    if(key) setkey(dt,x)    
    tab <- dt[, list(freq = .N), by = x] 
    out <- tab$freq
    names(out) <- tab$x
    out 
}

sorted_word_table = sort(t_dt(word_vector))
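As a quick sanity check (a standalone sketch, assuming data.table is installed; t_dt is repeated here so the snippet runs on its own), t_dt() should agree with table() on a small vector:

```r
library(data.table)

t_dt <- function(x, key = TRUE) {
  # 1-d frequency table for x, built with data.table's fast grouping
  dt <- data.table(x)
  if (key) setkey(dt, x)
  tab <- dt[, list(freq = .N), by = x]
  out <- tab$freq
  names(out) <- tab$x
  out
}

words <- c("diabetes", "lancet", "diabetes",
           "endocrinology", "lancet", "lancet")
freqs <- t_dt(words)
# freqs["lancet"] is 3, matching table(words)["lancet"]
```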
Tom Liptrot