I am trying to create a word cloud of SQL queries. It is standard practice to use underscores in table and column names, and I want to show them in the word cloud. However, my current code removes them even though I have explicitly told it not to remove punctuation.

File1.txt:

SQL_query_NEW
"SELECT 0 AS c1 , D1.c2 AS c2 , D1.c3 AS c3 , D1.c4 AS c4 , D1.c5 AS c5 , D1.c6 AS c6 , D1.c7 AS c7 , D1.c8 AS c8 , D1.c1 AS c9 FROM ( SELECT DISTINCT CASE WHEN T7267472.""PABC_DT"" > T7267432.""PEINSTL_DT"" THEN NULL ELSE T7267432.""XYZ_DT"" END AS c1 , T7267472.""ABC_DT"" AS c2 , T7267472.""SID"" AS c3 , T7267488.""CITY"" AS c4 , ( COALESCE( T7267563.""P_KEY"" , '' )  ) || '-' || ( COALESCE( T7267563.""PRD_LNG_DESC"" , '' )  ) AS c5 , T7267563.""P_KEY"" AS c6 , T7267589.""L6_DESC"" AS c7 , T7267589.""G_L3_DESC"" AS c8 FROM ""E_R_S"".""G_ADD_V"" T7267488 , ""E_R_S"".""S_G_AST_F_V"" T7267472 , ""E_R_S"".""G_G_E_S4_D1_V"" T7267589 , ""E_R_S"".""PD_MN_HR_D1_V"" T7267563 , ""E_R_S"".""S_G_AST_D_F_V"" T7267432 "

Code So Far:

library(RODBC)
library(tm)
library(SnowballC)
library(wordcloud)

qryTxt <- read.table("C://File1.txt",sep="\t", header=TRUE)
vectorSQL = qryTxt$SQL_query_NEW
SQLCorpus <- Corpus(VectorSource(vectorSQL))
tdm <- TermDocumentMatrix(SQLCorpus,control = list(verbose = FALSE,
                                                   asPlain = FALSE,
                                                   stopwords = FALSE,
                                                   tolower = TRUE,
                                                   removeNumbers = FALSE,
                                                   stemWords = FALSE,
                                                   removePunctuation = FALSE,
                                                   removeSeparators = FALSE,
                                                   stem = FALSE,
                                                   stripWhitespace = FALSE))

matrix <- as.matrix(tdm)
v <- sort(rowSums(matrix),decreasing = TRUE)
d <- data.frame(word= names(v),freq=v)

wordcloud(d$word,v, scale = c(5,1),max.words = 10, random.order = FALSE,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)

As you can see, removePunctuation is set to FALSE, yet underscores are still removed in the output:

d

             word freq
t7267472 t7267472    4
t7267563 t7267563    4
desc         desc    3
t7267432 t7267432    3
t7267589 t7267589    3
ast           ast    2
coalesce coalesce    2
from         from    2
key           key    2
select     select    2
Bhavesh Ghodasara
  • Please make a reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – emilliman5 Jun 28 '17 at 12:40
  • @emilliman5 I've made some changes. I can't share all the queries or the word cloud output, but you can see the data frame output. The word "ast" never appears on its own in the input; it is always joined to other text by underscores. Hope this helps. – Bhavesh Ghodasara Jun 28 '17 at 13:00
  • A reproducible example is one where I can copy and paste the code from your post and reproduce your problem. As it stands I cannot do that because there is no input data. – emilliman5 Jun 28 '17 at 14:04

1 Answer

I had exactly the same problem. The options in the control list do not fix it.

You will have to use VCorpus() instead of Corpus().

In your example, change SQLCorpus <- Corpus(VectorSource(vectorSQL)) to:

SQLCorpus <- VCorpus(VectorSource(vectorSQL))

Then underscores, dashes, and any other punctuation characters will show up. After that, you will have to apply the controls to get rid of the punctuation characters that you don't want.
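As a rough sketch of that last step, the snippet below builds a VCorpus and then blanks out every character except letters, digits, underscores, and whitespace before creating the term-document matrix. The one-line sample query and the keep_underscores helper are made up for illustration and are not taken from File1.txt. The likely reason Corpus() behaves differently is that recent tm versions return a SimpleCorpus for a VectorSource, which is tokenized through a different code path, but treat that as an assumption and check it against your installed version.

```r
library(tm)

# Hypothetical sample standing in for the SQL_query_NEW column of File1.txt
vectorSQL <- c('SELECT T1."ABC_DT" AS c2 FROM "E_R_S"."G_ADD_V" T1')

# VCorpus instead of Corpus, as suggested above
SQLCorpus <- VCorpus(VectorSource(vectorSQL))

# Hypothetical helper: replace everything except letters, digits,
# underscores and whitespace with a space, so identifiers such as
# ABC_DT stay in one piece while quotes and dots are dropped
keep_underscores <- content_transformer(function(x) {
  gsub("[^[:alnum:]_[:space:]]+", " ", x)
})
SQLCorpus <- tm_map(SQLCorpus, keep_underscores)

# Same idea as the original control list; removePunctuation stays FALSE
# because the unwanted punctuation has already been handled above
tdm <- TermDocumentMatrix(SQLCorpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = FALSE,
                                         removeNumbers = FALSE,
                                         stopwords = FALSE))

v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
print(d)
```

Note that TermDocumentMatrix drops terms shorter than three characters by default; add wordLengths = c(1, Inf) to the control list if short aliases such as c2 should appear as well.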

f0nzie