Digits being neglected while performing N-gram in R

Question

I want to count all character-level n-grams present in a text file. I wrote a short R script for this, but it ignores every digit in the text. Could anyone help me fix this?

Here is the code:

library(tau)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = TRUE, stringsAsFactors = FALSE)
r <- textcnt(temp, method = "ngram", n = 4L, decreasing = TRUE)
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b

Here are the contents of the input file:

abcd123
appl2345e
coun56ry
live123
names3423bsdf
coun56ryas

This is the output:

$`1`
  counts size
_     18    1
a      3    1
e      3    1
n      3    1
s      3    1
c      2    1
l      2    1
o      2    1
p      2    1
r      2    1
u      2    1
y      2    1
b      1    1
d      1    1
f      1    1
i      1    1
m      1    1
v      1    1

$`2`
   counts size
_c      2    2
_r      2    2
co      2    2
e_      2    2
n_      2    2
ou      2    2
ry      2    2
s_      2    2
un      2    2
_a      1    2
_b      1    2
_e      1    2
_l      1    2
_n      1    2
am      1    2
ap      1    2
as      1    2
bs      1    2
df      1    2
es      1    2
f_      1    2
iv      1    2
l_      1    2
li      1    2
me      1    2
na      1    2
pl      1    2
pp      1    2
sd      1    2
ve      1    2
y_      1    2
ya      1    2

$`3`
    counts size
_co      2    3
_ry      2    3
cou      2    3
oun      2    3
un_      2    3
_ap      1    3
_bs      1    3
_e_      1    3
_li      1    3
_na      1    3
ame      1    3
app      1    3
as_      1    3
bsd      1    3
df_      1    3
es_      1    3
ive      1    3
liv      1    3
mes      1    3
nam      1    3
pl_      1    3
ppl      1    3
ry_      1    3
rya      1    3
sdf      1    3
ve_      1    3
yas      1    3

$`4`
     counts size
_cou      2    4
coun      2    4
oun_      2    4
_app      1    4
_bsd      1    4
_liv      1    4
_nam      1    4
_ry_      1    4
_rya      1    4
ames      1    4
appl      1    4
bsdf      1    4
ive_      1    4
live      1    4
mes_      1    4
name      1    4
ppl_      1    4
ryas      1    4
sdf_      1    4
yas_      1    4

Could anyone tell me what I am missing or where I went wrong? Thanks in advance.

My guess is that the default value for `split` in `textcnt` includes digits, so numbers are being treated as delimiters. I've never used this package, so this is just a guess. — Carl Witthoft, Jul 19 '13 at 11:40
If you make your problem reproducible (see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), there will be more people able to help you. — Roman Luštrik, Jul 19 '13 at 11:44
@CarlWitthoft Awesome guess, that was the problem: `split` was treating digits as delimiters. Thanks a lot for the help. :) — Aravind Asok, Jul 19 '13 at 12:15
@CarlWitthoft Also, since you gave the correct solution, could you please add your comment as an answer so that I can accept it? — Aravind Asok, Jul 19 '13 at 12:19

score 1 · Accepted Answer · answered Jul 19 '13 at 12:37

The default value of the `split` argument in `textcnt` includes digits (the `[:digit:]` character class), so numbers are treated as delimiters. Remove that class from the pattern and things will work.
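A minimal sketch of that fix, under a few assumptions: the default pattern shown is `"[[:space:][:punct:][:digit:]]+"` as I recall it from the `tau` documentation (check `?textcnt` on your installation), the variable names `default_split` and `keep_digits` are made up for illustration, and the six input strings are typed in directly instead of being read from the CSV file. The base R `strsplit` calls only illustrate what each pattern does to a single token; the `textcnt` call runs only if `tau` is installed.

```r
# Sample input copied from the question, one string per element.
x <- c("abcd123", "appl2345e", "coun56ry", "live123",
       "names3423bsdf", "coun56ryas")

# Assumed default delimiter pattern of textcnt(): digits count as separators.
default_split <- "[[:space:][:punct:][:digit:]]+"
# Same pattern without [:digit:], so digits stay inside the tokens.
keep_digits <- "[[:space:][:punct:]]+"

# Base R illustration of the effect on one token.
strsplit("coun56ry", default_split)[[1]]  # "coun" "ry"
strsplit("coun56ry", keep_digits)[[1]]    # "coun56ry"

# The actual fix: pass the digit-free pattern to textcnt().
if (requireNamespace("tau", quietly = TRUE)) {
  r <- tau::textcnt(x, method = "ngram", n = 4L,
                    split = keep_digits, decreasing = TRUE)
  a <- data.frame(counts = unclass(r), size = nchar(names(r)))
  b <- split(a, a$size)
  print(b)
}
```

With the digit-free pattern, n-grams such as `n56r` should appear in the result instead of the token being cut into `coun` and `ry`. Note that this sketch feeds in all six lines, whereas the output in the question contains no n-grams from `abcd123`, probably because `header = TRUE` made `read.csv` treat that first line as the column name; `header = FALSE` would likely be needed there as well.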

Carl Witthoft
