I am trying to split a word into bi-grams. I am using the qlcMatrix package, but it only returns distinct bi-grams. For example, for the word "detected", it returns "te" only once, even though it occurs twice. This is the command I use:

test_domain <- c("detected")
library("qlcMatrix", lib.loc="~/R/win-library/3.2")
bigram1 <- splitStrings(test_domain, sep = "", bigrams = TRUE, left.boundary = "", right.boundary = "")$bigrams

and this is the result I get:

bigram1
# [1] "ec" "ed" "de" "te" "ct" "et"
smci
Sotos
  • `$bigrams` returns "A vector with all unique bigrams", so it's normal that there are no duplicates – etienne Dec 04 '15 at 08:11
  • Indeed. So I guess this package won't do the trick...? I am trying not to use packages in general, but I've been stack (pun intended) for a while now. – Sotos Dec 04 '15 at 08:19
  • To be clear, you want this: `"de" "et" "te" "ec" "ct" "te" "ed"`? – etienne Dec 04 '15 at 08:21
  • Exactly. Not only the distinct bi-grams. – Sotos Dec 04 '15 at 08:24

2 Answers

Another way to do it with base R is to use mapply and substr:

nc <- nchar("detected")
mapply(function(x, y){substr("detected", x, y)}, x=1:(nc-1), y=2:nc)
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
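
The same idea can be written with base R's `substring()`, which is vectorised over its start and end positions, so no explicit loop is needed. The sketch below wraps it in a helper; the name `bigrams` is hypothetical (not from the question or from qlcMatrix), and the guard for words shorter than two characters is an added assumption, since `1:(nc-1)` would otherwise count backwards.

```r
# Hypothetical helper: all consecutive bi-grams of one word, duplicates kept.
bigrams <- function(word) {
  nc <- nchar(word)
  # Assumption: a word with fewer than 2 characters has no bi-grams.
  if (nc < 2) return(character(0))
  # substring() is vectorised over first/last, so this yields one bi-gram per start position.
  substring(word, 1:(nc - 1), 2:nc)
}

bigrams("detected")
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
```
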
Cath

You can do that without packages:

test_domain <- c("detected")
temp <- strsplit(test_domain ,'')[[1]]
sapply(1:(length(temp)-1), function(x){paste(temp[x:(x+1)], collapse='')})
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
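
If `test_domain` holds several words, the `strsplit` approach extends naturally, because `strsplit` already returns one character vector per input string. This is only a sketch: the input vector `words` and the names `bigram_list` and `counts` are made up for illustration, and pairing `chars[-n]` with `chars[-1]` is a swapped-in alternative to the `sapply` loop above, not the answer's own code.

```r
words <- c("detected", "test")   # hypothetical input vector

bigram_list <- lapply(strsplit(words, ""), function(chars) {
  n <- length(chars)
  # Assumption: strings shorter than 2 characters yield no bi-grams.
  if (n < 2) return(character(0))
  # Pair every character with its successor; duplicates are kept.
  paste0(chars[-n], chars[-1])
})
names(bigram_list) <- words

# Frequency of each bi-gram in "detected"; "te" is counted twice.
counts <- table(bigram_list[["detected"]])
```

Because duplicates are preserved, `table()` can then be used to get per-word bi-gram frequencies, which the unique vector from `$bigrams` cannot provide.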
Cath
etienne