
I have a string containing a list of words, and I want to get all the possible word combinations from it.

fruits <- "Apple Banana Cherry"

To get this output:

"Apple, Banana, Cherry, Apple Banana, Apple Cherry, Banana Cherry, Apple Banana Cherry"

Using the function defined here, slightly modified:

f1 <- function(str1){
  v1 <- strsplit(str1, ' ')[[1]]
  # for each combination size i, generate every i-word combination,
  # then flatten and collapse into one comma-separated string
  paste(unlist(sapply(seq_along(v1), function(i)
    apply(combn(v1, i), 2, paste, collapse = " "))), collapse = ', ')
}

f1(fruits)

This works fine when there are relatively few rows, but the real-life example has a total of 93,300 characters across 3,350 rows with a median string length of 25 characters, causing an error similar to this:

Error in paste(unlist(sapply(seq(length(v1)), function(i) apply(combn(v1, : result would exceed 2^31-1 bytes

I tried changing utils::combn to RcppAlgos::comboGeneral within the function, because it's apparently quicker, but still encountered the same issue. Any suggestions for ways around this?
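For scale, the number of combinations can be counted without generating any of them. Here is a minimal sketch (the 25-word string is a hypothetical example, not one of my actual rows):

```r
# Each string of n words has 2^n - 1 non-empty word combinations,
# so the output size grows exponentially with the word count.
n_words <- 25                     # hypothetical: a single 25-word string
n_combos <- 2^n_words - 1         # same as sum(choose(n_words, 1:n_words))
n_combos
# 33554431 -- and even a modest average combination length pushes the
# collapsed result past R's 2^31 - 1 byte limit for a single string
```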

rsylatian
  • This seems like it will be a very large number, more than most computers could manage. Roughly speaking, even if you allow only up to 10 word-long combinations, wouldn't you already have on the order of 3,000 ^ 10 options, or about 10^44? That would take around 1 petabyte just to list those numbers, let alone assign strings to them. Perhaps there are more constraints you haven't mentioned which could make this easier? – Jon Spring Apr 03 '19 at 19:49
  • This will be a spectacularly huge string. Even with just 1,000 words, there are about 2.7E299 500-word combinations. Including all subsets of all lengths would be about 1.07E301 combinations. – eipi10 Apr 03 '19 at 19:50
  • I'm using a large EC2 instance for it. – rsylatian Apr 03 '19 at 19:55
  • Just out of curiosity, what are you planning to do with this string of all possible combinations once you've created it? If you give us more context, we might be able to come up with a more efficient solution than generating one giant string of combinations. – eipi10 Apr 03 '19 at 20:26
  • The purpose of it was for gap analysis with keywords, by anti-joining partial matches in one set of keywords against another. – rsylatian Apr 03 '19 at 23:21
  • To contextualize @eipi10's calculation, there might be around 2.5M petabytes (2.5 x 10^21 bytes) of total storage on all hard drives worldwide. https://www.quora.com/How-many-gigabytes-of-storage-are-in-the-whole-world You would need many more than a Googol of Earths (including ALL the EC2 instances and every other computer too) to find enough storage space for all the combinations! – Jon Spring Apr 04 '19 at 06:04
  • ...which is all to say it would be useful to add more context to your question, since this approach will quickly become impractical with Earth-based technology of the early 21st century. – Jon Spring Apr 04 '19 at 06:09

3 Answers


We have a fairly efficient function for vectorized skipgrams and ngrams in quanteda. Try this one, with multi-threading for efficiency (you can raise the thread count up to your system's maximum):

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
quanteda_options(threads = 4)

fruits <- "Apple Banana Cherry"
tokens(fruits) %>%
  tokens_skipgrams(., n = seq_len(ntoken(.)), skip = 0:ntoken(.), concatenator = " ") %>%
  as.character() %>%
  paste(collapse = ", ")
## [1] "Apple, Banana, Cherry, Apple Banana, Apple Cherry, Banana Cherry, Apple Banana Cherry"
Ken Benoit

If you have three words

fruits <- "Apple Banana Cherry"

the combinations can be expressed by using a 0 or 1 to denote inclusion or exclusion of each word. This means that with three words you have 2^3 - 1 = 7 options, excluding the empty set:

001 Cherry
010 Banana
011 Banana, Cherry
100 Apple
101 Apple, Cherry
110 Apple, Banana
111 Apple, Banana, Cherry

So we can think of this as counting in binary. All three-word combinations can be expressed with three bits, and there are 2^3 - 1 = 7 options.
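This counting scheme translates directly into base R: treat each integer from 1 to 2^n - 1 as a bitmask selecting which words to include. A sketch (note the output is ordered by mask value rather than by combination length):

```r
fruits <- "Apple Banana Cherry"
words <- strsplit(fruits, " ")[[1]]
n <- length(words)

# each integer mask picks one subset: bit i set means keep word i
combos <- sapply(seq_len(2^n - 1), function(mask) {
  keep <- bitwAnd(mask, 2^(seq_len(n) - 1)) > 0
  paste(words[keep], collapse = " ")
})
paste(combos, collapse = ", ")
# "Apple, Banana, Apple Banana, Cherry, Apple Cherry, Banana Cherry, Apple Banana Cherry"
```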

The problem with storing every combination is that the length of this list will double with every additional word. By the time you have 80 words, it will take 80 bits to express each possible combination, but there will be 2^80 - 1 = around 1,200,000,000,000,000,000,000,000 (1.2E24) different possible combinations, which would take more space than all the hard drives in the world.

I don't mean to imply that this is an unsolvable problem, and it's not my area of expertise to judge whether the other answers will do what you want efficiently, but I wanted to point out that physical constraints make it impractical to pre-calculate and store every possible combination in the way the question proposes.

Jon Spring

To keep the question simple, I omitted that what I was ultimately looking to do was create a list of these combinations.

I also didn't know that this is called skip-gram tokenisation. While still slow, this solution avoids the R memory error and, with enough compute power, does the trick:

library(tokenizers)
# n = largest combination size, n_min = smallest, k = maximum skip distance
unlist(tokenize_skip_ngrams(fruits, n = 3, n_min = 1, k = 3))
rsylatian