
I am calculating the number of possible words given a list of strings of syllable combinations. The syllable combination list looks like this:

syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV", "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C", "CCCV-CV", "CV-C-CCCV")

On the basis of this list, I'd like to calculate the number of possible words in English given phonotactic rules. To do this, I need to go through the individual items in the syllable combinations list and calculate the number of possible words for each syllable combination.

To generate the number of possible words for a given syllable combination, I need to go through the syllable combination and look at each character in turn in relation to its environment. For a syllable combination such as "CV-CV", for instance, I need to do the following:

  1. identify that this word starts with a single consonant C (rather than 2 or 3 consonants);
  2. identify that this first single consonant is followed by a vowel V;
  3. identify that the word continues with a next syllable (indicated by the hyphen);
  4. identify that this second syllable also starts with a single consonant C;
  5. and ends with another vowel V.

This information needs to be connected with information on the sounds that can appear in these positions:

number_of_vowels <- 20
number_of_initial_consonants_length_1 <- 22
number_of_initial_consonants_length_2 <- 47
number_of_final_consonants_length_1 <- 24

The number of possible words with a "CVCV" syllable structure in English is then calculated as:

number_of_CVCV_words <- number_of_initial_consonants_length_1*number_of_vowels*number_of_initial_consonants_length_1*number_of_vowels

number_of_CVCV_words
193600
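The same product can be sketched generically by looking up a count for each slot of the pattern and multiplying. The name `slot_counts` is an illustrative assumption, and the sketch treats every C as a single initial consonant, which only holds for patterns like "CVCV":

```r
# Counts taken from the question
number_of_vowels <- 20
number_of_initial_consonants_length_1 <- 22

# Hypothetical lookup: one count per slot type
# (assumes every C is a single initial consonant)
slot_counts <- c(C = number_of_initial_consonants_length_1,
                 V = number_of_vowels)

# Split the pattern into single characters and multiply the matching counts
slots <- strsplit("CVCV", "")[[1]]
number_of_CVCV_words <- prod(slot_counts[slots])
number_of_CVCV_words
# 193600
```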

Any advice on how to do this?

I've gotten a bit further with this, but have run into some problems.

First, split the syllable combinations into separate syllables:

split_syllables <- c()

for(i in 1:length(syllable_combinations)){
strsplit(as.character(syllable_combinations[i]), split = "-") -> split_syllable
split_syllables <- append(split_syllables, split_syllable)
}
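As a side note, `strsplit()` is vectorized over its input, so the loop can probably be replaced by a single call. A minimal sketch, which should produce the same list of character vectors as the loop above:

```r
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV",
                           "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C",
                           "CCCV-CV", "CV-C-CCCV")

# One list element per combination, each a character vector of syllables
split_syllables <- strsplit(syllable_combinations, split = "-")

# Number of syllables in each combination
lengths(split_syllables)
# 1 1 2 3 3 4 3 3 2 3
```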

Then, a function that matches each syllable (there is a limited number of unique syllables, so this is doable). The counter1 variable gives the number of possible sound combinations in English for that particular syllable structure:

detect_syllables <- function(syllable){
  # each branch assigns the number of sound combinations for that syllable type
  if(syllable == "C") {
    counter1 <- 25
  } else if(syllable == "CC") {
    counter1 <- 528
  } else if(syllable == "CCCV") {
    counter1 <- 200
  } else if(syllable == "CCV") {
    counter1 <- 940
  } else if(syllable == "CV") {
    counter1 <- 440
  } else if(syllable == "CVC") {
    counter1 <- 10560
  } else
    # print() takes a single object, so build the message with paste()
    print(paste(syllable, "syllable not matched"))
}

Then, functions which carry out the detect_syllables function for each syllable in the original syllable combination:

one_syllable <- function(first_syllable){
lapply(split_syllables[[i]][1], FUN = detect_syllables)
counter1 -> first_syl
first_syl -> number1
print(number1)
}

two_syllables <- function(first_syllable, second_syllable){
lapply(split_syllables[[i]][1], FUN = detect_syllables)
counter1 -> first_syl
lapply(split_syllables[[i]][2], FUN = detect_syllables)
counter1 -> second_syl
first_syl*second_syl -> number2
print(number2) 
}

three_syllables <- function(first_syllable, second_syllable, third_syllable){
lapply(split_syllables[[i]][1], FUN = detect_syllables)
counter1 -> first_syl
lapply(split_syllables[[i]][2], FUN = detect_syllables)
counter1 -> second_syl
lapply(split_syllables[[i]][3], FUN = detect_syllables)
counter1 -> third_syl
first_syl*second_syl*third_syl -> number3
print(number3)
}

four_syllables <- function(first_syllable, second_syllable, third_syllable, fourth_syllable){
lapply(split_syllables[[i]][1], FUN = detect_syllables)
counter1 -> first_syl
lapply(split_syllables[[i]][2], FUN = detect_syllables)
counter1 -> second_syl
lapply(split_syllables[[i]][3], FUN = detect_syllables)
counter1 -> third_syl
lapply(split_syllables[[i]][4], FUN = detect_syllables)
counter1 -> fourth_syl
first_syl*second_syl*third_syl*fourth_syl -> number4
print(number4)
}

And a for loop to make sure that the detect_syllables function is used appropriately:

for(i in 1:10){
if(length(split_syllables[[i]]) == 1) { 
lapply(split_syllables[[i]][1], FUN = one_syllable)
} else if(length(split_syllables[[i]]) == 2) {
lapply(split_syllables[[i]][1], split_syllables[[i]][2], FUN = two_syllables)
} else if(length(split_syllables[[i]]) == 3) {
lapply(split_syllables[[i]][1], split_syllables[[i]][2], split_syllables[[i]][3], FUN = three_syllables)
} else if(length(split_syllables[[i]]) == 4) {
lapply(split_syllables[[i]][1], split_syllables[[i]][2], split_syllables[[i]][3], split_syllables[[i]][4], FUN = four_syllables)
} else 
print("number of syllables is bigger than 4")
}

However, when I try to use the for loop, I get the following error message:

Error in four_syllables(split_syllables[[1]]) : object 'counter1' not found

I realize this has to do with the environment in which 'counter1' is evaluated, as mentioned here: Using get inside lapply, inside a function, but I don't know how to solve it. None of the lapply() calls seems to accept it when I try to point it to the right environment (Error in FUN("C"[[1L]], ...) : unused argument(s)).
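The scoping behaviour can be reproduced in a few lines. A variable assigned inside a function is local to that call, so the caller has to capture the return value rather than look for `counter1` afterwards. The function `count_cv` below is a hypothetical stand-in for `detect_syllables`:

```r
# Hypothetical stand-in for detect_syllables()
count_cv <- function(syllable) {
  counter1 <- 440  # exists only while count_cv() is running
  counter1         # explicit return value
}

# lapply() returns a list of return values; it does not export local variables
first_syl <- lapply("CV", count_cv)[[1]]

# A direct call works the same way: assign the result in the caller
second_syl <- count_cv("CV")

first_syl * second_syl
# 193600
```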

The required result can be obtained very inelegantly by not using lapply(). If someone has another solution, I'd be happy to learn about it.

for(i in 1:10){
if(length(split_syllables[[i]]) == 1) { 
detect_syllables(split_syllables[[i]][1]) -> counter1
counter1 -> first_syl
first_syl -> number1
print(number1)
} else if(length(split_syllables[[i]]) == 2) {
detect_syllables(split_syllables[[i]][1]) -> counter1
counter1 -> first_syl
detect_syllables(split_syllables[[i]][2]) -> counter1
counter1 -> second_syl
first_syl*second_syl -> number2
print(number2)
} else if(length(split_syllables[[i]]) == 3) {
detect_syllables(split_syllables[[i]][1]) -> counter1
counter1 -> first_syl
detect_syllables(split_syllables[[i]][2]) -> counter1
counter1 -> second_syl
detect_syllables(split_syllables[[i]][3]) -> counter1
counter1 -> third_syl
first_syl*second_syl*third_syl -> number3
print(number3)
} else if(length(split_syllables[[i]]) == 4) {
detect_syllables(split_syllables[[i]][1]) -> counter1
counter1 -> first_syl
detect_syllables(split_syllables[[i]][2]) -> counter1
counter1 -> second_syl
detect_syllables(split_syllables[[i]][3]) -> counter1
counter1 -> third_syl
detect_syllables(split_syllables[[i]][4]) -> counter1
counter1 -> fourth_syl
first_syl*second_syl*third_syl*fourth_syl -> number4
print(number4)
} else 
print("number of syllables is bigger than 4")
}
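One possible way to shorten this, sketched under the assumption that the per-syllable counts from `detect_syllables` are what should be multiplied, is to keep the counts in a named vector and take the product per combination. The names `syllable_counts` and `count_words` are illustrative, not part of the original code:

```r
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV",
                           "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C",
                           "CCCV-CV", "CV-C-CCCV")

# Hypothetical lookup table holding the same values as detect_syllables()
syllable_counts <- c(C = 25, CC = 528, CCCV = 200, CCV = 940,
                     CV = 440, CVC = 10560)

# Multiply the counts of all syllables in one combination
count_words <- function(combination) {
  syllables <- strsplit(combination, "-")[[1]]
  unmatched <- setdiff(syllables, names(syllable_counts))
  if (length(unmatched) > 0) {
    stop("syllable not matched: ", paste(unmatched, collapse = ", "))
  }
  prod(syllable_counts[syllables])
}

# Named numeric vector, one entry per combination, any number of syllables
word_counts <- sapply(syllable_combinations, count_words)
word_counts
```

Because `prod()` works on a vector of any length, no separate branch is needed for one, two, three, or four syllables.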
Annemarie
  • Try some vectorization. E.g., `is.cccv <- split_syllable[[1]]=="CCCV"` will return a vector of TRUE/FALSE (easily converted to 1s and 0s if desired). You might be able to treat `C` and `V` as logical 1,0 in the first place, convert to numeric (so `CVCV` becomes `1010` becomes decimal `10`) and do some tree-sorting work. – Carl Witthoft Jun 21 '13 at 11:42

1 Answer


Not sure I follow everything you want to do, but here's some code that might help you get started.

# save first two syllables
split_combs <- strsplit(syllable_combinations, "-")
syl1 <- sapply(split_combs, "[", 1)
syl2 <- sapply(split_combs, "[", 2)

# function to look at how a string starts
check.start <- function(string, start) {
    # does the string start with this?
    tfn <- substring(string, 1, nchar(start))==start
    tfn[is.na(tfn)] <- FALSE
    tfn
    }

# show all syllable combinations with the first two syllables starting with CV
syllable_combinations[check.start(syl1, "CV") & check.start(syl2, "CV")]
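As a quick check of how `check.start` behaves on the list from the question, the sketch below repeats the definitions so it runs on its own. Combinations with only one syllable yield `NA` for the second syllable, which the function maps to `FALSE`:

```r
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV",
                           "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C",
                           "CCCV-CV", "CV-C-CCCV")
split_combs <- strsplit(syllable_combinations, "-")
syl1 <- sapply(split_combs, "[", 1)
syl2 <- sapply(split_combs, "[", 2)  # NA where there is no second syllable

check.start <- function(string, start) {
  tfn <- substring(string, 1, nchar(start)) == start
  tfn[is.na(tfn)] <- FALSE
  tfn
}

# First syllable starts with CCCV and second syllable starts with CCV
matches <- syllable_combinations[check.start(syl1, "CCCV") & check.start(syl2, "CCV")]
matches
# "CCCV-CCV" "CCCV-CCV-CV" "CCCV-CCV-CCV-CV" "CCCV-CCV-C"

# With this list, no combination has both syllables starting with CV
no_matches <- syllable_combinations[check.start(syl1, "CV") & check.start(syl2, "CV")]
length(no_matches)
# 0
```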
Jean V. Adams