How do I choose a random letter, 2 letters, 3 letters, ..., words with the most letters from each sentence in R?

Question

From each sentence, I'm trying to pick one random word of each length: a 1-letter word, a 2-letter word, a 3-letter word, and so on up to the longest word. I then want to combine the chosen words, separated by spaces, into a phrase.

library(dplyr)    # sample_n() and %>%
library(stringr)  # sentences and str_remove_all()

new_data <- sample_n(data.frame(stringr::sentences), 30)
new_data

split_data <- data.frame(X = str_remove_all(new_data$stringr..sentences, "[.,]"))
split_data

split_data <- strsplit(split_data$X," ")
split_data

for(i in split_data){
   generated <- split_data %>%
   lapply(nchar)
}

It should have an output like this:

The sentences I randomly selected are

"The long journey home took a year."

"The young prince became heir to the throne."

…

The generated phrases are

“a The year journey”

“to the heir young became”

…

Please provide sample data, read https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. — r2evans, Feb 03 '22 at 04:17

Merijn van Tilborg · Accepted Answer · 2022-02-04T08:22:03.300

solution

library(stringi)

x <- "The theft of the pearl pin was kept secret"

# split the string into unique, lowercase words
words <- unique(stri_trans_tolower(stri_extract_all_words(x)[[1]]))

# make it a named vector with character counts as names
names(words) <- nchar(words)

# for each word length (sorted numerically, so 10 comes after 9),
# sample one word of that length
y <- unlist(lapply(sort(unique(nchar(words))), function(len) {
  sample(words[nchar(words) == len], 1)
}))

y

#    2        3        4        5        6 
# "of"    "the"   "kept"  "pearl" "secret" 

# make it a sentence again
paste(y, collapse = " ")

# [1] "of the kept pearl secret"
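To apply the same idea to several sentences at once, the steps can be wrapped in a function. The following is a minimal sketch in base R rather than stringi; `pick_by_length` is a hypothetical helper name, and the two sentences stand in for the 30 sampled ones.

```r
# Hypothetical helper (not part of the answer above):
# returns one random word per word length, shortest length first.
pick_by_length <- function(sentence) {
  # Drop punctuation, lower-case, split on whitespace, keep unique words
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unique(strsplit(cleaned, "[[:space:]]+")[[1]])
  words <- words[nzchar(words)]
  # Group words by length; split() orders the groups by numeric length
  groups <- split(words, nchar(words))
  # Sample one word from each group
  picked <- vapply(groups, function(g) sample(g, 1), character(1))
  paste(picked, collapse = " ")
}

sentences <- c("The long journey home took a year.",
               "The young prince became heir to the throne.")

set.seed(1)  # only to make the random picks repeatable
phrases <- vapply(sentences, pick_by_length, character(1), USE.NAMES = FALSE)
phrases
```

`vapply()` returns a plain character vector with one phrase per sentence, which is easier to work with afterwards than a list.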

First answer: it still contains some useful code on words and characters, so I leave it here.

It is not completely clear what you want to achieve as a final result: you mention several things, but your expected output only shows a few randomized words.

Here are a few examples of what you can do with the sentence, based on words and on characters.

library(stringi)

x <- "The long journey home took a year."

words <- stri_extract_all_words(x)[[1]]
words
# [1] "The"     "long"    "journey" "home"    "took"    "a"       "year" 

all_letters <- unlist(strsplit(words, ""))
all_letters
# [1] "T" "h" "e" "l" "o" "n" "g" "j" "o" "u" "r" "n" "e" "y" "h" "o" "m" "e" "t" "o" "o" "k" "a" "y" "e" "a" "r"

letter_counts <- rle(sort(stri_trans_tolower(all_letters)))
letter_counts

# Run Length Encoding
#   lengths: int [1:14] 2 4 1 2 1 1 1 1 2 5 ...
#   values : chr [1:14] "a" "e" "g" "h" "j" "k" "l" "m" "n" "o" "r" "t" "u" "y"

l <- 3
sample(letter_counts$values, l)
# [1] "m" "h" "j"

# most frequently occurring letter(s)
letter_counts$values[which(letter_counts$lengths == max(letter_counts$lengths))]
# [1] "o"

n <- 4
paste(sample(words, n), collapse = " ")
# [1] "took a year journey" (or any other random combination of "n" words)

words[which(nchar(words) == max(nchar(words)))] # longest word
# [1] "journey"
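The letter counts above can also be obtained with `table()` instead of `rle(sort(...))`. This is a small base R sketch of that alternative; the variable names are illustrative only.

```r
x <- "The long journey home took a year."

# Keep letters only, lower-cased, then split into single characters
letters_only <- strsplit(gsub("[^a-z]", "", tolower(x)), "")[[1]]

# table() counts how often each distinct letter occurs
counts <- table(letters_only)

# All letters tied for the highest count
most_common <- names(counts)[counts == max(counts)]
most_common
# [1] "o"
```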

What I want to do is randomly select words from each of the 30 sentences according to word length. For example, it first randomly chooses from the 1-letter words, then from the 2-letter words such as "an", "as", "at", then from the 3-letter words. This continues up to the longest word in the sentence. — Yusuf Bilge Özpolat, Feb 03 '22 at 20:28
For example, given **"The theft of the pearl pin was kept secret."**, I would randomly generate a phrase such as **"of the kept theft secret."** As you can see, the random picks are based on word length. — Yusuf Bilge Özpolat, Feb 03 '22 at 20:45

Domingo · Answer 2 · 2022-02-04T07:07:48.640

Is this what you are looking for?

new_data <- dplyr::sample_n(data.frame(stringr::sentences), 30)
new_data

split_data <- data.frame(X =  stringr::str_remove_all(new_data$stringr..sentences, "[.,]"))
    
max_len <- 10

split_data$X |> stringr::str_split("[:space:]") |> 
  purrr::map(
    \(words)
    {
      words_sorted <- words[order(nchar(words))]
      
      tibble::tibble(
        word = words_sorted,
        word_length = nchar(words_sorted)
      ) |> 
        dplyr::filter(word_length <= max_len) |>
        dplyr::group_by(word_length) |>
        dplyr::sample_n(1) |>
        dplyr::pull(word) |>
        paste0(collapse = " ")
    }
  )

For the sentence

"He picked up the dice for a second roll."

it returns one random word per length:

"a up the dice picked"

If you want to cap the length of the longest word that can be picked, change max_len.
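To see what the cap does in isolation, here is a tiny base R sketch. It applies the same `nchar(...) <= max_len` condition as the `dplyr::filter()` step to the example phrase; `kept` and `phrase` are illustrative names.

```r
words <- c("a", "up", "the", "dice", "picked")
max_len <- 3

# Keep only words with at most max_len characters
kept <- words[nchar(words) <= max_len]

phrase <- paste(kept, collapse = " ")
phrase
# [1] "a up the"
```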

edited Feb 04 '22 at 07:07

answered Feb 03 '22 at 10:31

Domingo


`words_sorted = c(1:30) max_len <- 10 i <- 0 while(i < 30){ i = i + 1 words_sorted <- split_data[order(nchar(split_data))][[i]] tibble::tibble( word = words_sorted, word_length = nchar(words_sorted) ) |> dplyr::filter(word_length <= max_len) |> dplyr::group_by(word_length) |> dplyr::sample_n(1) |> dplyr::pull(word) |> paste0(collapse = " ") } words_sorted` I tried to run the function you wrote in a loop and use it for 30 different sentences. It works normally when not in the loop, but not inside the loop, what do you think? – Yusuf Bilge Özpolat Feb 03 '22 at 21:41
It may be that the code formatting was wrong. I fixed that, and it now runs with your `new_data` as defined. – Domingo Feb 04 '22 at 07:10
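One likely reason the loop in the comment above shows nothing useful is that the pipeline's result is never assigned, and values computed inside a `while` or `for` loop are not printed automatically. Below is a minimal base R sketch of a loop that stores one phrase per sentence. `sentence_words` and `phrases` are hypothetical names, and the two word vectors stand in for the list that `strsplit()` produces in the question.

```r
# Stand-in for the list of word vectors produced by strsplit()
sentence_words <- list(
  c("The", "long", "journey", "home", "took", "a", "year"),
  c("He", "picked", "up", "the", "dice", "for", "a", "second", "roll")
)
max_len <- 10

# Preallocate one result slot per sentence
phrases <- character(length(sentence_words))

for (i in seq_along(sentence_words)) {
  words <- sentence_words[[i]]
  words <- words[nchar(words) <= max_len]
  # One random word per length, shortest length first
  picked <- vapply(split(words, nchar(words)),
                   function(g) sample(g, 1), character(1))
  # Assign the result; otherwise the phrase is computed and then discarded
  phrases[i] <- paste(picked, collapse = " ")
}

phrases
```

Using `seq_along()` instead of a manual counter avoids off-by-one mistakes and also handles an empty list.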
