First, I create a table with the frequencies of two-letter combinations from a German text file:
# Load a German-language text file
library(tidyverse)  # readr, dplyr, tidyr, tibble
text <- read_tsv("http://www.reduts.net/deutsch.txt")
colnames(text) <- c("id", "text")
# Collapse all lines into one long string
text <- paste(text$text, collapse = " ")
# Calculate all two-letter combinations (overlapping character bigrams)
tokens_char <- function(str, window = 2) {
  # Drop non-word characters and lowercase everything
  str <- stringi::stri_replace_all_regex(str, "\\W", "")
  str <- tolower(str)
  # Start positions of all substrings of length `window`
  win <- window - 1
  len1 <- seq_len(nchar(str) - win)
  stringi::stri_sub(str, from = len1, to = len1 + win)
}
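Applied to a single word, this returns its overlapping bigrams:
> tokens_char("test")
[1] "te" "es" "st"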
This creates a lookup table with two columns: the two-letter combination and its frequency relative to all combinations that start with the same first letter:
lookuptable <- tibble(token = tokens_char(text, window = 2)) %>%
  count(token, sort = TRUE) %>%
  # Split each bigram into its first and second letter
  mutate(token2 = token) %>%
  separate(token, into = c("first", "second"), sep = 1) %>%
  # Frequency of each bigram among all bigrams that share
  # the same first letter
  group_by(first) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%
  mutate(token = token2) %>%
  select(token, freq)
> lookuptable
# A tibble: 1,522 x 2
token freq
<chr> <dbl>
1 en 0.233
2 er 0.225
3 ch 0.861
4 de 0.446
5 ei 0.127
6 te 0.302
7 nd 0.186
8 in 0.228
9 ie 0.209
10 ge 0.494
# ... with 1,512 more rows
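The conditional frequency of a single bigram can then be looked up directly, e.g. (using the value shown in the table above):
> lookuptable %>% filter(token == "te")
# A tibble: 1 x 2
  token  freq
  <chr> <dbl>
1 te    0.302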
Then I have a function that calculates the product of the frequencies of all two-letter combinations in a given text. For example, for the word "test" I look up the probabilities of "te", "es", and "st" and multiply them: P("te") * P("es") * P("st"):
lookup_text <- function(text = ""){
df <- data.frame(token = tokens_char(text, window = 2)) %>%
left_join(lookuptable, by = "token")
# Return product of all probabilities
return(prod(df$freq))
}
Now I can easily check how likely a given text is to be gibberish or real language:
> lookup_text("test")
[1] 0.004262462
There is just one major drawback: obviously, the value I get depends heavily on the length of the string I want to check, because every additional character multiplies in one more probability below 1.
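For example (the exact values depend on the corpus, so I only sketch the calls):
lookup_text("test")      # product of 3 bigram probabilities
lookup_text("testfall")  # product of 7 bigram probabilities, typically far smaller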
So my question is: how can I fix this? In this thread (Is there any way to detect strings like putjbtghguhjjjanika?) someone writes: "Then normalize by the length of the query." But how exactly can this be done?
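My guess is that this means taking the geometric mean of the bigram probabilities, i.e. the n-th root of the product (lookup_text_norm below is just my own sketch of that idea), but I am not sure:
lookup_text_norm <- function(text = "") {
  p <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token") %>%
    pull(freq)
  # Geometric mean = n-th root of the product of n probabilities,
  # computed via logs to avoid underflow for long strings
  exp(mean(log(p)))
}
Thanks for your help!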