First, I create a table with the frequencies of two-letter combinations from a German text file:

# Packages used throughout (readr, dplyr, tidyr and tibble are all part of the tidyverse)
library(tidyverse)

# Load a German-language text file
text <- read_tsv("http://www.reduts.net/deutsch.txt")
colnames(text) <- c("id", "text")
text <- paste(text$text, collapse = " ")

# Calculate all two-letter combinations (overlapping character bigrams)
tokens_char <- function(str, window = 2) {
  str <- stringi::stri_replace_all_regex(str, "\\W", "")  # drop non-word characters
  str <- tolower(str)
  win <- window - 1
  len1 <- seq_len(nchar(str) - win)                       # start position of each token
  stringi::stri_sub(str, from = len1, to = len1 + win)    # extract overlapping substrings
}
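
For illustration, the function strips non-word characters (including spaces), lower-cases the string, and returns all overlapping bigrams:

> tokens_char("ein Test")
[1] "ei" "in" "nt" "te" "es" "st"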

This creates a lookup table with two columns: the two-letter combination, and its relative frequency among all combinations that start with the same first letter (i.e. the probability of the second letter given the first):

lookuptable <- tibble(token = tokens_char(text, window = 2)) %>%
  count(token, sort = TRUE) %>%
  # split each token so frequencies can be computed per first letter
  separate(token, into = c("first", "second"), sep = 1, remove = FALSE) %>%
  group_by(first) %>%
  mutate(freq = n / sum(n)) %>%   # frequency of the second letter given the first
  ungroup() %>%
  select(token, freq)

> lookuptable
# A tibble: 1,522 x 2
   token  freq
   <chr> <dbl>
 1 en    0.233
 2 er    0.225
 3 ch    0.861
 4 de    0.446
 5 ei    0.127
 6 te    0.302
 7 nd    0.186
 8 in    0.228
 9 ie    0.209
10 ge    0.494
# ... with 1,512 more rows
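
Because freq is computed within each group of tokens that share a first letter, the values should sum to 1 per group. A quick sanity check of the table (just a sketch using the pipeline above):

lookuptable %>%
  mutate(first = substr(token, 1, 1)) %>%
  group_by(first) %>%
  summarise(total = sum(freq))
# every value of 'total' should be (approximately) 1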

Then, I have a function that calculates the product of the frequencies of all two-letter combinations for a given text. For example, for the word "test" I look up the probabilities of "te", "es" and "st" and multiply them: P("te") * P("es") * P("st"):

lookup_text <- function(text = ""){
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token") 
  
  # Return product of all probabilities
  return(prod(df$freq))
}

Now I can easily check how likely it is that a given text is real language rather than gibberish.

> lookup_text("test")
[1] 0.004262462

There is just one major drawback: obviously, the value I get depends heavily on the length of the string I want to check. So my question is: how can I fix this?

In this thread (Is there any way to detect strings like putjbtghguhjjjanika?) someone writes: "Then normalize by the length of the query." But how can this be done? Thanks for your help!

D. Studer
    I think the `tidyverse` tag is wrong here. This is more of a NLP methods question. – JBGruber Apr 22 '22 at 17:30
  • Thanks, I've changed it – D. Studer Apr 22 '22 at 17:31
  • Think you might also want to include the discussion showing where [tokens_char](https://stackoverflow.com/questions/71147234/extract-all-two-character-combinations-from-a-string) came from (i.e. not a library function), which also demonstrates the progression of this inquiry. – Chris Apr 22 '22 at 18:24
  • Thanks Chris. I didn't even realise that was my function in the question. Nice to see that it at least gets used, even though the answer has never been accepted. – JBGruber Apr 22 '22 at 19:52

1 Answer

I think you are basically looking for the mean instead of the product:

lookup_text <- function(text = ""){
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token") 
  
  # Return the mean instead of the product of the probabilities
  return(mean(df$freq))
}

lookup_text(text = "Test")
#> [1] 0.1889944
lookup_text(text = "Tests")
#> [1] 0.1582075
lookup_text(text = "Testkandidaten")
#> [1] 0.1540773
lookup_text(text = "Quer")
#> [1] 0.397371

I included the last example because I noticed some unusual combinations score very high values in your lookup table, for example, "qu". But that is not part of the question.
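
That said, a high value for "qu" is expected given how the table is built: freq is conditional on the first letter, and in German "q" is almost always followed by "u", so P("u" | "q") should be close to 1. You can inspect this yourself (a quick sketch against your lookuptable):

lookuptable %>% filter(substr(token, 1, 1) == "q")
# "qu" should carry almost all of the probability mass in this group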

One problem you need to address is what happens if a letter combination is not in your lookup table. Currently, this results in NA. I would argue the probability should be 0 in this case:

lookup_text <- function(text = ""){
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token") 
  
  # any(is.na(...)) is TRUE if a combination is missing from the table;
  # multiplying by its negation (coerced to 0/1) then turns the score into 0
  return(mean(df$freq, na.rm = TRUE) * !any(is.na(df$freq)))
}


lookup_text(text = "qjbstz")
#> [1] 0

Created on 2022-04-22 by the reprex package (v2.0.1)
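
If you would rather keep the product formulation, note that normalizing a product of n probabilities by length amounts to taking the geometric mean, prod(freq)^(1/n), which is the same as exponentiating the mean of the log probabilities. A minimal sketch of that variant (lookup_text_geom is just an illustrative name):

lookup_text_geom <- function(text = "") {
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token")

  # geometric mean = length-normalized product of the probabilities;
  # an unseen combination (NA) propagates, which you could again map to 0
  exp(mean(log(df$freq)))
}

Unlike the raw product, this score does not shrink mechanically as the string gets longer, so values for strings of different lengths remain comparable.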

JBGruber
  • Thank you for your answer. I know that this is usually done by multiplying the probabilities, so I guess this has some advantages over calculating the mean. Btw: there seems to be a problem with the encoding of the German text file. That's why there are many strange letters. I will fix this later. – D. Studer Apr 22 '22 at 17:59