Identify meaningless or gibberish text from a data frame in R. Is there a way to partially match string/words to a dictionary?

Question

I am looking to create a variable (column) in my data frame that identifies suspected meaningless text (e.g. "asdkjhfas"), or the inverse. This is part of a general script that will assist my team with cleaning survey data.

A function I found on Stack Overflow (link and credit below) lets me match single words against a dictionary, but it does not handle strings that contain multiple words.

Is there any way I can do a partial match (rather than strict) with a dictionary?

library(qdapDictionaries) # install.packages("qdapDictionaries")

# TRUE when the whole string is an entry in the GradyAugmented word list
is.word <- function(x) x %in% GradyAugmented

x <- c(1, 2, 3, 4, 5, 6)
y <- c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf")
df <- data.frame(x, y)

# rows that are not matched keep NA in z
df$z[is.word(df$y)] <- TRUE
df

In a perfect world I would get a column: df$z TRUE TRUE TRUE TRUE TRUE NA

My actual results are: df$z NA TRUE TRUE NA NA NA

I would be more than happy with: df$z TRUE TRUE TRUE NA TRUE NA

I found the function is.word in the question "Remove meaningless words from corpus in R", thanks to user parth.

Handling multiple words would mean tokenizing the text (splitting it into words), comparing each word to the dictionary, and then returning TRUE or FALSE. The question is whether "this is meaningful asdfasdf" should return TRUE or FALSE. – phiver, Aug 26 '19 at 10:44
In my case, I think "this is meaningful asdfasdf" should return TRUE, since it does contain meaningful text. – moose-png, Aug 26 '19 at 11:09
I'm looking at using 'unnest_tokens()' in 'tidytext' to split up the words. I'll post if I get a working example. – moose-png, Aug 26 '19 at 11:13
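The tokenize-then-compare idea from the comments can be sketched in base R. This is only an illustration: `toy_dictionary` and `has_dictionary_word()` are hypothetical names, and the tiny word list stands in for a real one such as GradyAugmented.

```r
# Hypothetical stand-in for a real word list such as GradyAugmented
toy_dictionary <- c("this", "is", "text", "word", "random", "cola", "meaningful")

# TRUE if at least one token of the string is found in the dictionary
has_dictionary_word <- function(x, dictionary) {
  # lower-case, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(as.character(x)), "[^a-z']+")
  vapply(tokens, function(tok) any(tok %in% dictionary), logical(1))
}

y <- c("this is text", "word", "random", "Coca-cola",
       "this is meaningful asdfasdf", "sadfsdf")
has_dictionary_word(y, toy_dictionary)
```

With this rule, a string counts as meaningful as soon as one of its tokens is a dictionary word, which matches the behaviour asked for in the comment above.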

score 4 · Accepted Answer · answered Aug 26 '19 at 11:49


This works with dplyr and tidytext. It is a bit longer than I expected; there might be a shortcut somewhere.

I check if a sentence has words in it and count the number of TRUE values. If this is greater than 0, it has text in it, otherwise not.

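library(tidytext)
library(dplyr)
library(qdapDictionaries) # provides GradyAugmented

df %>% unnest_tokens(words, y) %>%               # one row per lower-cased token
  mutate(text = words %in% GradyAugmented) %>%   # is the token a dictionary word?
  group_by(x) %>% 
  summarise(z = sum(text)) %>%                   # dictionary words per original row
  inner_join(df) %>%                             # bring back the original text
  mutate(z = z > 0)                              # TRUE if at least one word matched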


Joining, by = "x"
# A tibble: 6 x 3
      x z     y                          
  <dbl> <lgl> <chr>                      
1     1 TRUE  this is text               
2     2 TRUE  word                       
3     3 TRUE  random                     
4     4 TRUE  Coca-cola                  
5     5 TRUE  this is meaningful asdfasdf
6     6 FALSE sadfsdf
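If flagging a row as soon as one dictionary word appears feels too lenient, the same per-row count could be turned into a share of recognised tokens. The following is a minimal base R sketch, not part of the answer above: `toy_dictionary`, `dictionary_share()` and the 0.5 cut-off are all hypothetical choices for illustration.

```r
# Hypothetical stand-in for a real word list such as GradyAugmented
toy_dictionary <- c("this", "is", "text", "word", "random", "cola", "meaningful")

# Share of tokens in each string that appear in the dictionary
dictionary_share <- function(x, dictionary) {
  tokens <- strsplit(tolower(as.character(x)), "[^a-z']+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]                 # drop empty tokens left by leading separators
    if (length(tok) == 0) return(NA_real_)  # nothing to score
    mean(tok %in% dictionary)
  }, numeric(1))
}

y <- c("this is text", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf")
share <- dictionary_share(y, toy_dictionary)
share >= 0.5   # hypothetical cut-off; tune it to the survey data
```

A share keeps more information than a plain TRUE/FALSE, so the cut-off can be adjusted later without re-tokenizing the text.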

phiver

Thanks @phiver Looking at implementing this, but getting an error: Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1. – moose-png Aug 26 '19 at 13:56
@moose-png, check whether your df$y is a factor column. That doesn't work well with tidytext; it needs to be a character column. – phiver Aug 26 '19 at 13:59
Found the problem: 'tidytext' only worked for me with a tibble (not a data frame). – moose-png Aug 26 '19 at 14:29
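The error in this exchange is consistent with the text column being stored as a factor: before R 4.0.0, data.frame() converted strings to factors by default, while tibble() never does. A plain data frame should work as well once the column is character. The sketch below uses only base R, and the names `df_factor` and `df_char` are illustrative.

```r
y <- c("this is text", "sadfsdf")

# Explicitly reproduce the pre-4.0 default, where strings become factors
df_factor <- data.frame(x = 1:2, y = y, stringsAsFactors = TRUE)
was_factor <- is.factor(df_factor$y)

# Option 1: convert the column back to character before tokenizing
df_factor$y <- as.character(df_factor$y)
class(df_factor$y)

# Option 2: avoid the conversion when building the data frame
df_char <- data.frame(x = 1:2, y = y, stringsAsFactors = FALSE)
class(df_char$y)
```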

Ben G · Answer 2 · 2019-08-26T12:43:34.613

Here's a solution using purrr (along with dplyr and stringr):

library(tidyverse)
library(qdapDictionaries) # provides GradyAugmented

your_data <- tibble(text = c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf"))

your_data %>%
 # split the text on spaces and punctuation
 mutate(text_split = str_split(text, "\\s|[:punct:]")) %>% 
 # see if some element of the provided text is an element of your dictionary
 mutate(meaningful = map_lgl(text_split, some, is.element, GradyAugmented)) 

# A tibble: 6 x 3
  text                        text_split meaningful
  <chr>                       <list>     <lgl>     
1 this is text                <chr [3]>  TRUE      
2 word                        <chr [1]>  TRUE      
3 random                      <chr [1]>  TRUE      
4 Coca-cola                   <chr [2]>  TRUE      
5 this is meaningful asdfasdf <chr [4]>  TRUE      
6 sadfsdf                     <chr [1]>  FALSE
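One difference from the tidytext approach is worth keeping in mind: unnest_tokens() lower-cases tokens by default, whereas splitting with str_split() keeps the original case. If the dictionary holds lowercase entries only (an assumption worth checking for GradyAugmented), capitalised tokens would be missed unless they are lower-cased first. A small base R illustration with a hypothetical word list:

```r
toy_dictionary <- c("cola", "word")   # hypothetical lowercase-only word list

# split on runs of whitespace or punctuation, keeping the original case
tokens <- strsplit("Coca-Cola Word", "[[:space:][:punct:]]+")[[1]]

case_sensitive <- tokens %in% toy_dictionary            # no token matches as written
case_folded    <- tolower(tokens) %in% toy_dictionary   # "cola" and "word" now match
case_sensitive
case_folded
```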

moose-png · Answer 3 · 2019-08-26T15:16:50.237

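Thanks, @Ben G and @phiver.

Both solutions worked. One thing to note: the tidytext version only ran for me once the data was a tibble, most likely because data.frame() had converted the text column to a factor (its default before R 4.0), which tibble() does not do. I made a few small adjustments to get the result back into a data frame and thought I would share them in case anyone else needs that format.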

library(dplyr)            # also re-exports tibble()
library(tidytext)
library(qdapDictionaries) # provides GradyAugmented

x <- c(1, 2, 3, 4, 5, 6)
y <- c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", 
"sadfsdf")
my_tibble <- tibble(x, y)

my_tibble_new <- my_tibble %>%
   unnest_tokens(output = word, input = y, token = "words") %>%
   mutate(text = word %in% GradyAugmented) %>%
   group_by(x) %>%
   summarise(z = sum(text)) %>%
   inner_join(my_tibble, by = "x") %>%
   mutate(z = z > 0)

# convert the result back to a plain data frame
df <- as.data.frame(my_tibble_new)
