Pattern matching is failing due to apparent encoding problems with scraped text

Question

Summary edit for Google (if that's ok): grepl and pattern matching were failing on apparently identical strings. The suspected problem was irregular encoding of the scraped text. The real problem was an invisible non-breaking space (U+00A0, bytes c2 a0 in UTF-8) that prints like an ordinary space and does not change the "nchar" count. The solution is to normalize or remove all whitespace using gsub and a regex before attempting pattern matching. Solution was found by smingerson.

Original Question: I would like to perform topic modeling on a collection of online sermons that I scraped using rvest.

I am cleaning and organizing using pattern matching, especially grepl.

The problem is that grepl fails to match apparently identical strings. The scraped text is a mixture of "unknown" and "UTF-8" encoding. Functions like "Encoding", "enc2native", "enc2utf8", and "iconv" don't seem to help, nor does adjusting grepl arguments like perl = TRUE or useBytes = TRUE. (Not that I fully understand what all of these do.)

There seem to be several posts on this: (1) Troubles with encoding, pattern matching and noisy texts in R (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055 (3) R on Windows: character encoding hell and others.

With respect to #1, I am working in English and not Swedish so I do not see that changing my locale will help. Nor do I understand what portion of the code credited to Wiktor is fixing the problem in the answer provided by the original poster.

With respect to #2, as you'll see below I have attempted using Encoding() to change but with no success.

I am including #3 as a demonstration that many posts discuss foreign languages, while I'm staying in English. They also discuss difficulty with Windows 10 and encoding in RStudio, if that's relevant.

Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original files and isn't reproducible by copying and pasting the following. This is demonstrated by the different results from charToRaw under Edit #1. Per the comment, I added a file on GitHub that contains the error when loaded on my session. Per another comment, I am also adding library calls, and removing some of the whitespace in the center of the "scrapedtitle" because the stackoverflow formatting otherwise introduces a new line character in the middle of the "author" variable. At the end of Edit #2 I have also tried to create a way to copy and paste the troubled encoding using rawToChar but can't coerce to "raw." In Edit #3 I discuss the RStudio options for Encoding, and describe that I saved different scraped portions using different Encoding settings, but didn't keep track of which ones I used when, unfortunately. I expected that the information could have been recoverable and reversible but that doesn't appear to be the case.

#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)

#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))

#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

#attempted grepl: 
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5

#Exploring:
typed <-"By Elder Brook P. Hales" #This is typed in from my keyboard

typed == scrapedvector[5] # FALSE unexpectedly

grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly

#Checking encoding
Encoding(scrapedvector) #[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"

#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
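
A note on why the assignment above appears to do nothing: R never marks ASCII-only strings with a declared encoding, so they always report "unknown". The following is a minimal sketch of that behavior; the variable names ascii_only and with_nbsp are hypothetical and not part of the original post.

```r
# Hypothetical strings for illustration only.
ascii_only <- "Brook P. Hales"          # plain ASCII
with_nbsp  <- "Brook\u00a0P. Hales"     # contains a non-breaking space (U+00A0)

# Marking an ASCII-only string has no effect; it stays "unknown".
Encoding(ascii_only) <- "UTF-8"
Encoding(ascii_only)

# A string written with a \u escape is marked as UTF-8.
Encoding(with_nbsp)

# So a "UTF-8" mark can serve as a hint that an element holds non-ASCII text.
which(Encoding(c(ascii_only, with_nbsp)) == "UTF-8")
```

Read this way, the single "UTF-8" entry in the Encoding() output is less an encoding problem than a pointer to the element that contains a character outside ASCII.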

Edit #1:

# Adding charToRaw information: 
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5]) 
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# The scraped version has "c2 a0" where the typed version has "20" (the 15th byte), so it is one byte longer.

# Results from pasting the vector back into R from this stackoverflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.

# Posting this because it is mentioned in other posts
Sys.getlocale()
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
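
The raw bytes above can also be read as code points, which makes the odd character easier to name. Here is a small sketch, assuming the scraped string is valid UTF-8; the string is rebuilt with a \u escape because the real one does not survive copy-and-paste, and the names scraped, codepoints and non_ascii_pos are hypothetical.

```r
# Rebuild the problem string explicitly (assumption: it matches the scraped bytes).
scraped <- "By Elder Brook\u00a0P. Hales"

# Convert to Unicode code points and find anything outside the ASCII range.
codepoints    <- utf8ToInt(scraped)
non_ascii_pos <- which(codepoints > 127L)

non_ascii_pos                                  # character position(s) of non-ASCII
sprintf("U+%04X", codepoints[non_ascii_pos])   # the code point(s) in U+ notation
```

Here the only hit is U+00A0 at character 15, which is the non-breaking space.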

Edit #2

An example of the file is available on Github: https://github.com/baprisbrey/stackoverflow/releases/tag/vA0
The file is scrapedTalk2.rds.

This is what I see when I load this file into my RStudio session:

scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8.  It should be 8 and 73

scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above 

Encoding(scrapedTalk)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "unknown" "unknown"
 [56] "UTF-8"   "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [78] "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [89] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"  
[100] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown"
[111] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  
[122] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown"
[133] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"

scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.

charToRaw(scrapedTalk[73]) # for reference
 [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73

# Can I create the troubled encoding by pasting the charToRaw result above?
# Note:  There may be an unintentional newline "\n" character introduced in there due to the length of the string and the Stack Overflow formatting.  It should be removed.
troubleString <-  "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
                   strsplit(. ,split=" ") %>%  # so far so good
                   unlist %>%                  # no troubles
                   as.raw %>%                  # NA's and 0's introduced
                   rawToChar                   # failure!
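
One likely reason the attempt above fails is that as.raw() does not parse hex text such as "6c" or "c2"; it expects numbers. A possible workaround, sketched here under that assumption, is to convert each hex pair with strtoi(..., base = 16L) first. The names hex and bytes are hypothetical.

```r
# Hex dump copied from the charToRaw() output above.
hex <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

# Parse each hex pair as a base-16 integer, then convert to raw bytes.
bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))

# Rebuild the string and declare that the bytes are UTF-8.
troubleString <- rawToChar(bytes)
Encoding(troubleString) <- "UTF-8"

# The rebuilt string reproduces the mismatch with the typed version.
troubleString == "By Elder Brook P. Hales"
```

This gives a copy-and-paste way to recreate the problem string without sharing the .rds file.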

Edit #3 Because the problem appears to be Encoding, I am including a discussion of RStudio encoding options. Under RStudio File >> Save With Encoding is the following menu with options:

There are multiple options for encoding. I do not know what the differences between all of these are. The first question is, why doesn't Encoding() reveal all of these options? Surely the "unknown" bucket covers most of these. Second, due to Encoding difficulties, I toggled between Encoding options and it is highly likely that some of the scraped material was saved with one of these other Encoding options. I don't recall which ones I tried with which portions of the scraped material, however. I recognize the ambiguity this introduces into the problem. I would like to know why I can't recover the proper Encoding, convert to another Encoding, but mostly why I can't enable grepl to work.

I've copied & pasted your example, and I'm receiving all the answers you said you should be. Did you happen to type in any of the other text into the example? it could be the difference of an extra space in one of them, or something similar. — smingerson, Nov 16 '19 at 22:04
When I copy my own example from above back into R, everything works as it should be and as expected. For example, the scrapedvector only shows "unknown" for encoding. This leads me to believe that the problem is not reproduced via copy-and-paste but happens when I load the scraped material. I load an .rds file using the load command. When I generate the example above from this original file, I get the errors again. Is there a way to share this file, or at least a part of it, in order to create an actual reproducible example? — baprisbrey, Nov 16 '19 at 22:28
Can you share the file, or a small portion of it, on a site like github? — smingerson, Nov 16 '19 at 23:27
Here's the Github path to a scraped page that shows this error: https://github.com/baprisbrey/stackoverflow/releases/tag/vA0 The file is scrapedTalk2.rds — baprisbrey, Nov 17 '19 at 00:46
You don't include `library` calls so not reproducible. I get `author [1] "Brook P. \nHales"` and no matches with `grepl`. Suggest making your bottom code segment more reproducible. I needed to attempt piecing together code. — IRTFM, Nov 17 '19 at 02:13

smingerson · Accepted Answer · 2019-11-17T17:21:22.830

There's some kind of space within the value which is not cooperating. After further inspection, it looks like one of them has an extra space, even though it is not evident upon printing. The first bit below shows how to replace multiple spaces with a single space. The second shows how to remove all space-like characters when making comparisons.

Solution 1

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
# Replace multiple spaces with a single space.
condensedAuthor <- gsub("\\s+", " ", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("\\s+", " ", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk)
scrapedTalk[indices]
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

Solution 2

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
condensedAuthor <- gsub("[[:space:]]", "", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("[[:space:]]", "", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk) # TRUE at positions 8 and 73
scrapedTalk[indices] # Get the corresponding values from the original vector.
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

Edit: I was replacing \\s+ with the regex representation for space, which ended up replacing it with "s", instead of " ". I've updated to use " ".
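
Whether \\s or [[:space:]] matches a non-breaking space can depend on the platform and regex engine, so a more explicit variant is to replace U+00A0 by name before matching. The following is a minimal sketch of that idea, not the method used in the answer; normalize_spaces and the sample vector scraped are hypothetical.

```r
# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# then collapse runs of ordinary whitespace into a single space.
normalize_spaces <- function(x) {
  x <- gsub("\u00a0", " ", x, fixed = TRUE)
  gsub("[ \t\r\n]+", " ", x)
}

# Sample data standing in for the scraped vector (assumption).
scraped <- c("Brook P. Hales", "By Elder Brook\u00a0P. Hales", "Russell M. Nelson")
author  <- "Brook P. Hales"

# fixed = TRUE treats the "." in the name literally rather than as a regex wildcard.
matches <- which(grepl(author, normalize_spaces(scraped), fixed = TRUE))
matches
```

Unlike removing every space, this keeps the words separated, so the cleaned text stays readable for later steps such as topic modeling.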

This works, thank you. I find that both author and condensedAuthor share the same number of characters (nchar results in 14 for both) but that the charToRaw is slightly different. As you can tell from my post, I would have never investigated making adjustments to the spacing via regex. My generic conclusion from this solution is to remove spaces when working with scraped text. This is really an unexpected conclusion for me and I don't know how you figured it out. Thank you. — baprisbrey, Nov 17 '19 at 16:28
I've updated my answer. Both will have the same byte representation now. It looks like the issue was at element 73. `stringr::str_detect(scrapedTalk[73], "\\x{00A0}")` returns `TRUE`. `\\x{00A0}` is a non-breaking space in Unicode. I was also flummoxed how you weren't getting the expected results, so I reasoned it had to be something with a non-visible character, which are typically types of spaces. That, plus a lot of practice with regex. — smingerson, Nov 17 '19 at 17:28
