6

I have a character vector which is the output of some PDF scraping via pdftotext (command line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:

> test
[1] "Address:"              "Clinic Information:"   "Store "                "351 South Washburn"    "Aurora Quick Care"    
[6] "Info"                  "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718"   "Pewaukee"  

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
+                  "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
+                  "Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown"

Clearly there's some character that's not being preserved by dput, as in the question below:

How to properly dput internationalized text?

I can't copy/paste the entire vector... How do I search-and-destroy this non-whitespace whitespace?

Edit

Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:

> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE

There is a single space between the word "Clinic" and "Information" printed on the screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.

Ari B. Friedman

4 Answers

5

Upgrading my comment to an answer:

Your string contains a non-breaking space (U+00A0) which got translated to a normal space when you pasted it. Matching all the strange space-like characters in Unicode is easy with a perl-style regular expression:

grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

The Perl regexp syntax is \p{categoryName}; the extra backslash is there because a backslash inside an R string literal must itself be escaped. "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler method for just the U+00A0 character would be

grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
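To make the diagnosis above concrete, here is a minimal sketch. The string s is a hypothetical stand-in for test[2], built on the assumption that the gap between the words is U+00A0; the real vector may contain other separators. It shows the hidden character through its code point, then normalizes every Zs character to an ASCII space so a plain pattern matches.

```r
# Hypothetical stand-in for test[2]: the gap is U+00A0, not an ASCII space
s <- "Clinic\u00a0Information:"

# Code points make the hidden character visible (160 is U+00A0)
nbsp_pos <- which(utf8ToInt(s) == 160L)

# A pattern containing an ASCII space does not match the raw string
before <- grepl("Clinic Information:", s, fixed = TRUE)

# Replace every Unicode space separator with an ASCII space, then match again
s_clean <- gsub("\\p{Zs}", " ", s, perl = TRUE)
after <- grepl("Clinic Information:", s_clean, fixed = TRUE)
```

Normalizing once up front, rather than adding \p{Zs} to every pattern, keeps later regular expressions simple.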
Alan Curry
1

I think you're after trailing and leading whitespace. If so, maybe this function will work:

Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also keep an eye out for tabs and such; this may be useful:

clean <- function(text) {
    # Turn line breaks and tabs into spaces, then collapse runs of whitespace
    gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

So use clean and then Trim, as in:

Trim(clean(test))
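Depending on the regex engine and locale, \s may not cover Unicode separators such as U+00A0. The sketch below is one possible variant, not the functions above: clean_unicode is a hypothetical helper, and the sample values are assumed for illustration. It adds the \p{Z} property class under perl = TRUE so both kinds of whitespace are collapsed and trimmed.

```r
# Assumed sample values: U+00A0 pads the first string and splits the second
x <- c("\u00a0Store \u00a0", "351\u00a0South  Washburn")

# Hypothetical helper: collapse runs of ASCII or Unicode whitespace to one
# space, then strip any space left at either end
clean_unicode <- function(text) {
    collapsed <- gsub("[\\s\\p{Z}]+", " ", text, perl = TRUE)
    gsub("^ +| +$", "", collapsed, perl = TRUE)
}

cleaned <- clean_unicode(x)

# The original pattern from the question now works on the cleaned values
matches <- grepl("[0-9]+ [A-Za-z ]+", cleaned)
```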

Also be on the look out for the en dash (–) and the em dash (—)

Tyler Rinker
1

I don't see anything unusual about the whitespace, but the dashes in the phone numbers are U+2010 (HYPHEN), not the ASCII hyphen (U+002D).
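A minimal sketch of how one might spot and normalize such dashes, assuming a phone string like the one in the question (x is rebuilt here with explicit \u2010 escapes). It lists the non-ASCII code points, then maps every character in the Unicode dash-punctuation category Pd to the ASCII hyphen.

```r
# Assumed sample: the phone number from the question, with U+2010 hyphens
x <- "Phone: 920\u2010232\u20100718"

# List the distinct non-ASCII code points in U+XXXX notation
cp <- utf8ToInt(x)
non_ascii <- sprintf("U+%04X", unique(cp[cp > 127L]))

# Map every dash-like character (category Pd) to the ASCII hyphen
ascii <- gsub("\\p{Pd}", "-", x, perl = TRUE)
```

Because Pd also contains the en dash and the em dash, the same replacement covers those as well.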

Alan Moore
0
test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE


library(stringr)
test2 <- str_trim(test, side = "both")

> grepl("[0-9]+ [A-Za-z ]+",test2)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# So there were no spaces in the vector, just the screen output in this case.
Maiasaura