Where is this whitespace hiding?

Question

I have a character vector which is the output of some PDF scraping via pdftotext (a command-line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:

> test
[1] "Address:"              "Clinic Information:"   "Store "                "351 South Washburn"    "Aurora Quick Care"    
[6] "Info"                  "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718"   "Pewaukee"  

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
+                  "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
+                  "Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown"

Clearly there's some character that's not getting assigned in the dput, as in the question below:

How to properly dput internationalized text?

I can't copy/paste the entire vector.... How do I search-and-destroy this non-whitespace whitespace?

Edit

Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:

> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE

There is a single space between the word "Clinic" and "Information" printed on the screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.

That whitespace is not in the vector itself, it's just in the way it is displayed. — David Robinson, Jul 28 '12 at 17:07
Take a look at `lapply(test[4], utf8ToInt)` and see if there are any big numbers in there. — Alan Curry, Jul 28 '12 at 17:37
@AlanCurry `> lapply(test[4], utf8ToInt) [1] 51 53 49 160 83 111 117 116 104 160 87 97 115 104 98 117 114 110` — Ari B. Friedman, Jul 28 '12 at 20:35
The 160 is your issue. It's a non-breaking space. You could match it (and a few other weird types of spaces) by using a Unicode category in a perl-style regexp: `grepl("[0-9]+\\p{Zs}[A-Za-z ]+",test,perl=TRUE)` — Alan Curry, Jul 28 '12 at 20:42
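
For anyone hitting the same symptom, the check suggested in the comments can be run over a whole vector at once. This is only a sketch: `x` is made-up sample data (not the asker's real vector), `non_ascii_codepoints` is a hypothetical helper name, and the strings are assumed to be valid UTF-8.

```r
# Made-up sample data: a no-break space (U+00A0) and Unicode hyphens (U+2010)
x <- c("Clinic\u00a0Information:", "Pewaukee", "Phone: 920\u2010232\u20100718")

# Hypothetical helper: for each string, return the code points above the ASCII range
non_ascii_codepoints <- function(strings) {
  lapply(strings, function(s) {
    cp <- utf8ToInt(s)  # utf8ToInt() only accepts a single string
    cp[cp > 127L]
  })
}

found <- non_ascii_codepoints(x)

# Show the hits in the usual U+XXXX notation
lapply(found, function(cp) sprintf("U+%04X", cp))
```

Elements that come back empty are plain ASCII; anything else is a candidate for the "invisible" character.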

score 5 · Accepted Answer · answered Jul 28 '12 at 20:51

Upgrading my comment to an answer:

Your string contains a non-breaking space (U+00A0) which got translated to a normal space when you pasted it. Matching all the strange space-like characters in Unicode is easy with a perl-style regular expression:

grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

The Perl regexp syntax is \p{categoryName}; the extra backslash is needed because a backslash must itself be escaped inside an R string literal. "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler method for just the U+00A0 character would be:

grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
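
If you would rather clean the data once than adapt every pattern, one possible approach is to replace every Unicode space separator with an ordinary space up front. This is a sketch under assumptions: `test_demo` is made-up data standing in for the scraped vector, and `normalize_spaces` is a hypothetical helper name.

```r
# Made-up stand-in for the scraped vector, with U+00A0 where a space is displayed
test_demo <- c("Clinic\u00a0Information:", "351\u00a0South\u00a0Washburn", "Pewaukee")

# Hypothetical helper: turn any character in the Zs category into an ASCII space
normalize_spaces <- function(x) gsub("\\p{Zs}", " ", x, perl = TRUE)

cleaned <- normalize_spaces(test_demo)

# The patterns from the question can now be used unchanged
grepl("[0-9]+ [A-Za-z ]+", cleaned)
grepl("Clinic Information:", cleaned[1])
```

The trade-off is that the original characters are lost, so keep the raw vector around if you need to tell the two kinds of space apart later.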

Alan Curry


I'm confused. test[2] wouldn't match anyway, it has no digits in it to match the [0-9] part. – Alan Curry Jul 28 '12 at 21:22
You're right of course. Should've read the regex before running it blindly. Works perfectly once I get the offending digits out, thanks! – Ari B. Friedman Jul 28 '12 at 21:29

Tyler Rinker · Answer 2 · 2012-07-28T16:54:11.683

I think you're after trailing and leading whitespace. If so, maybe this function will work:

Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also keep an eye out for tabs and such; this may be useful:

clean <- function(text) {
    gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

So use clean and then Trim, as in:

Trim(clean(test))

Also be on the lookout for the en dash (–) and the em dash (—).
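
One caveat, offered as a sketch rather than a correction: whether `\\s` matches a no-break space depends on the regex engine and platform, so a trim built on `\\s` alone may leave U+00A0 in place. Assuming PCRE (`perl = TRUE`), adding `\\p{Zs}` to the character class covers it explicitly. `trim_unicode` is a hypothetical name.

```r
# Hypothetical variant of a trim function that also strips Unicode space separators
trim_unicode <- function(x) {
  gsub("^[\\s\\p{Zs}]+|[\\s\\p{Zs}]+$", "", x, perl = TRUE)
}

# Made-up input padded with both U+00A0 and ordinary spaces
trimmed <- trim_unicode("\u00a0 Store \u00a0")
trimmed
```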

edited Jul 28 '12 at 16:54

answered Jul 28 '12 at 16:49

Tyler Rinker


score 1 · Answer 3 · answered Jul 28 '12 at 17:41

I don't see anything unusual about the whitespace, but the dashes in the phone numbers are U+2010 (HYPHEN), not the ASCII hyphen (U+002D).
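
If the hyphens matter for later matching, one way to handle them is a literal replacement before applying any pattern. This is a minimal sketch with a made-up string; `phone` and `ascii_phone` are illustrative names only.

```r
# Made-up phone string using U+2010 (HYPHEN) instead of the ASCII hyphen-minus
phone <- "Phone: 920\u2010232\u20100718"

# fixed = TRUE treats the pattern as a literal character, not a regex
ascii_phone <- gsub("\u2010", "-", phone, fixed = TRUE)

# A plain ASCII pattern now matches
grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", ascii_phone)
```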

Alan Moore


score 0 · Answer 4 · answered Jul 28 '12 at 17:09

test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE


library(stringr)
test2 <- str_trim(test, side = "both")

> grepl("[0-9]+ [A-Za-z ]+",test2)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# So there were no spaces in the vector, just the screen output in this case.
