I have a character vector which is the file of some PDF scraping via pdftotext
(command line tool).
Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
Clearly there's some character that's not getting assigned in the dput
, as in the question below:
How to properly dput internationalized text?
I can't copy/paste the entire vector.... How do I search-and-destroy this non-whitespace whitespace?
Edit
Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
There is a single space between the word "Clinic" and "Information" printed on the screen and in the dput
output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.