Types of Whitespace in R

Question

My question is about whitespace in R. There have been many questions regarding whitespace in R, but I haven't found any about types of whitespace producing inconsistent behavior.

I scraped a table from Wikipedia and was trying to separate a column containing whitespace (e.g., Minnesota 6) into two columns (c(Minnesota, 6)). I tried tidyr's separate() function and got the maddening error message "Expected 2 pieces. Missing pieces filled with NA in 364 rows ...". It seems that separate() does not recognize the whitespace before the number as whitespace. Interestingly, it does recognize the whitespace when it's inside a state name (e.g., South Dakota, New York).

Code that produces error:

reps %<>% 
  clean_names() %>% 
  separate(district, into = c('state', 'd'), sep = '\\s', remove = FALSE)

Nevertheless, when I run sum(str_detect(reps$District, '\\s')) I get 435, which is the number of rows. So it is detecting whitespace before a number.

A further twist. When I export the dataframe to a .csv and then read it in, the problem with separate() disappears. But still, I would like to know what this invisible problem is.

Here you can find the .rds and here the .csv, if you're into that kind of thing.

Where did you get the `clean_names` function? It's not a basic tidyverse function. — user2554330, Feb 01 '21 at 17:37
Try this for removing white space. https://stackoverflow.com/a/55072311/6497137 — Ryan John, Feb 01 '21 at 17:47

score 3 · Accepted Answer · answered Feb 01 '21 at 18:05

You can use the tools::showNonASCII function to display non-ASCII characters. Here's what I see:

> tools::showNonASCII(head(reps$District))
1: Alabama<c2><a0>1
2: Alabama<c2><a0>2
3: Alabama<c2><a0>3
4: Alabama<c2><a0>4
5: Alabama<c2><a0>5
6: Alabama<c2><a0>6

So these entries have the UTF-8 code C2 A0, which is a non-breaking space. You can convert it to a standard space using

reps$District <- sub("\ua0", " ", reps$District)

(UTF-8 C2 A0 is code point 00A0 according to http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=c2+a0&mode=bytes).
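
The replacement above can be tried on toy data. This is a minimal sketch, assuming a hypothetical vector `district` that stands in for reps$District. It uses gsub() rather than sub() so that every U+00A0 in an entry is replaced, not only the first, and fixed = TRUE so the pattern is matched literally rather than as a regex.

```r
# Hypothetical stand-in for reps$District: the state and the number are
# separated by a non-breaking space (U+00A0), not an ordinary space.
district <- c("Alabama\u00a01", "South Dakota\u00a01")

# Replace every non-breaking space with an ordinary space.
cleaned <- gsub("\u00a0", " ", district, fixed = TRUE)

# TRUE if any non-breaking space is still present after cleaning.
leftover <- any(grepl("\u00a0", cleaned, fixed = TRUE))

print(cleaned)
print(leftover)
```

Once the column contains only ordinary spaces, a separator such as '\\s' or ' ' should split it the same way for every row.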

Your question title was "Types of Whitespace in R", which isn't really well defined. Different functions use different definitions. You'll have to read the documentation or source code to find out what the separate function thinks '\\s' means. Base R supports several regex styles; see ?regex.
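
One way to see such differences directly is to ask each base R regex style whether it treats U+00A0 as whitespace. The sketch below only uses grepl(); the names in `checks` are labels chosen for illustration, and the results can depend on the R version, platform, and locale, so no expected output is shown.

```r
# A single non-breaking space (U+00A0).
nbsp <- "\u00a0"

# Test the same character against three whitespace definitions in base R.
checks <- c(
  default_s   = grepl("\\s", nbsp),               # default (extended) regex
  perl_s      = grepl("\\s", nbsp, perl = TRUE),  # Perl-compatible regex
  posix_space = grepl("[[:space:]]", nbsp)        # POSIX character class
)

print(checks)
```

Any FALSE in the result means that style does not count the non-breaking space as whitespace on your system, which would explain a split that silently fails.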

Fair point about the title not being well-defined. Unfortunately, for someone as unaware of under-the-hood R as I am, there was no way at the time for me to be any more precise in formulating the title. — Justin, Feb 01 '21 at 18:37
