I normally use trimws to strip leading and trailing blank spaces from text.

Now I have a data frame (df) built from scraped data.

Two of its columns hold money amounts but are chr vectors because, as mentioned, they were scraped from a website. I can apply trimws to one column with no problem, but not to the other.

str(lacuracao_tvs$precio_actual[1])
chr " 1199.00"

Why?

new_precio_actual <- trimws(lacuracao_tvs$precio_actual[1])

dput(new_precio_actual)
" 1199.00"

trimws works on precio_antes but not on precio_actual:

> str(lacuracao_tvs)
'data.frame':   100 obs. of  4 variables:
 $ ecommerce    : chr  "la-curacao" "la-curacao" "la-curacao" "la-curacao" ...
 $ producto     : chr  "TV LED AOC Ultra HD Smart 50\" LE50U7970" "TV Samsung Ultra HD 4K Smart 58\" UN-58RU7100G" "TV LG Ultra HD 4K Smart AI 55\" 55UK6200" "TV AOC Ultra HD 4K Smart 55\" 55U6285" ...
 $ precio_antes : chr  "1899.00" "1899.00" "1899.00" "1899.00" ...
 $ precio_actual: chr  " 1199.00" " 1199.00" " 1199.00" " 1199.00" ...

SessionInfo:

Sys.info()
          sysname           release           version          nodename 
        "Windows"          "10 x64"     "build 17763" "DESKTOP-MNDUKBD" 
          machine             login              user    effective_user 
         "x86-64"       "OGONZALES"       "OGONZALES"       "OGONZALES" 
> sessionInfo(package = NULL)
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.7.8     rvest_0.3.2     xml2_1.2.0      RSelenium_1.7.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       rstudioapi_0.9.0 bindr_0.1.1      magrittr_1.5    
 [5] rappdirs_0.3.1   tidyselect_0.2.5 R6_2.3.0         rlang_0.3.1     
 [9] stringr_1.3.1    httr_1.4.0       caTools_1.17.1.1 tools_3.5.2     
[13] binman_0.1.1     selectr_0.4-1    semver_0.2.0     subprocess_0.8.3
[17] yaml_2.2.0       openssl_1.1      assertthat_0.2.0 tibble_2.0.1    
[21] crayon_1.3.4     bindrcpp_0.2.2   purrr_0.2.5      bitops_1.0-6    
[25] curl_3.3         glue_1.3.0       wdman_0.2.4      stringi_1.2.4   
[29] compiler_3.5.2   pillar_1.3.1     XML_3.98-1.20    jsonlite_1.6    
[33] pkgconfig_2.0.2

UPDATE 1:

utf8ToInt(lacuracao_tvs$precio_actual[1])
[1] 160  49  49  57  57  46  48  48
Omar Gonzales
  • 1
    Could you run utf8ToInt(lacuracao_tvs$precio_actual[1]) and share the output? – Katia Jun 26 '19 at 23:24
  • @Katia please, see update 1. – Omar Gonzales Jun 26 '19 at 23:26
  • 2
    Yes, that is what I thought. The character with code 160 is a non-breaking space, which is not one of the characters trimws treats as whitespace. That is why you see it as a "blank space" and R does not. trimws only removes the characters [ \t\r\n]. Let me come up with code that cleans your character vectors and I will post it soon. – Katia Jun 26 '19 at 23:31
  • 1
    Possible duplicate of [trimws bug? leading whitespace not removed](https://stackoverflow.com/questions/45050617/trimws-bug-leading-whitespace-not-removed) – Ritchie Sacramento Jun 26 '19 at 23:51
  • 1
    @Katia Where do you find the ASCII code equivalents? I've googled, but the pages I found only go up to code 126. – Omar Gonzales Jun 27 '19 at 14:44
  • 1
    @OmarGonzales You can use the utf8ToInt(x) function in R to convert your string to a vector of Unicode code points (ASCII itself only covers 0 to 127). You can also look at https://www.utf8-chartable.de/unicode-utf8-table.pl (there are multiple pages); the codes there are shown in hexadecimal by default, but you can press "decimal" at the top of the page to switch to decimal. – Katia Jun 27 '19 at 15:28

3 Answers

The character with code 160 (U+00A0, which lies just outside the 0 to 127 ASCII range) is called a "non-breaking space." One can read about it on Wikipedia:

https://en.wikipedia.org/wiki/Non-breaking_space

The trimws() function does not include it in the list of characters that are removed by the function:

x <- intToUtf8(c(160,49,49,57,57,46,48,48))
x
#[1] " 1199.00"

trimws(x)
#[1] " 1199.00"
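
To see which characters a string really contains, it can help to print its code points in hexadecimal. This is a minimal sketch in base R; `show_codepoints` is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: label each character of one string with its code point.
show_codepoints <- function(s) {
  sprintf("U+%04X", utf8ToInt(s))
}

x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# The first element is the non-breaking space, U+00A0.
show_codepoints(x)[1]

# Positions of characters outside the ASCII range (code point > 127).
which(utf8ToInt(x) > 127)
```

Note that utf8ToInt() accepts a single string, so a whole column would need something like lapply() around it.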

One way to get rid of it is to use the str_trim() function from the stringr package:

library(stringr)
y <- str_trim(x)
trimws(y)
#[1] "1199.00"

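Another way is to apply the iconv() function first, which here transliterates the non-breaking space to an ordinary space that trimws() can remove (transliteration results can differ between platforms):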

y <- iconv(x, from = 'UTF-8', to = 'ASCII//TRANSLIT')
trimws(y)
#[1] "1199.00"

UPDATE: why trimws() does not remove the "invisible" character described above while stringr::str_trim() does.

Here is what the trimws() help says:

For portability, ‘whitespace’ is taken as the character class [ \t\r\n] (space, horizontal tab, line feed, carriage return)

The help topic for stringr::str_trim() does not itself specify what counts as "white space", but the help for stri_trim_both, which str_trim() calls, shows: stri_trim_both(str, pattern = "\\P{Wspace}"). In other words, it trims everything up to the first character that lacks the Unicode White_Space property. That is a much wider set than [ \t\r\n], and it includes the non-breaking space.
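
The difference between a narrow and a wide whitespace class can be checked directly. The sketch below uses base R only, with PCRE's \h class (horizontal white space) standing in for stringi's Wspace property. That substitution is an assumption made for illustration, not what str_trim() runs internally:

```r
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# Narrow class used by trimws() by default: space, tab, CR, LF.
starts_narrow <- grepl("^[ \t\r\n]", x, perl = TRUE)

# PCRE \h covers horizontal white space, which includes U+00A0.
starts_wide <- grepl("^\\h", x, perl = TRUE)

# The narrow class does not see the leading character; the wide one does.
c(narrow = starts_narrow, wide = starts_wide)
```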

UPDATE 2

As @H1 noted, R version 3.6.0 added an option to specify what to consider a whitespace character. From the help:

Internally, 'sub(re, "", *, perl = TRUE)', i.e., PCRE library regular expressions are used. For portability, the default 'whitespace' is the character class '[ \t\r\n]' (space, horizontal tab, carriage return, newline). Alternatively, '[\h\v]' is a good (PCRE) generalization to match all Unicode horizontal and vertical white space characters, see also <URL: https://www.pcre.org>.

So if you are using R 3.6.0 or later, you can simply do:

> trimws(x, whitespace = "[\\h\\v]")
#[1] "1199.00"
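
On R versions before 3.6.0, where trimws() has no whitespace argument, the same character class can be applied with gsub(). This is a minimal sketch: `trim_unicode_ws` is a hypothetical helper name and the sample prices are made up for illustration:

```r
# Hypothetical helper: strip Unicode horizontal/vertical white space
# from both ends of each string, using the PCRE class [\h\v].
trim_unicode_ws <- function(s) {
  gsub("^[\\h\\v]+|[\\h\\v]+$", "", s, perl = TRUE)
}

# Made-up sample values: a leading non-breaking space, one trailing space.
precio <- c("\u00a01199.00", "\u00a0899.00 ")

# Once trimmed, the strings convert cleanly to numbers.
precio_num <- as.numeric(trim_unicode_ws(precio))
precio_num
```

Converting to numeric right after trimming is usually the point of the exercise for scraped price columns like precio_actual.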
Dustin Stoltz
Katia
  • Could you explain why `str_trim` removes the invisible space and `trimws` does not? – Omar Gonzales Jun 26 '19 at 23:55
  • 1
    @OmarGonzales Just added some info to my answer. See if this answers your question. – Katia Jun 27 '19 at 00:08
  • Thanks for the thorough answer, Katia. The [\h\v] generalization for matching all space characters is very useful. I don't understand why it is not the default in trimws(). – Dan Jun 10 '22 at 13:23

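From R version 3.6.0, trimws() has a whitespace argument that lets you define what is considered whitespace, which in this case includes the no-break space. Because trimws() pastes "^" and "+" (or "+$") around the pattern, a character class is safer than an alternation such as "\u00A0|\\s", which would leave one branch unanchored:

trimws(x, whitespace = "[\u00A0\\s]")
[1] "1199.00"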
Ritchie Sacramento

Short answer: use enc2native() and str_trim().

Long answer: I had an issue where a database query returned text that was not valid UTF-8, which resulted in the following error:

Error in sub(re, "", x, perl = TRUE) : input string 5 is invalid UTF-8

I initially used utf8_encode wrapped in an lapply call; however, this replaced every newline and carriage-return character with the literal text \n and \r, which I found undesirable (note that without the lapply wrapper, the whole df is converted into a single character string).

Using enc2native(y) %>% str_trim() avoided this; to apply it to a df, I wrote a custom function.

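    library(dplyr)    # provides %>% and as_tibble()
    library(stringr)  # provides str_trim()

    cleanDBO <- function(x) {
      # Use enc2native() as it will replace non-UTF-8 characters with
      # something readable and not replace \r, \n etc. with text.
      x %>%
        lapply(function(y) {
          # Re-encode and trim character columns only;
          # every other column passes through unchanged.
          if (is.character(y)) str_trim(enc2native(y)) else y
        }) %>%
        as_tibble()
    }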

This leaves all non-character columns as they are; without the if/else, all columns are converted to character.
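
The same "only touch character columns" guard can also be written without extra packages. The sketch below is base R only and swaps in a regex trim in place of enc2native() and str_trim(), so it does not address the invalid UTF-8 error itself. `trim_chr_cols` and the sample data frame are hypothetical:

```r
# Hypothetical helper: trim white space in character columns only.
trim_chr_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    gsub("^[\\h\\v]+|[\\h\\v]+$", "", col, perl = TRUE)
  })
  df
}

# Made-up data: two character columns and one integer column.
tvs <- data.frame(
  producto      = c(" TV A", "TV B "),
  precio_actual = c("\u00a01199.00", "\u00a0899.00"),
  stock         = c(3L, 0L),
  stringsAsFactors = FALSE
)

out <- trim_chr_cols(tvs)
str(out)
```

The integer column keeps its type because the replacement is restricted to the columns selected by is_chr.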

lifedroid