4

I have a large dataset with participants from all over the world. Some of these participants entered data using dots/periods/commas to indicate the thousand separators, but R reads them as commas which totally skews my data... e.g. 1234 become 1,234.

I want to remove all dots/periods/commas. My data is entirely composed of full numbers so there shouldn't be any decimals anywhere.

I tried using stringr, but can't quite figure out. Here is a (I hope) reproducible example with a small sample of my data:

structure(
  list(
    chnb = c(10L, 35L, 55L),
    B1_1_77 = c(117.586,
                4022, 4.921),
    C1_1_88 = c(NA, 2206, 1.111),
    C1_1_99 = c(6.172,
                1884, 0),
    C1_3_99 = c(5.62, 129, 0)
  ),
  row.names = c(NA,-3L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

I tried this:

prob1 <- prob %>% str_replace_all('\\.', '')

which gives me this:

> prob
[1] "c(10, 35, 55)"         "c(117586, 4022, 4921)" "c(NA, 2206, 1111)"    
[4] "c(6172, 1884, 0)"      "c(562, 129, 0)"  

It did indeed remove the dots but it gives me a simple list, and totally lost my data structure. An online search suggested I did this:

prob1 <- prob %>% mutate_all(list(str_replace(., '\\.', '')))

but I get an error message:

Error: .fn must be a length 1 string Call rlang::last_error() to see a backtrace In addition: Warning message: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : argument is not an atomic vector; coercing

Am I approaching the whole thing wrong? Any help would be greatly appreciated. I hope my question is clear enough, my apologies if it isn't (I'm new to this).

f.lechleitner
  • 3,554
  • 1
  • 17
  • 35
Andrea
  • 73
  • 1
  • 10
  • 2
    I'm not able to reproduce your error message, but try `prob %>% mutate_all(funs(str_replace_all(.,'[\\.,]','')))` to remove all `.` and `,`. You need to double the escaping \ in most R regex situations, and `mutate_all` usually requires a `funs()` wrapper for the function if you want to do anything complicated. – CriminallyVulgar Mar 06 '19 at 15:51
  • `prob` is the data frame you posted the `dput` of? – camille Mar 06 '19 at 15:57
  • @CriminallyVulgar It worked!!! thank you so, so much. How do I give you props for your quick and effective help? – Andrea Mar 06 '19 at 15:59
  • Hum so it works, but now my variables are characters :/ I tried to wrap everything in as.numeric() but get an error message: prob <- prob %>% as.numeric(mutate_all(funs(str_replace_all(.,'[\\.,]','')))) Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "fun_list" – Andrea Mar 06 '19 at 16:04
  • @camille, correct! – Andrea Mar 06 '19 at 16:05
  • @Andrea No need, it was a minor adjustment I pointed out. As far as your overall needs go the answers below fulfill them more fully! – CriminallyVulgar Mar 06 '19 at 16:07

2 Answers2

5

You'll want to convert to character, then replace, then convert back to numeric:

library(tidyverse)
dat %>%
  mutate_all(~as.numeric(str_remove_all(as.character(.x), '\\.')))

# A tibble: 3 x 5
   chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    10  117586      NA    6172     562
2    35    4022    2206    1884     129
3    55    4921    1111       0       0

Credit to @camille for the stringr::str_remove_all suggestion.

It's also occurred to me that R may be rounding when you don't intend it to in the instance of a trailing zero. Take the first entry of C1_3_99 in your example, 5.62. This may need to be 5,620 (if the period was a thousands separator), rather than 562, as my first solution gives. You can handle this by using a formatter and thoughtful division:

dat %>%
  mutate_all(~as.numeric(str_remove_all(format(round(.x, 3), nsmall = 3), '\\.')) / 
               if_else(str_detect(.x, "\\."), 1, 1000))

# A tibble: 3 x 5
   chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    10  117586      NA    6172    5620
2    35    4022    2206    1884     129
3    55    4921    1111       0       0
Warning message:
In (function (..., .x = ..1, .y = ..2, . = ..1)  :
  NAs introduced by coercion

The formatter ensures there are 3 digits after the decimal, but will add three 0s for numbers without a decimal (formatting code yanked from here), so you divide by 1000 if there is no decimal present. More elegant solutions here are welcome.

zack
  • 5,205
  • 1
  • 19
  • 25
  • 2
    As a shortcut, `stringr::str_remove_all` does the same as `stringr::str_replace_all` with an empty character as the replacement – camille Mar 06 '19 at 16:02
  • thank you so much!! this is brilliant and solves the issue completely while still allowing me to use the data as numeric. it's perfect, greatly appreciated! – Andrea Mar 06 '19 at 16:19
  • No problem - I made a minor update for certain edge cases in the interest of completeness. I'm not sure if they apply to your situation but wanted to do a little digging anyways. – zack Mar 06 '19 at 16:29
0

Just try to use sapply:

df <-  structure(
  list(
    chnb = c(10L, 35L, 55L),
    B1_1_77 = c(117.586,
                4022, 4.921),
    C1_1_88 = c(NA, 2206, 1.111),
    C1_1_99 = c(6.172,
                1884, 0),
    C1_3_99 = c(5.62, 129, 0)
  ),
  row.names = c(NA,-3L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

sapply(df, function(v) {as.numeric(gsub("\\.","", as.character(v)))})

This is the result:

     chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
[1,]   10  117586      NA    6172     562
[2,]   35    4022    2206    1884     129
[3,]   55    4921    1111       0       0

I hope this helps!

Raul Guerrero
  • 388
  • 2
  • 10
  • The same as we caution question askers to not put code or printout text in images, it's helpful if question answerers can do the same – camille Mar 06 '19 at 16:04
  • Also, now you have a matrix instead of a data frame, and preserving the structure of their data was one thing the OP was concerned with – camille Mar 06 '19 at 16:05