I would like to reformat a factor vector so that the figures it contains have a thousands separator. The vector contains integers and real numbers, with no particular rule governing the values or their order.

Data

In particular, I'm working with a vector vec similar to the one generated below:

content <- c("0 - 100", "0 - 100", "0 - 100", "0 - 100",
             "150.22 - 170.33",
             "1000 - 2000","1000 - 2000", "1000 - 2000", "1000 - 2000", 
             "7000 - 10000", "7000 - 10000", "7000 - 10000", "7000 - 10000",
             "7000 - 10000", "1000000 - 22000000", "1000000 - 22000000", 
             "1000000 - 22000000",
             "44000000 - 66000000.8989898989")

vec <- factor(x = content, levels = unique(content))

Desired results

My aim is to reformat this vector so that the figures contain the Excel-like thousands separator (1,000), as in the examples below:

100.00
1,000.00
1,000,000.00
1,000,000.56
24,564,000,000.56


Tried approach

I was thinking of using gsubfn with a proto object that would pass the digits, and then maybe creating another proto object that matches 3 digits and replaces them, as in the code below:

library(gsubfn)
gsubfn(pattern = "[0-9][0-9][0-9]", replacement = ~paste0(x, ','), 
       x = as.character(vec))

This works only partially, as a comma is inserted after every run of three digits, for example:

"150,.22 - 170,.33"

which is obviously wrong. I also had to convert the factor to a character vector. Consequently, my question boils down to two elements:

  • How can I work around the comma issue?
  • How can I maintain the original structure of the factor? I need a factor vector ordered in the same manner as the original one, but with commas in the right places.
Ronak Shah
Konrad

3 Answers


Use a regex based on a positive lookahead:

content <- c("0 - 100", "0 - 100", "0 - 100", "0 - 100",
              "1000 - 2000","1000 - 2000", "1000 - 2000", "1000 - 2000", 
              "7000 - 10000", "7000 - 10000", "7000 - 10000", "7000 - 10000",
              "7000 - 10000", "1000000 - 22000000", "1000000 - 22000000", 
              "1000000 - 22000000")
gsub("(\\d)(?=(?:\\d{3})+\\b)", "\\1,", content, perl = TRUE)
# [1] "0 - 100"                "0 - 100"                "0 - 100"               
# [4] "0 - 100"                "1,000 - 2,000"          "1,000 - 2,000"         
# [7] "1,000 - 2,000"          "1,000 - 2,000"          "7,000 - 10,000"        
# [10] "7,000 - 10,000"         "7,000 - 10,000"         "7,000 - 10,000"        
# [13] "7,000 - 10,000"         "1,000,000 - 22,000,000" "1,000,000 - 22,000,000"
# [16] "1,000,000 - 22,000,000"
Avinash Raj
  • I reckon that I have to learn more about lookaheads. BTW, I may have really messy values, like `"44000000 - 66000000.8989898989"`. Ideally, I want to mimic what this unfortunate *thousand separator* in Excel does. I also have the additional hassle of sorting the levels. – Konrad Jan 05 '16 at 09:45
  • `gsub("(\\d)(?=(?:\\d{3})+[\\s.])", "\\1,", content, perl=T)` – Avinash Raj Jan 05 '16 at 09:47
  • That works in terms of fixing the unfortunate number. Presumably, I can make use of [the somewhat related discussion](http://stackoverflow.com/questions/33522278/ordering-a-complex-string-vector-in-order-to-obtain-a-ordered-factor) to order it again and make a nice factor. – Konrad Jan 05 '16 at 09:49
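A note on the two patterns above: as far as I can tell, the `\\b` version also inserts commas into a long decimal part, while the `[\\s.]` version needs a space or a dot after the number and so skips a number at the very end of a string (the `2000` in `"1000 - 2000"`). The following is a minimal sketch of one way to cover both cases, assuming PCRE's `(*SKIP)(*FAIL)` verbs are available through `perl = TRUE`; `add_commas` is a hypothetical helper name, not something from the answer.

```r
# Hypothetical helper: insert thousands separators into the integer part only.
add_commas <- function(x) {
  # Alternative 1: match a decimal part (".8989...") and discard it via
  #   (*SKIP)(*FAIL), so the digits after the point are never touched.
  # Alternative 2: a digit followed by one or more complete groups of
  #   three digits that are not followed by a further digit.
  gsub("\\.\\d+(*SKIP)(*FAIL)|(\\d)(?=(?:\\d{3})+(?!\\d))",
       "\\1,", x, perl = TRUE)
}

add_commas(c("0 - 100", "150.22 - 170.33", "1000 - 2000",
             "44000000 - 66000000.8989898989"))
# [1] "0 - 100"                            "150.22 - 170.33"
# [3] "1,000 - 2,000"                      "44,000,000 - 66,000,000.8989898989"
```

This still returns a character vector, so the factor question is left open here.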

Maybe you can use formatC:

sapply(
  X = lapply(
    X = strsplit(x = content, split = " - "),
    FUN = function(x) {
      formatC(x = as.numeric(x), format = "f", flag = "#", big.mark = ",", 
              decimal.mark = ".", digits = 2, drop0trailing = FALSE)
    }
  ),
  FUN = paste, collapse = " - "
)
# [1] "0.00 - 100.00"                 "0.00 - 100.00"                 "0.00 - 100.00"                
# [4] "0.00 - 100.00"                 "150.22 - 170.33"               "1,000.00 - 2,000.00"          
# [7] "1,000.00 - 2,000.00"           "1,000.00 - 2,000.00"           "1,000.00 - 2,000.00"          
# [10] "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"         
# [13] "7,000.00 - 10,000.00"          "7,000.00 - 10,000.00"          "1,000,000.00 - 22,000,000.00" 
# [16] "1,000,000.00 - 22,000,000.00"  "1,000,000.00 - 22,000,000.00"  "44,000,000.00 - 66,000,000.90"
Victorp
  • It works to an extent: it rounded the last value to `"44,000,000.00 - 66,000,000.90"`, but I can live with this. The thing is that I have to end up with an ordered factor, not a character vector, at the end of the transformation. Simply put, I want my original variable with *Excel-like* thousand separators in the right places. – Konrad Jan 05 '16 at 09:55
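One way to get a factor rather than a character vector out of the `formatC` approach might be to format the levels instead of the values. This is only a sketch under that assumption: the data is a shortened version of the question's `content`, and `fmt_range` is a hypothetical helper name.

```r
# Shortened sample data modelled on the question.
content <- c("0 - 100", "150.22 - 170.33", "1000 - 2000", "0 - 100")
vec <- factor(content, levels = unique(content))

# Hypothetical helper: format both ends of one "a - b" range.
fmt_range <- function(parts) {
  paste(formatC(as.numeric(parts), format = "f", digits = 2, big.mark = ","),
        collapse = " - ")
}

# Rewrite the level labels in place; vec stays a factor with the same order.
levels(vec) <- vapply(strsplit(levels(vec), " - ", fixed = TRUE),
                      fmt_range, character(1))
levels(vec)
# [1] "0.00 - 100.00"       "150.22 - 170.33"     "1,000.00 - 2,000.00"
```

As with the answer above, `digits = 2` rounds long decimal parts.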

Operating only on the levels seems to keep your precision. It also avoids converting your vector to a character vector, and it is likely more efficient because it reduces the data you operate on to the unique values only (rather than the whole vector):

levels(vec) <- sapply(strsplit(levels(vec), " - "), 
                       function(x) paste(prettyNum(x, 
                                            big.mark = ",", 
                                            preserve.width = "none"), 
                                   collapse = " - "))
vec
#  [1] 0 - 100                            0 - 100                            0 - 100                            0 - 100                            150.22 - 170.33                   
#  [6] 1,000 - 2,000                      1,000 - 2,000                      1,000 - 2,000                      1,000 - 2,000                      7,000 - 10,000                    
# [11] 7,000 - 10,000                     7,000 - 10,000                     7,000 - 10,000                     7,000 - 10,000                     1,000,000 - 22,000,000            
# [16] 1,000,000 - 22,000,000             1,000,000 - 22,000,000             44,000,000 - 66,000,000.8989898989
# Levels: 0 - 100 150.22 - 170.33 1,000 - 2,000 7,000 - 10,000 1,000,000 - 22,000,000 44,000,000 - 66,000,000.8989898989 
David Arenburg
  • Thanks very much for your contribution. So the complete solution would be to address the `levels` and the content of the factor separately? – Konrad Jan 05 '16 at 10:14
  • The `levels` are an integral part of the content. If you address them, you address the content. The only difference is that you are not converting your vector to `character` and then back to `factor` (which could cost a lot of memory for a big vector), and you are also operating only on the unique values. That said, I haven't really tested the efficiency of the whole process and am mainly speculating. – David Arenburg Jan 05 '16 at 10:15
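To illustrate the point about levels being part of the content, here is a small check, sketched on shortened sample data rather than the full vector from the question. It compares the factor's underlying integer codes before and after relabelling; `codes_before` is just an illustrative name.

```r
# Shortened sample data modelled on the question.
content <- c("0 - 100", "1000 - 2000", "0 - 100", "7000 - 10000", "1000 - 2000")
vec <- factor(content, levels = unique(content))

# Integer codes that map each element to a level.
codes_before <- as.integer(vec)

# Relabel the levels only, as in the answer above.
levels(vec) <- sapply(strsplit(levels(vec), " - "),
                      function(x) paste(prettyNum(x, big.mark = ",",
                                                  preserve.width = "none"),
                                        collapse = " - "))

# The codes are untouched, so element order and level order are preserved.
identical(codes_before, as.integer(vec))
levels(vec)
```

If the check returns `TRUE`, only the labels changed, which is why no separate step for the "content" should be needed.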