
I have a data frame which includes a Reference column. This is a 10-digit number, which could start with zeros. When importing into R, the leading zeros disappear, which I would like to add back in.

I have tried using sprintf and formatC, but I have different problems with each.

DF=data.frame(Reference=c(102030405,2567894562,235648759), Data=c(10,20,30))

The outputs I get are the following:

> sprintf('%010d', DF$Reference)
[1] "0102030405" "        NA" "0235648759"
Warning message:
In sprintf("%010d", DF$Reference) : NAs introduced by coercion
> formatC(DF$Reference, width=10, flag="0")
[1] "001.02e+08" "02.568e+09" "02.356e+08"

The first output gives NA when the number already has 10 digits, and the second returns the values in scientific notation.

What I need is:

[1]  0102030405 2567894562  0235648759
sym246
  • I think your expected output does not reflect the leading zeros. – akrun Mar 07 '16 at 12:54
  • working through the examples in http://stackoverflow.com/questions/5812493/adding-leading-zeros-using-r, leads to `library(stringr); str_pad(DF$Reference, 10, pad = "0")` – user20650 Mar 07 '16 at 12:55
  • I just spotted that, and have edited the post. I haven't come across `str_pad` before, but it seems to be doing the trick. Thank you. – sym246 Mar 07 '16 at 12:57
  • http://stackoverflow.com/questions/14589354/struggling-with-integers-maximum-integer-size might explain results – user20650 Mar 07 '16 at 13:06

2 Answers

library(stringi)
DF = data.frame(Reference = c(102030405,2567894562,235648759), Data = c(10,20,30))
DF$Reference = stri_pad_left(DF$Reference, 10, "0")
DF
#    Reference Data
# 1 0102030405   10
# 2 2567894562   20
# 3 0235648759   30

Alternative solutions: Adding leading zeros using R (http://stackoverflow.com/questions/5812493/adding-leading-zeros-using-r).

"When importing into R, the leading zeros disappear, which I would like to add back in."

Reading the column(s) in as characters would avoid this problem outright. You could use readr::read_csv() with the col_types argument.
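
As a minimal sketch of that idea, the example below uses base R's read.csv() with colClasses, so that it runs without extra packages. The inline CSV text and the names csv_text, df_numeric and df_char are made up for illustration; with readr, the col_types argument plays the same role.

```r
# Hypothetical CSV content; in practice this would be a file on disk.
csv_text <- "Reference,Data
0102030405,10
2567894562,20
0235648759,30"

# Default import: Reference is parsed as a number, so leading zeros are lost.
df_numeric <- read.csv(text = csv_text)

# Forcing Reference to character keeps the digits exactly as written.
df_char <- read.csv(text = csv_text, colClasses = c(Reference = "character"))
df_char$Reference
```

Because the column never becomes numeric, no padding step is needed afterwards.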

effel
  • Props for the real solution: read the file correctly in the first place. – Hong Ooi Mar 07 '16 at 13:51
  • Although `read.csv` with the `colClasses` argument works just as well as `read_csv` with `col_types`. – Hong Ooi Mar 07 '16 at 13:52
  • That's right, thanks for pointing to colClasses. (http://stackoverflow.com/questions/2805357/specifying-colclasses-in-the-read-csv) – effel Mar 07 '16 at 13:56

formatC

You can use

formatC(DF$Reference, digits = 0,  width = 10, format ="f", flag="0")
# [1] "0102030405" "2567894562" "0235648759"

sprintf

The d format in sprintf expects integer values (otherwise they have to be converted with as.integer()). help(integer) explains that:

"the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly."

That is why as.integer(2567894562) returns NA.
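
The limit can be checked directly. The sketch below is illustrative only: the variable names are made up, and the final "%010.0f" format is not part of the original answer. It formats the double as a fixed-point number with no decimals, so no integer conversion is involved.

```r
x <- c(102030405, 2567894562, 235648759)

# Largest value R's integer type can hold.
.Machine$integer.max

# Only the second value exceeds it, so only that one becomes NA.
exceeds <- x > .Machine$integer.max
as_int <- suppressWarnings(as.integer(x))

# Zero-pad the doubles directly: width 10, zero flag, no decimal places.
padded <- sprintf("%010.0f", x)
padded
```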

Another workaround would be to use the character format s in sprintf:

sprintf('%010s',DF$Reference)
# [1] " 102030405" "2567894562" " 235648759"

But here this pads with spaces instead of leading zeros (for strings, the 0 flag zero-pads on some platforms and is ignored on others). gsub() can add the zeros back by replacing spaces with zeros:

gsub(" ","0",sprintf('%010s',DF$Reference))
# [1] "0102030405" "2567894562" "0235648759"
Paul Rougieux