1

I have a dataset with a column of calendar months in a slightly weird format. For October and after, it's MMYYYY (e.g., "102020", which stands for October 2020), but for the months before October, it's MYYYY (e.g., "92020"), so with just one digit for the month instead of two.

I want to convert everything to YYYY-MM-DD. The day (DD) is not important and should always be "01", but the month (MM) should always take two digits, which means that for months < October (i.e., < 10), it should have a leading zero: "12020" should become "2020-01-01", and "102020" should become "2020-10-01".

I tried to tackle the issue with str_replace and str_pad from the stringr package, and it works if I do it in separate steps; however, I don't see why my attempt to do it in one step fails.

But let me first show you the multi-step approach that works:

library(data.table)
months <- data.table(format1 = paste0(1:12, "2020")) # sample data

# extracting month
months[, month := str_replace(format1, "^(\\d{1,2})(\\d{4})$", "\\1")]

# extracting year
months[, year := str_replace(format1, "^(\\d{1,2})(\\d{4})$", "\\2")]

# creating the desired format
months[, format2 := paste(year, 
                          str_pad(month, 
                                  2, 
                                  "left", 
                                  "0"),
                          "01",
                          sep = "-")]

This creates exactly what I want. (Note that I use str_replace here on purpose to keep it as close to the problem below as possible.)

However, this more direct solution does not work, and I don't understand why:

library(data.table)
months <- data.table(format1 = paste0(1:12, "2020")) # sample data

months[, format2 := str_replace(
    format1, 
    "^(\\d{1,2})(\\d{4})$", 
    paste("\\2", 
          str_pad("\\1", 
                  2, 
                  "left", 
                  "0"),
          "01",
          sep = "-"))]

Here, str_pad does not seem to do anything. The result is:

    format1    format2
 1:   12020  2020-1-01
 2:   22020  2020-2-01
 3:   32020  2020-3-01
 4:   42020  2020-4-01
 5:   52020  2020-5-01
 6:   62020  2020-6-01
 7:   72020  2020-7-01
 8:   82020  2020-8-01
 9:   92020  2020-9-01
10:  102020 2020-10-01
11:  112020 2020-11-01
12:  122020 2020-12-01

However, if I use 3 as the second argument to str_pad, it gives me 2020-01-01 but also 2020-010-01.

Why does it not work as expected? And are there other ways to do this more elegantly?

Ben
  • 26
  • 3

0 Answers0