I have a dataset with a column of calendar months in a slightly weird format. For October and after, it's MMYYYY (e.g., "102020", which stands for October 2020), but for the months before October, it's MYYYY (e.g., "92020"), so with just one digit for the month instead of two.
I want to convert everything to YYYY-MM-DD. The day (DD) is not important and should always be "01", but the month (MM) should always take two digits, which means that for months < October (i.e., < 10), it should have a leading zero: "12020" should become "2020-01-01", and "102020" should become "2020-10-01".
I tried to tackle the issue with str_replace
and str_pad
from the stringr
package, and it works if I do it in separate steps; however, I don't see why my attempt to do it in one step fails.
But let me first show you the multi-step approach that works:
library(data.table)
months <- data.table(format1 = paste0(1:12, "2020")) # sample data
# extracting month
months[, month := str_replace(format1, "^(\\d{1,2})(\\d{4})$", "\\1")]
# extracting year
months[, year := str_replace(format1, "^(\\d{1,2})(\\d{4})$", "\\2")]
# creating the desired format
months[, format2 := paste(year,
str_pad(month,
2,
"left",
"0"),
"01",
sep = "-")]
This creates exactly what I want. (Note that I use str_replace
here on purpose to keep it as close to the problem below as possible.)
However, this more direct solution does not work, and I don't understand why:
library(data.table)
months <- data.table(format1 = paste0(1:12, "2020")) # sample data
months[, format2 := str_replace(
format1,
"^(\\d{1,2})(\\d{4})$",
paste("\\2",
str_pad("\\1",
2,
"left",
"0"),
"01",
sep = "-"))]
Here, str_pad
does not seem to do anything. The result is:
format1 format2
1: 12020 2020-1-01
2: 22020 2020-2-01
3: 32020 2020-3-01
4: 42020 2020-4-01
5: 52020 2020-5-01
6: 62020 2020-6-01
7: 72020 2020-7-01
8: 82020 2020-8-01
9: 92020 2020-9-01
10: 102020 2020-10-01
11: 112020 2020-11-01
12: 122020 2020-12-01
However, if I use 3
as the second argument to str_pad
, it gives me 2020-01-01
but also 2020-010-01
.
Why does it not work as expected? And are there other ways to do this more elegantly?