Creating a unified time-series, with dates coming from different (natural) languages

Question

I am using the as.Date function as follows:

x$time_date <- as.Date(x$time_date, format = "%H:%M - %d %b %Y")

This worked fine until I saw a lot of NA values in the output, which I traced back to some of the dates stemming from a different language: German.

My English dates look like this: 18:00 - 10 Dec 2014

Where the German equivalent is: 18:00 - 10 Dez 2014

The month December is abbreviated the German way. This is not recognised by the as.Date function. I have the same problem for five other months:

Mar - März
May - Mai
Jun - Juni
Jul - Juli
Oct - Okt

This looks like it would be of use, but I am unsure of how to implement it for 'unrecognised' formats: How to change multiple Date formats in same column

I attempted to use gsub to replace all the occurrences of the German months, but without luck. x below is the data.table and I work on just the time_date column:

 x$time_date <- gsub("(März)?", "Mar", x$time_date) %>%
        gsub("(Mai)?", "May", .) %>%
        gsub("(Juni)?", "Jun", .) %>%
        gsub("(Juli)?", "Jul", .) %>%
        gsub("(Okt)?", "Oct", .) %>%
        gsub("(Dez)?", "Dec", .)

Not only did this not work, but it is also a very slow process and I have nearly 20 GB of pure .csv files to work through.
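
One likely reason the calls above misbehave: a pattern such as "(Dez)?" makes the whole group optional, so it also matches the empty string at every position, and gsub inserts the replacement there. A minimal sketch of the difference, assuming literal matching is what is wanted; de, en and translate_months are illustrative names, not part of the original code:

```r
x <- "18:00 - 10 Dez 2014"

# The optional group also matches the empty string, so the replacement
# is inserted at every position rather than only where "Dez" occurs.
broken <- gsub("(Dez)?", "Dec", x)

# Literal matching: only the actual month text is replaced.
# fixed = TRUE bypasses the regex engine, which is usually faster too.
ok <- gsub("Dez", "Dec", x, fixed = TRUE)

# Illustrative lookup vectors (hypothetical names).
# "\u00e4" is the escape for a-umlaut, so the script does not depend
# on the encoding of the source file.
de <- c("M\u00e4rz", "Mai", "Juni", "Juli", "Okt", "Dez")
en <- c("Mar", "May", "Jun", "Jul", "Oct", "Dec")

# Apply each literal replacement in turn to a character vector
translate_months <- function(v) {
  for (i in seq_along(de)) v <- gsub(de[i], en[i], v, fixed = TRUE)
  v
}
```

Calling translate_months(x$time_date) before as.Date would then leave only English abbreviations in the column, provided the six spellings listed are the only non-English ones in the data.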

The as.Date documentation mentions different locales / languages, but not how to work with several simultaneously. I also found instructions on how to use different languages; however, my data is all mixed, so I can only think of a conditional loop that uses the correct language for each file, and that would also be slow.

Is there a known workaround for this, which I can't find?

G. Grothendieck · Accepted Answer · 2015-11-23T17:42:01.100

Create a table tab that contains all the translations, then use subscripting to do the translation. The code below seems to work for me on Windows, provided your input abbreviations match the standard ones that R generates for each locale. The precise locale names ("German", etc.) may vary depending on your system; see ?Sys.setlocale for more information. If the abbreviations in your input differ from the ones generated here, you will have to add them to tab yourself, e.g. tab <- c(tab, Juli = "Jul")

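langs <- c("French", "German", "English")  # Windows locale names; other systems differ

# Build a lookup: localized month abbreviation -> English abbreviation
tab <- unlist(lapply(langs, function(lang) {
  Sys.setlocale("LC_TIME", lang)               # switch the date/time locale
  nms <- format(ISOdate(2000, 1:12, 1), "%b")  # month abbreviations in that locale
  setNames(month.abb, nms)                     # names: localized, values: English
}))

x <- c("18:00 - 10 Juli 2014", "18:00 - 10 Mai 2014") # test input

# Strip everything except letters, leaving only the month text
source_month <- gsub("[^[:alpha:]]", "", x)
# Replace each month with its English equivalent, element by element
mapply(sub, source_month, tab[source_month], x, USE.NAMES = FALSE)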

giving:

[1] "18:00 - 10 Jul 2014" "18:00 - 10 May 2014"
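
If the end goal is a Date column rather than translated strings, one alternative is to avoid locale-dependent parsing of "%b" altogether and map the month text to a month number directly. A minimal sketch, assuming every value has the shape "HH:MM - DD Mon YYYY"; month_names, month_nums and parse_mixed_date are hypothetical names, and the German spellings listed are only the handful mentioned in the question:

```r
# Month spellings as they may appear in the data, with matching month numbers.
# month.abb is always the English abbreviations, whatever the locale.
# "\u00e4" is the escape for a-umlaut.
month_names <- c(month.abb, "M\u00e4rz", "Mai", "Juni", "Juli", "Okt", "Dez")
month_nums  <- c(1:12, 3L, 5L, 6L, 7L, 10L, 12L)

parse_mixed_date <- function(x) {
  # Split "18:00 - 10 Dez 2014" into: time, "-", day, month, year
  parts <- strsplit(x, " ", fixed = TRUE)
  day   <- vapply(parts, `[`, character(1), 3)
  mon   <- vapply(parts, `[`, character(1), 4)
  year  <- vapply(parts, `[`, character(1), 5)

  # Unknown month spellings give NA here and therefore an NA date
  m <- month_nums[match(mon, month_names)]

  # Build an ISO date string, which parses the same way in every locale
  iso <- sprintf("%s-%02d-%02d", year, m, as.integer(day))
  as.Date(iso, format = "%Y-%m-%d")
}
```

Like the as.Date call in the question, this discards the time of day. Any spelling missing from month_names shows up as NA, which makes gaps in the table easy to spot.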

edited Nov 23 '15 at 17:42

answered Nov 23 '15 at 16:23


This is a great solution, however it isn't working for me 100%. My output from your code is `[1] NA "18:00 - 10 May 2014"`. I have used langs = `c("de_DE", "C")` as I am on Mac OS. I also tried c("de_DE", "en_GB") and many other variations, but nothing better than the output I showed above. My Sys.setlocale() is "C". Any ideas why it is only working on the second value in the test vector? – n1k31t4 Nov 23 '15 at 19:35
It seems to be a problem with `Juli` - other months like `Okt` and `Dez` are translated as expected. I can only think the reason is that they have four letters and you specify three-letter abbreviations with `month.abb`. I tried to test this with `März`, but that showed that umlauts are not dealt with properly - output: `"18:00 - 10 M\303\244rz 2014" "18:00 - 10 May 2014"` – n1k31t4 Nov 23 '15 at 19:39
Apologies, adding my own worked - but not for `März`. It isn't able to read the umlaut. Can you think of a way to adjust your answer to say "If the month begins with M, and isn't 'May' or 'Mai' then make it 'Mar'" ? Or another hack like that? – n1k31t4 Nov 23 '15 at 19:48
Changing my own locale to German, i.e. `Sys.setlocale(locale = "de_DE")` allowed R to work with umlauts, so I could use `tab <- c(tab, Juli = "Jul", Juni = "Jun", März = "Mar")` and the correct answer was given for the test with März. Is there a more elegant solution? – n1k31t4 Nov 23 '15 at 19:53
If you can get it to work by changing the locale then I would do that and add the missing ones manually as in your comment. I think that would be less messy than trying to get general rules that might later fail if you have to match even more languages. – G. Grothendieck Nov 23 '15 at 20:03
I haven't seen any downside w.r.t. English-language dates from having my locale set to "German" (Windows) and "de_DE" (Unix), so I will stick with that for now. – n1k31t4 Nov 23 '15 at 20:09
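
The locale switch discussed in these comments can also be confined to a single expression, so that the session's LC_TIME setting is restored afterwards. A minimal sketch; with_time_locale is a hypothetical helper, and the locale names accepted by Sys.setlocale are system-specific (for example "German" on Windows, "de_DE" or "de_DE.UTF-8" on many Unix-alikes):

```r
with_time_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous setting when the function exits, even on error
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  # expr is evaluated lazily, so it runs under the new locale
  expr
}

# The "C" locale exists on every system and uses English month abbreviations
dec_abbr <- with_time_locale("C", format(ISOdate(2000, 12, 1), "%b"))
```

Replacing "C" with a German locale name that exists on your machine would return that locale's abbreviation instead, without leaving the session in that locale.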
