15

I have lots of dates in a column in a CSV file that I need to convert from dd/mm/yyyy to yyyy-mm-dd format. For example 17/01/2010 should be converted to 2010-01-17.

How can I do this in Perl or Python?

brian d foy
  • 129,424
  • 31
  • 207
  • 592
FunLovinCoder
  • 7,597
  • 11
  • 46
  • 57
  • 6
    There's a big difference between solving the problem for isolated, well-formed ASCII strings with nothing else in them, and solving it for strings wherever they occur as dates in presumably arbitrary text files — **especially** if it’s proper Unicode data, not merely legacy ASCII. – tchrist Nov 02 '10 at 15:07

8 Answers8

30

If you are guaranteed to have well-formed data consisting of nothing else but a singleton date in the DD-MM-YYYY format, then this works:

# FIRST METHOD
my $ndate = join("-" => reverse split(m[/], $date));

That works on a $date holding "07/04/1776" but fails on "this 17/01/2010 and that 01/17/2010 there". Instead, use:

# SECOND METHOD
($ndate = $date) =~ s{
    \b
      ( \d \d   )
    / ( \d \d   )
    / ( \d {4}  )
    \b
}{$3-$2-$1}gx;

If you prefer a more "grammatical" regex, so that it’s easier to maintain and update, you can instead use this:

# THIRD METHOD
($ndate = $date) =~ s{
    (?&break)

              (?<DAY>    (?&day)    )
    (?&slash) (?<MONTH>  (?&month)  )
    (?&slash) (?<YEAR>   (?&year)   )

    (?&break)

    (?(DEFINE)
        (?<break> \b     )
        (?<slash> /      )
        (?<year>  \d {4} )
        (?<month> \d {2} )
        (?<day>   \d {2} )
    )
}{
    join "-" => @+{qw<YEAR MONTH DAY>}
}gxe;

Finally, if you have Unicode data, you might want to be a bit more careful.

# FOURTH METHOD
($ndate = $date) =~ s{
    (?&break_before)
              (?<DAY>    (?&day)    )
    (?&slash) (?<MONTH>  (?&month)  )
    (?&slash) (?<YEAR>   (?&year)   )
    (?&break_after)

    (?(DEFINE)
        (?<slash>     /                  )
        (?<start>     \A                 )
        (?<finish>    \z                 )

        # don't really want to use \D or [^0-9] here:
        (?<break_before>
           (?<= [\pC\pP\pS\p{Space}] )
         | (?<= \A                )
        )
        (?<break_after>
            (?= [\pC\pP\pS\p{Space}]
              | \z
            )
        )
        (?<digit> \d            )
        (?<year>  (?&digit) {4} )
        (?<month> (?&digit) {2} )
        (?<day>   (?&digit) {2} )
    )
}{
    join "-" => @+{qw<YEAR MONTH DAY>}
}gxe;

You can see how each of these four approaches performs when confronted with sample input strings like these:

my $sample  = q(17/01/2010);
my @strings =  (
    $sample,  # trivial case

    # multiple case
    "this $sample and that $sample there",

    # multiple case with non-ASCII BMP code points
    # U+201C and U+201D are LEFT and RIGHT DOUBLE QUOTATION MARK
    "from \x{201c}$sample\x{201d} through\xA0$sample",

    # multiple case with non-ASCII code points
    #   from both the BMP and the SMP 
    # code point U+02013 is EN DASH, props \pP \p{Pd}
    # code point U+10179 is GREEK YEAR SIGN, props \pS \p{So}
    # code point U+110BD is KAITHI NUMBER SIGN, props \pC \p{Cf}
    "\x{10179}$sample\x{2013}\x{110BD}$sample",
);

Now letting $date be a foreach iterator through that array, we get this output:

Original is:   17/01/2010
First method:  2010-01-17
Second method: 2010-01-17
Third method:  2010-01-17
Fourth method: 2010-01-17

Original is:   this 17/01/2010 and that 17/01/2010 there
First method:  2010 there-01-2010 and that 17-01-this 17
Second method: this 2010-01-17 and that 2010-01-17 there
Third method:  this 2010-01-17 and that 2010-01-17 there
Fourth method: this 2010-01-17 and that 2010-01-17 there

Original is:   from “17/01/2010” through 17/01/2010
First method:  2010-01-2010” through 17-01-from “17
Second method: from “2010-01-17” through 2010-01-17
Third method:  from “2010-01-17” through 2010-01-17
Fourth method: from “2010-01-17” through 2010-01-17

Original is:   17/01/2010–17/01/2010
First method:  2010-01-2010–17-01-17
Second method: 2010-01-17–2010-01-17
Third method:  2010-01-17–2010-01-17
Fourth method: 2010-01-17–2010-01-17

Now let’s suppose that you actually do want to match non-ASCII digits. For example:

   U+660  ARABIC-INDIC DIGIT ZERO
   U+661  ARABIC-INDIC DIGIT ONE
   U+662  ARABIC-INDIC DIGIT TWO
   U+663  ARABIC-INDIC DIGIT THREE
   U+664  ARABIC-INDIC DIGIT FOUR
   U+665  ARABIC-INDIC DIGIT FIVE
   U+666  ARABIC-INDIC DIGIT SIX
   U+667  ARABIC-INDIC DIGIT SEVEN
   U+668  ARABIC-INDIC DIGIT EIGHT
   U+669  ARABIC-INDIC DIGIT NINE

or even

 U+1D7F6  MATHEMATICAL MONOSPACE DIGIT ZERO
 U+1D7F7  MATHEMATICAL MONOSPACE DIGIT ONE
 U+1D7F8  MATHEMATICAL MONOSPACE DIGIT TWO
 U+1D7F9  MATHEMATICAL MONOSPACE DIGIT THREE
 U+1D7FA  MATHEMATICAL MONOSPACE DIGIT FOUR
 U+1D7FB  MATHEMATICAL MONOSPACE DIGIT FIVE
 U+1D7FC  MATHEMATICAL MONOSPACE DIGIT SIX
 U+1D7FD  MATHEMATICAL MONOSPACE DIGIT SEVEN
 U+1D7FE  MATHEMATICAL MONOSPACE DIGIT EIGHT
 U+1D7FF  MATHEMATICAL MONOSPACE DIGIT NINE

So imagine you have a date in mathematical monospace digits, like this:

$date = "\x{1D7F7}\x{1D7FD}/\x{1D7F7}\x{1D7F6}/\x{1D7F8}\x{1D7F6}\x{1D7F7}\x{1D7F6}";

The Perl code will work just fine on that:

Original is:   //
First method:  --
Second method: --
Third method:  --
Fourth method: --

I think you’ll find that Python has a pretty brain‐damaged Unicode model whose lack of support for abstract characters and strings irrespective of content makes it ridiculously difficult to write things like this.

It’s also tough to write legible regular expressions in Python where you decouple the declaration of the subexpressions from their execution, since (?(DEFINE)...) blocks are not supported there. Heck, Python doesn’t even support Unicode properties. It’s just not suitable for Unicode regex work because of this.

But hey, if you think that’s bad in Python compared to Perl (and it certainly is), just try any other language. I haven’t found one that isn’t still worse for this sort of work.

As you see, you run into real problems when you ask for regex solutions from multiple languages. First of all, the solutions are difficult to compare because of the different regex flavors. But also because no other language can compare with Perl for power, expressivity, and maintainability in its regular expressions. This may become even more obvious once arbitrary Unicode enters the picture.

So if you just wanted Python, you should have asked for only that. Otherwise it’s a terribly unfair contest that Python will nearly always lose; it’s just too messy to get things like this correct in Python, let alone both correct and clean. That’s asking more of it than it can produce.

In contrast, Perl’s regexes excel at both those.

tchrist
  • 78,834
  • 30
  • 123
  • 180
  • 6
    ++ for the education value. – Zaid Nov 02 '10 at 16:27
  • 2
    @Zaid,@Frost: You’re very welcome. Unicode regexes are very much in my head these days, and I’m trying to teach people that you really can write clean regexes that are simultaneously portable, legible, and maintainable. I’m trying to put down the "regexes are inscrutable" myth. Of course if you can’t use comments, whitespace, alphabetic identifiers, or decouple your declarations from their execution, it is completely hopeless. So don’t do that: use all those techniques in regexes, just as in any other programming language. – tchrist Nov 02 '10 at 17:46
  • I'm wondering whether using `(?(DEFINE)...)` here actually does anything that couldn't be accomplished by just good ol' interpolation of strings: `$slash = qr(/); $start = qr/\A/;` etc., then interpolate these into the substitution instead of writing `(?&slash)` etc. – j_random_hacker Nov 09 '10 at 03:20
  • -1 to get your attention, will revert of course if you can answer my question! :) – j_random_hacker Jan 03 '11 at 18:14
20
>>> from datetime import datetime
>>> datetime.strptime('02/11/2010', '%d/%m/%Y').strftime('%Y-%m-%d')
'2010-11-02'

or more hackish way (that doesn't check for validity of values):

>>> '-'.join('02/11/2010'.split('/')[::-1])
'2010-11-02'
>>> '-'.join(reversed('02/11/2010'.split('/')))
'2010-11-02'
SilentGhost
  • 307,395
  • 66
  • 306
  • 293
  • 4
    That doesn't work on real data. You are assuming there's nothing else in the string. – tchrist Nov 02 '10 at 13:58
  • 5
    The date is a field I extract from a CSV record. There is nothing else in the field. It works perfectly for my requirements. – FunLovinCoder Nov 02 '10 at 14:16
  • 1
    @swisstony: then your question is a little misleading; it implies two problems: finding the dates in a text file and converting them (presumably in place) – ysth Nov 02 '10 at 14:42
  • 1
    @swisstony, that isn’t how your question currently reads. You should have given more constraints if more constraints apply. Otherwise people will pluck canned solutions from this that don’t end up working correctly in less constrained situations. – tchrist Nov 02 '10 at 14:51
  • 13
    @tchrist: question is quite unambiguous: *I need to convert from `dd/mm/yyyy` to `yyyy-mm-dd` format*. – SilentGhost Nov 02 '10 at 14:53
  • 2
    @silentghost: The question says "in a text file". It is therefore reasonable to make something that works when such dates are encountered within a text file. Your solution does not work for such things. It has the same bug (amongst others) as my FIRST METHOD given below. Rigged demos are easy to arrange; real-word processing may not be. – tchrist Nov 02 '10 at 14:56
  • @tchrist: OP **doesn't have this problem** and you were told so **an hour ago**. Please, spare me wrath of Perl programmers and leave comments related to the question in the relevant comments section. Not as a comments to this answer. – SilentGhost Nov 02 '10 at 15:00
  • 1
    @silentghost: I certainly was **not** so notified "over an hour ago". I answered the problem as it was posed, and at the time I answered it, no such constraints were given. You misjudge how long it took me to type that. The problem should have been better specified. And I'll leave comments where they belong, thank you very much. – tchrist Nov 02 '10 at 15:05
  • 6
    I am with silentghost on this one...don't solve problems that do not exist. Why assume the data was not validated in it's original form? – frankc Nov 02 '10 at 16:35
  • 4
    @FrankC, I solved the problem as it was posed. It wasn't really completely specified, so I gave multiple answers, showing which ones worked where, and which ones failed where. I try hard not to assume things that aren't in the original question, especially if it might get used as cargo-cult programming. All the `split()` solutions, including mine, break on dates in text files; they work only on isolated instances, which is **NOT** what the question asked about. It may have been what he wanted, but he didn’t ask for that. Hence the multiple answers with differing provisos and approaches. – tchrist Nov 02 '10 at 17:50
  • 1
    Tom, please, either stop being anal or stop posting comments to my answer. **EVERYONE** understands what you're saying. kthxby. – SilentGhost Nov 02 '10 at 17:52
  • 1
    @chattyghost: Which one of the two would you prefer? ☺ – tchrist Nov 02 '10 at 19:11
  • 1
    @tchrist: i'd prefer you taking care of 30 of February with your wonderful regex. – SilentGhost Nov 02 '10 at 19:32
  • 2
    Precisely what would you like me to **do** about “30 February”, eh? I don’t recall date validation being part of the question, but I’d be glad to add it if you’d like. This sort of thing is child’s play — in Perl regexes, that is; I shouldn’t care to imagine the pain that attempting it in Python’s reduced-instruction-set regexes would necessarily entail. No thanks!☻ – tchrist Nov 02 '10 at 19:41
  • Is there location information in the CSV file? You'd need that to determine if this was one of the rare cases where there was a February 30th. – ysth Nov 15 '11 at 21:32
11

Use Time::Piece (in core since 5.9.5), very similar to the Python solution accepted, as it provides the strptime and strftime functions:

use Time::Piece;
my $dt_str = Time::Piece->strptime('13/10/1979', '%d/%m/%Y')->strftime('%Y-%m-%d');

or

$ perl -MTime::Piece
print Time::Piece->strptime('13/10/1979', '%d/%m/%Y')->strftime('%Y-%m-%d');
1979-10-13
$ 
MkV
  • 3,046
  • 22
  • 16
  • I'm always worried about these sorts of solutions because they only convert the data that are valid dates and won't handle the dirty data. It would be nice if all data were clean, but I find they often aren't. – brian d foy Jul 06 '11 at 19:16
  • What invalid dates are you thinking of? Most would cause strptime to throw an error. – MkV Jul 08 '11 at 08:58
6

Go with Perl: the datetime Python package is just broken. You could just do it with regexes to swap the date parts around, eg

echo "17/01/2010" | perl -pe 's{(\d+)/(\d+)/(\d+)}{$3-$2-$1}g'

If you do need to parse these dates (eg to compute their day of week or other calendar-type operations), look into DateTimeX::Easy (you can install it with apt-get under Ubuntu):

perl -MDateTimeX::Easy -e 'print DateTimeX::Easy->parse("17/01/2010")->ymd("-")'
DomQ
  • 4,184
  • 38
  • 37
  • 4
    What's wrong with python's datetime? SilentGhost's answer is how I would do this myself, unless you know of a limitation. – AndrewF Nov 02 '10 at 13:37
  • 2
    the DateTime modules are overkill, Time::Piece is available on CPAN and has been in core since 5.9.5 / 5.10. – MkV Nov 02 '10 at 17:29
5

Perl :

while (<>) {
  s/(^|[^\d])(\d\d)\/(\d\d)\/(\d{4})($|[^\d])/$4-$3-$2/g;
  print $_;
}

Then you just have to run:

perl MyScript.pl < oldfile.txt > newfile.txt
OMG_peanuts
  • 1,797
  • 1
  • 13
  • 19
  • No need for the `MyScript.pl` or the new file. `perl -pi -e 's{(\d{2})/(\d{2})/(\d{4})}{$3-$2-$1}g' file.txt` will perform an inline replace. – Zaid Nov 02 '10 at 15:36
  • @Zaid, that won’t do the right thing on strings like "123/45/67890"; it will make them "16789-56-230". It should leave them alone, I think. You need to set some boundaries. – tchrist Nov 02 '10 at 18:44
  • And here come the boundaries ! – OMG_peanuts Nov 03 '10 at 13:23
1

Perl:

my $date =~ s/(\d+)\/(\d+)\/(\d+)/$3-$2-$1/;
daxim
  • 39,270
  • 4
  • 65
  • 132
Frost.baka
  • 7,851
  • 2
  • 22
  • 18
0

In Perl you can do:

use strict;
while(<>) {
    chomp;
    my($d,$m,$y) = split/\//;
    my $newDate = $y.'-'.$m.'-'.$d;
}
codaddict
  • 445,704
  • 82
  • 492
  • 529
-2

In glorious perl-oneliner form:

echo 17/01/2010 | perl -p -e "chomp;  join('-', reverse split /\//);"

But seriously I would do it like this:

#!/usr/bin/env perl
while (<>) {
    chomp;
    print join('-', reverse split /\//), "\n";
}

Which will work on a pipe, converting and printing one date per line.

sorpigal
  • 25,504
  • 8
  • 57
  • 75
  • 1
    That only works for super simple input strings, not arbitrary text files. – tchrist Nov 02 '10 at 14:57
  • 2
    @who says it has to work for arbitrary text? Don't solve problems that don't exist. The data might have been validated in the existing format. We don't know one way or the other so why make any assumptions? – frankc Nov 02 '10 at 16:33
  • In fact the author has stated elsewhere that arbitrary text is not required as these are well known text strings from a CSV file. tchrist made this same comment on ther code elsewhere with just as little justification. – sorpigal Nov 02 '10 at 18:38
  • 1
    The problem spec just mentioned that such date string occurred in text files. It at no point stated that the strings to be changed were not really occurring in text files; it fact, he said they were. Therefore solutions that pretend the strings to exist in complete isolation get all kinds of things wrong. You cannot ding people **for answering the question that was asked**. If the questioner had been more careful in his spec, people wouldn't have had to guess. – tchrist Nov 02 '10 at 18:47
  • I cannot fault you for your answer, which was extremely precise and correct, but this doesn't make any of the less-precise answers incorrect. You claim that the off-hand reference to "in text files" means "in arbitrary streams of text requiring the dates to first be parsed out and then converted" whereas I, and others, interpreted this question correctly as being about converting a string representation of a date once that string is obtained and not about parsing it out of a text file. You cannot ding people for answering the question that was asked and not the broader question it implied. – sorpigal Nov 03 '10 at 10:27