0

I have the following type of .csv data in huge quantity:

Red     11.1    12170612    137186880.5 01-Apr-14
pink    52.4    35139204    1902951951  01-Aug-11
yellow  18.45   15074808    281444556.8 01-Aug-12

How do I sort it by date (the 5th column above)?

Andrew Janke
  • 23,508
  • 5
  • 56
  • 85
rkg
  • 1
  • 1
  • You could write something to [convert that date into a unix timestamp](http://stackoverflow.com/questions/9637838/convert-string-date-to-timestamp-in-python), sort it, and then convert back. – John Dorian May 04 '15 at 21:35
  • Please define huge in GB. – Michas May 04 '15 at 21:39
  • I got say 6,000,000 lines of such data and need to sort them by time and save as csv file. I tried various options on this website but can't tackle the above date format. – rkg May 04 '15 at 21:43
  • Show us one of the options you tried, so that we can know what you're missing. – Robᵩ May 04 '15 at 21:44
  • Sorry, I was not very specific about the data last time. Here it is - data = [['pink', 9644921,'01-APR-2010','0'], ['yellow', 2243817504.85, '01-APR-2011','0'], ['pink', 223468, '01-APR-2013', '23891'], ['orange', 137186880.5, '01-APR-2014', '1'], ['yellow', 51950.8, '01-AUG-2011', '87674']] How do I sort this date wise? Thanks for your help. – rkg May 04 '15 at 21:58
  • So is your data a list or a csv? – Scott May 04 '15 at 22:24

3 Answers3

0

Assuming the date always starts in column 40 and runs to the end of the line, this will sort your data:

with open('foo.in') as input_file:
    lines = input_file.readlines()
lines.sort(key=lambda x:datetime.datetime.strptime(x[40:-1], '%d-%b-%y'))

If you have some other way of isolating the date (always 5th field, always last 9 chars of a line, whatever), adjust the lambda accordingly.

Robᵩ
  • 163,533
  • 20
  • 239
  • 308
  • My data does not always start from a certain column. Please see the new data sample in my comments above. Thanks for your help! – rkg May 04 '15 at 22:00
0

You can use pandas:

import pandas as pd
data = pd.read_csv('date.csv', sep='\t', header=None)
data[4] = pd.to_datetime(data[4])
data.sort(4)

Out[2]:

    0       1       2            3              4
1   pink    52.40   35139204    1.902952e+09    2011-08-01
2   yellow  18.45   15074808    2.814446e+08    2012-08-01
0   Red     11.10   12170612    1.371869e+08    2014-04-01

where 'date.csv' is the data you posted in your question:

Red 11.1    12170612    137186880.5 01-Apr-14
pink    52.4    35139204    1902951951  01-Aug-11
yellow  18.45   15074808    281444556.8 01-Aug-12
Scott
  • 6,089
  • 4
  • 34
  • 51
0

For your updated data:

import numpy as np
import pandas as pd

your_data = [['pink', 9644921,'01-APR-2010','0'], ['yellow', 2243817504.85, '01-APR-2011','0'], ['pink', 223468, '01-APR-2013', '23891'], ['orange', 137186880.5, '01-APR-2014', '1'], ['yellow', 51950.8, '01-AUG-2011', '87674']]
aa = np.array(your_data)
data = pd.DataFrame(aa)
data[2] = pd.to_datetime(data[2])
data = data.sort(2)
print data

It will give:

        0              1          2      3
0    pink        9644921 2010-04-01      0
1  yellow  2243817504.85 2011-04-01      0
4  yellow        51950.8 2011-08-01  87674
2    pink         223468 2013-04-01  23891
3  orange    137186880.5 2014-04-01      1
Ashish
  • 729
  • 1
  • 9
  • 23