Split timestamp column CSV

Question

I have a CSV file in the following format:

name, lat, lon, alt, time
id1, 40.436047, -74.814883, 33000, 2016-01-21T08:08:00Z

I am trying to use Python to split the time into new columns so it looks like this:

name, lat, lon, alt, year, month, day, hour, min, sec
id1, 40.436047, -74.814883, 33000, 2016,-01,-21, 08, 08, 00

I also want to set the amount of places in the float columns to always be set to 5 decimal places.

This is the script I have so far:

import numpy as np

name,lat,lon,alt,time = np.loadtxt(
    'test_track.csv',
    delimiter=',',
    dtype='str',
    skiprows=1,
    unpack = True
    )


year = time[0:3]
print year

Unfortunately, instead of parsing the time into year, it prints out the first full times instead of just the year.

Use `datetime.strptime` to parse out the datetime, don't split it — OneCricketeer, Oct 03 '16 at 20:51
Are you trying to load this data into one or more `numpy` arrays to do calculations, or are you just wanting to reformat the file - that is, just write the same data back to file with the new format? You don't need numpy to edit the file. — hpaulj, Oct 03 '16 at 20:57

Ji Mun · Accepted Answer · 2016-10-04T23:24:03.607

2

[Edited + actually ran it on my computer this time..]

Like others mentioned, I think it would suffice to use the built-in libraries to do what you want to do. And using dateutil parser should allow you to use the datetime column in an easy way.

But if you still want to re-create the CSV file...

To create a new CSV file in that format that you want, you can do something like:

#!/usr/bin/env python

import dateutil.parser
import csv

with open('original.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')

    write_file = open("new.csv", 'w')

    reading_label_line = True

    for row in reader:
        if reading_label_line:
            reading_label_line = False
            write_file.write("name, lat, lon, alt, year, month, day, hour, min, sec\n")
        else:
            dt = dateutil.parser.parse(row[-1])
            row = row[0:len(row)-1] # cut off the last item (datetime)
            row.append(dt.year)
            row.append(dt.month)
            row.append(dt.day)
            row.append(dt.hour)
            row.append(dt.minute)
            row.append(dt.second)
            write_file.write(', '.join(str(x) for x in row) + '\n')

If you want to keep the '-' in front of month and day, just add the dash in front of dt.month and dt.day.

edited Oct 04 '16 at 23:24

answered Oct 03 '16 at 21:18

Ji Mun

1,800
4
18
27

What's `read_file`? `reader`? What's the `row << dt.year` syntax? That doesn't look like `python`. – hpaulj Oct 03 '16 at 21:57
@hpaulj - Yeah you are right... I had the syntax mixed up with different files and had in some extra lines I didn't clean up there. Thanks for pointing it out! – Ji Mun Oct 03 '16 at 23:22
@JiMun I tried running the code you suggested and got the error: python test.py File "test.py", line 7 SyntaxError: Non-ASCII character '\xe2' in file test.py on line 7, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details – Waterguy Oct 04 '16 at 13:47
@Waterguy - Ah I fixed those and this time I ran and fixed other mistakes too. (This is my first time answering a question on StackOverflow so thank you for pointing that out!) – Ji Mun Oct 04 '16 at 23:24

June Skeeter · Answer 2 · 2016-10-03T21:24:20.240

0

You should try importing the data with Pandas instead of numpy. Panda read_csv handles dates quite nicely

try something like this

import pandas as pd
yourData = pd.read_csv(yourData_Path,delimiter = ',',skiprows = 0,
           parse_dates={'time':[-1]},header = 1,na_values = -9999)

Pandas also allows you to index by the datetimes which is quite nice :)

edited Oct 03 '16 at 21:24

answered Oct 03 '16 at 21:09

June Skeeter

1,142
2
13
27

hpaulj · Answer 3 · 2016-10-04T02:57:48.843

This answer follows your lead in using loadtxt, and hopefully explains what you got, and alternatives. But if you aren't doing any calculations, it may be simpler to just read each line, split it, and write it back in the desire format. A csv reader may make that task simpler, but is not essential. Plain Python line read and writes, and string manipulation will work.

============

Using a string copy of your sample (bytestring in PY3):

In [296]: txt=b"""name, lat, lon, alt, time
     ...: id1, 40.436047, -74.814883, 33000, 2016-01-21T08:08:00Z
     ...: id2, 40.436047, -74.814883, 33000, 2016-01-21T08:08:00Z""".splitlines(
     ...: )
In [297]: txt
Out[297]: 
[b'name, lat, lon, alt, time',
 b'id1, 40.436047, -74.814883, 33000, 2016-01-21T08:08:00Z',
 b'id2, 40.436047, -74.814883, 33000, 2016-01-21T08:08:00Z']

In [298]: data=np.loadtxt(txt,delimiter=',',dtype=np.string_,skiprows=1)
In [299]: data
Out[299]: 
array([[b'id1', b' 40.436047', b' -74.814883', b' 33000',
        b' 2016-01-21T08:08:00Z'],
       [b'id2', b' 40.436047', b' -74.814883', b' 33000',
        b' 2016-01-21T08:08:00Z']], 
      dtype='|S21')
In [300]: data[:,4]
Out[300]: 
array([b' 2016-01-21T08:08:00Z', b' 2016-01-21T08:08:00Z'], 
      dtype='|S21')

Or with the unpack

In [302]: name,lat,lon,alt,time=np.loadtxt(txt,delimiter=',',dtype=np.string_,sk
     ...: iprows=1,unpack=True)
In [303]: time
Out[303]: 
array([b' 2016-01-21T08:08:00Z', b' 2016-01-21T08:08:00Z'], 
      dtype='|S21')

we've loaded the file as a 2d array of strings, or 5 1d arrays. time is an array of strings.

I can convert this array of strings into an array of datatime objects:

In [307]: time1 = time.astype(np.datetime64)
In [308]: time1
Out[308]: array(['2016-01-21T08:08:00', '2016-01-21T08:08:00'], dtype='datetime64[s]')
In [309]: time1[0]
Out[309]: numpy.datetime64('2016-01-21T08:08:00')

I could even load it directly with datetimes. But this doesn't solve your display issues.

=====================

genfromtxt gives more power to loading different column types

In [312]: np.genfromtxt(txt,dtype=None,skip_header=1,delimiter=',')
Out[312]: 
array([(b'id1', 40.436047, -74.814883, 33000, b' 2016-01-21T08:08:00Z'),
       (b'id2', 40.436047, -74.814883, 33000, b' 2016-01-21T08:08:00Z')], 
      dtype=[('f0', 'S3'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), ('f4', 'S21')])

this gives a mix of string, floats and int. The dates are still strings.

If I replace the dtype=None with a specific dtype, I can dates as before:

In [313]: dt=['S3','f','f','i','datetime64[s]']

In [315]: data=np.genfromtxt(txt,dtype=dt,skip_header=1,delimiter=',')

In [316]: data
Out[316]: 
array([ (b'id1', 40.4360466003418, -74.81488037109375, 33000, datetime.datetime(2016, 1, 21, 8, 8)),
       (b'id2', 40.4360466003418, -74.81488037109375, 33000, datetime.datetime(2016, 1, 21, 8, 8))], 
      dtype=[('f0', 'S3'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i4'), ('f4', '<M8[s]')])
In [317]: data['f4']
Out[317]: array(['2016-01-21T08:08:00', '2016-01-21T08:08:00'], dtype='datetime64[s]')

===============

A first cut at writing this back out to file

In [318]: np.savetxt('test.txt',data,fmt='%4s, %.5f, %.5f, %d, %s')

In [320]: cat test.txt
b'id1', 40.43605, -74.81488, 33000, 2016-01-21T08:08:00
b'id2', 40.43605, -74.81488, 33000, 2016-01-21T08:08:00

Controlling the float precision is obvious. I need to fix the 1st byte string display. And it does not split the date - I'm just displaying is normal string representation.

=================

You can convert the np.datetime64 array into an array of datetime objects:

In [361]: from datetime import datetime
In [362]: data['f4'].astype(datetime)
Out[362]: 
array([datetime.datetime(2016, 1, 21, 8, 8),
       datetime.datetime(2016, 1, 21, 8, 8)], dtype=object)

I can convert this into an array of strings with comma delimiter:

In [383]: tfmt='%Y, %m, %d, %H, %M, %S'
In [384]: timefld=data['f4'].astype(datetime)
In [385]: timefld = np.array([d.strftime(tfmt) for d in timefld])
In [386]: timefld
Out[386]: 
array(['2016, 01, 21, 08, 08, 00', '2016, 01, 21, 08, 08, 00'], 
      dtype='<U24')

=========================

A pure text editing approach could use functions like

def foo(dtstr):
    return dtstr.replace(b'-',b', ').replace(b':',b', ').replace(b'T',b', ').replace(b'Z',b'')

def foo(dtstr):
    # cleaner version with re
    import re
    return re.sub(b'[-:T]',b', ',dtstr[:-1])


def editline(aline):
    aline=aline.split(b',')
    aline[4]=foo(aline[4])
    return b', '.join(aline)

In [408]: [editline(aline) for aline in txt[1:]]
Out[408]: 
[b'id1,  40.436047,  -74.814883,  33000,  2016, 01, 21, 08, 08, 00',
 b'id2,  40.436047,  -74.814883,  33000,  2016, 01, 21, 08, 08, 00']

Split timestamp column CSV

3 Answers3