I'm reading a CSV file. Each row has several values, but I'm only interested in the first and second ones, which have the following format:

2015-11-02 10:07:33,2015-11-02 10:07:52

I need to get the elapsed time between both. My code is:

file = pd.read_csv('file.csv', header=None, skiprows=1, index_col=False,
               chunksize=1000000, usecols=[1, 2], names=['ts', 'te'], na_values=['n/a','N/A','nan','NaN'],
              dtype={'ts':datetime, 'te':datetime})

for chunk in file:
    chunk['duration'] = chunk['te']-chunk['ts']

But I get the following error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

What can I do? Thank you very much.

Update: My problem is solved, but now I need to convert the elapsed time in each chunk to a float. Thank you.

elena.bdc

1 Answer

The subtraction fails because both columns were read as strings. Let pandas parse them as datetimes with the parse_dates option of read_csv:

from pprint import pprint
import pandas as pd

file = pd.read_csv('file.csv', header=None, skiprows=1, index_col=False,
                   chunksize=1000000,
                   na_values=['n/a','N/A','nan','NaN'],
                   usecols=[0, 1], names=['ts', 'te'],
                   parse_dates=['ts', 'te'])

for chunk in file:
    chunk['duration'] = (chunk['te'] - chunk['ts']).dt.total_seconds()  # duration in seconds, as a float
    pprint(chunk)

Example

file.csv file:

start-date,end-date,text
2015-11-02 10:07:33,2015-11-02 10:07:52,foo
2015-11-02 11:07:33,2015-11-02 11:08:52,bar

outputs:

                   ts                  te  duration
0 2015-11-02 10:07:33 2015-11-02 10:07:52      19.0
1 2015-11-02 11:07:33 2015-11-02 11:08:52      79.0
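
For the follow-up about getting a float, here is a minimal self-contained sketch of the same idea. It assumes the sample rows above; `io.StringIO` stands in for file.csv, and the helper name `durations_in_seconds` is made up for illustration. Unlike the snippet above, it names all three columns and selects two of them by name, which is only one way to write the selection.

```python
import io

import pandas as pd

# Sample rows copied from the example file above; StringIO stands in for file.csv.
CSV_TEXT = """start-date,end-date,text
2015-11-02 10:07:33,2015-11-02 10:07:52,foo
2015-11-02 11:07:33,2015-11-02 11:08:52,bar
"""


def durations_in_seconds(source, chunksize=1000000):
    """Hypothetical helper: elapsed seconds per row as floats, read in chunks."""
    reader = pd.read_csv(
        source,
        header=None,
        skiprows=1,                    # skip the original header row
        names=['ts', 'te', 'text'],    # name every column in the file
        usecols=['ts', 'te'],          # keep only the two timestamps
        parse_dates=['ts', 'te'],      # parse them as datetime64 instead of str
        chunksize=chunksize,
    )
    parts = []
    for chunk in reader:
        # Subtracting datetime columns gives timedelta64[ns];
        # .dt.total_seconds() converts that to float64 seconds.
        parts.append((chunk['te'] - chunk['ts']).dt.total_seconds())
    return pd.concat(parts)


durations = durations_in_seconds(io.StringIO(CSV_TEXT))
print(durations.dtype)
print(durations.tolist())
```

If a unit other than seconds is needed, dividing the timedelta column by `pd.Timedelta(minutes=1)` (or another unit) also yields floats.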
luka5z
  • First of all thanks for your answer. It works for me. But now I have a second problem, how can I cast the elapsed time (it's a timedelta64[ns]) to a float? Thank you. – elena.bdc Jun 13 '16 at 10:59