2

I tried different things for too long now.

How do I load csv data containing dates into a numpy array? This is what doesn't work. It creates a single line with everything that is supposed to be a line now being in a single cell.

import io
import numpy as np
import datetime as dt


def date_parser(d_bytes):
    s = d_bytes.decode('utf-8')
    return np.datetime64(dt.datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))


def read_csv():
    five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39
                """
    nparray = np.genfromtxt(io.StringIO(five_min_candles_str),
                            delimiter=',',
                            dtype=[('Timestamp','datetime64[us]'),
                                   ('Open','object'),
                                   ('High','object'),
                                   ('Low','object'),
                                   ('Close','object')],
                            converters={0: date_parser},
                            )
    print(nparray)


if __name__ == "__main__":
    read_csv()

A solution or hint would be much appreciated!

Edit: It turned out that it was indeed working already but I expected an 2D array while it became an array of tuples after I've added the types or the converter. The reason for that are the different types in a row. See the other SO question

I marked the answer below as correct anyways as I like it more because it doesn't need any custom parsing of the date and I also like the splitlines() solution more compared to io.StringIO()

Sandro
  • 1,757
  • 1
  • 19
  • 29
  • Why don't you use `pandas`? – Quang Hoang Jun 02 '20 at 20:43
  • Because I would like to iterate over the rows which I learned is an anti pattern with `pandas`. – Sandro Jun 02 '20 at 20:45
  • row iteration is discourage **if** you can find vectorized function, but it's there for you to use at any time. And in this case, `pandas` helps you simplify your intake of csv file a lot. – Quang Hoang Jun 02 '20 at 20:46
  • The data is not so diverse so the reading would be pretty much done. If I can find a solution here numpy would still be the preferred way to go I guess. But I don't have much experience in both of them right now to really judge that. – Sandro Jun 02 '20 at 20:59
  • I'll use it to as well to append rows one by one and after each addition calling mathematical functions (technical analysis indicators) on the last n rows. I'm actually not sure if pandas or numpy is the best fit for that and why. Any good arguments are much appreciated! – Sandro Jun 02 '20 at 21:04
  • 1
    Row iteration in `pandas` is no worse than in `numpy`. We see plenty of SO using `pandas` `apply`. For speed it is nice to operate on the whole dataframe/array, but some tasks might be too complicated to do that. – hpaulj Jun 02 '20 at 21:51
  • `df.to_numpy()` produces a numpy array from a DataFrame. so it's not an either /or decision. Choose the loader that's easiest to use. – hpaulj Jun 02 '20 at 21:55
  • Yea, I tried that as well. But I loose the date in the index. Instead only numbers are there then. – Sandro Jun 02 '20 at 21:56
  • Based on my test, I think all you are missing is the `splitlines()`. – hpaulj Jun 03 '20 at 02:33
  • Well it works with `io.StringIO(five_min_candles_str)` as well but I like `splitlines()` more. Thanks for the hint! – Sandro Jun 03 '20 at 15:50

2 Answers2

2
In [53]: five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,95
    ...: 14.52 
    ...: 2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0 
    ...: 2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53 
    ...: 2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55 
    ...: 2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73 
    ...: 2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72 
    ...: 2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66 
    ...: 2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85 
    ...: 2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75 
    ...: 2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57 
    ...: 2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98 
    ...: 2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25 
    ...: 2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09 
    ...: 2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93 
    ...: 2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78 
    ...: 2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39 
    ...: """                                                                    

Let's see how numpy handles these date strings. It's not as robust as pandas, but:

In [55]: np.array('2020-06-01 17:05:00', 'datetime64[s]')                       
Out[55]: array('2020-06-01T17:05:00', dtype='datetime64[s]')

but looks ok. The space between date and time is ok ('T' also works).

So lets try a fully automatic dtype:

In [56]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dt
    ...: ype=None, encoding=True)                                               
In [57]: data                                                                   
Out[57]: 
array([('2020-06-01 17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01 17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01 17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
        ...
       ('2020-06-01 18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<U19'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])

SO we need to specify the datetime dtype (editing that dtype):

In [58]: dt = [('f0', 'datetime64[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')] 

In [59]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dtype=dt, encoding=True)                                                 
In [60]: data                                                                   
Out[60]: 
array([('2020-06-01T17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01T17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01T17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
       ('2020-06-01T17:20:00', 9523.21, 9525.53, 9518.78, 9524.55),
      ...
       ('2020-06-01T18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<M8[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Ok, got it. The data was looking wrong in the IDE but it was indeed ok. I still like your answer more as there is no parsing needed. – Sandro Jun 03 '20 at 15:55
0

for completeness, this is pretty much a one-liner using pandas:

from  io import StringIO
import pandas as pd

s="""2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39"""

df = pd.read_csv(StringIO(s), names=['Timestamp','Open','High','Low','Close'],
                 parse_dates=[0])

for c in df.columns:
    print(f"{c} - {df[c].dtype}")
# Timestamp - datetime64[ns]
# Open - float64
# High - float64
# Low - float64
# Close - float64
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • Yea, that worked pretty much out of the box for me as well. `parse_dates=True` is enough btw. – Sandro Jun 03 '20 at 15:24