7

I'm trying to import a simple CSV file with NumPy's genfromtxt, but I can't manage to convert the data in the first column to dates.

Here is my code:

import numpy as np
from datetime import datetime

str2date = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

data = np.genfromtxt('C:\\data.csv', dtype=None, names=True, delimiter=',', converters={0: str2date})

I get the following error in str2date:

TypeError: must be str, not bytes

The problem is there are many columns, so I'd prefer avoiding the specification of all the column types (which are basically numerical).

Mark Morrisson

3 Answers

7

The problem is that the argument passed to str2date is a bytes object such as b'2011-01-01 00:00:00', not a str. datetime.strptime only accepts str, hence the TypeError. The fix is simple: decode the byte string (here as UTF-8) before parsing it:

str2date = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S')
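As a hedged sketch: whether the converter receives bytes or str depends on the `encoding` argument (and, as far as I know, on the NumPy version), so a converter that accepts both sidesteps the question. The name `to_datetime` is made up for illustration and is not part of NumPy, and the UTF-8 assumption is mine.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_datetime(value):
    """Parse a timestamp given either as bytes or as str (hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: the file is UTF-8 (or plain ASCII) encoded.
        value = value.decode("utf-8")
    # Strip stray whitespace around the field before parsing.
    return datetime.strptime(value.strip(), DATE_FORMAT)


print(to_datetime(b"2011-01-01 00:00:00"))
print(to_datetime("2011-01-01 00:00:00"))
```

It can then be passed in place of the lambda, e.g. `converters={0: to_datetime}`.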

Nelewout
  • 1
    Thanks. I don't understand why the type chosen by genfromtxt is a byte array instead of a string... Anyway, I did your modification, ran the loading but got another error: ConverterError: Converter #0 is locked and cannot be upgraded: (occurred line #1 for value 'b'2011-01-01 00:00:00'') though the date is valid... – Mark Morrisson Jan 07 '16 at 16:01
  • 1
    @MarkMorrisson see my updated answer! I hope this helps. – Nelewout Jan 07 '16 at 23:40
  • Thanks, it works perfectly now! Could you explain the trick with UTF-8? It looks complicated just to import a date... – Mark Morrisson Jan 08 '16 at 15:08
  • @MarkMorrisson you are not specifying the `dtype` for your columns, so `numpy` is guessing the best possible fit, which it thinks is a byte string. We just need to decode that to a more 'human' format, like UTF-8. If this answer resolves your problem, consider marking it as accepted :). – Nelewout Jan 08 '16 at 15:30
  • Oh, I understood the problem, str() applied to a byte string keeps the 'b' within the final string, which is absurd... – Mark Morrisson Jan 08 '16 at 22:34
  • @MarkMorrison [Format Specification](https://docs.python.org/2/library/string.html#format-specification-mini-language): a `b''` string is a valid string representation, just not the desired one for this particular use case. – Nelewout Jan 08 '16 at 22:41
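To make the point from the comments concrete, here is a small sketch of the difference between calling str() on a byte string and decoding it. The sample value is the one from the error message above; the variable names are just for illustration.

```python
raw = b"2011-01-01 00:00:00"

as_repr = str(raw)             # textual representation, keeps the b'...' wrapper
as_text = raw.decode("utf-8")  # the actual characters, suitable for strptime

print(as_repr)
print(as_text)
```

Only the decoded form matches a format such as '%Y-%m-%d %H:%M:%S'; the str() form starts with the characters b and ', so strptime rejects it.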
1

When reading a CSV column whose values represent dates, we must take into account how the dates are written, for example:

- 2021/12/05 = %Y/%m/%d
- 21/12/05 = %y/%m/%d
- 05/12/2021 = %d/%m/%Y
- 05/12/21 = %d/%m/%y
- 05-12-21 = %d-%m-%y
- ...

These representations must be taken into account when writing the lambda function used as a converter in NumPy's genfromtxt() function. genfromtxt() accepts several parameters; among them is converters, which can be used in different ways. Here we use it to convert the values of a column into date values.
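As a quick check of the format strings listed above, this sketch parses each sample with its format using only the standard library. All five should come out as the same calendar day; the `samples` list simply restates the examples from the list.

```python
from datetime import datetime

# (text, format) pairs taken from the list above.
samples = [
    ("2021/12/05", "%Y/%m/%d"),
    ("21/12/05", "%y/%m/%d"),
    ("05/12/2021", "%d/%m/%Y"),
    ("05/12/21", "%d/%m/%y"),
    ("05-12-21", "%d-%m-%y"),
]

# Parse every sample and keep only the date part.
parsed = [datetime.strptime(text, fmt).date() for text, fmt in samples]

for (text, fmt), day in zip(samples, parsed):
    print(f"{text!r} with {fmt!r} -> {day.isoformat()}")
```

Using the wrong format either raises ValueError or, worse, silently swaps day and month, so it is worth testing the format on one value before loading the whole file.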

converters : variable, optional

    The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing data: 
     converters = {num_col: lambda_function }.

num_col - represents the number of the column to which the function will be applied

lambda_function - represents the function that we will build for the conversion

For this example, we will use a file with two columns, date and level, separated by semicolons (;) and encoded as UTF-8:

date;level
02-03-15;232.8
09-03-15;233.0
16-03-15;233.2
23-03-15;233.6
30-03-15;233.9
06-04-15;234.3
13-04-15;234.8
20-04-15;235.3
27-04-15;235.9

Our code should be:

import numpy as np
from datetime import datetime

str2date = lambda x: datetime.strptime(x, '%d-%m-%y')
data = np.genfromtxt(file_path, delimiter=';', dtype=None, names=True, converters = {0: str2date}, encoding='utf-8')

Replace file_path with the path to the file, including the file name and its extension.

The delimiter: str, int, or sequence, optional. The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The dtype : dtype, optional. Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.

The names : {None, True, str, sequence}, optional. If names is True, the field names are read from the first line after the first skip_header lines. This line can optionally be preceded by a comment delimiter. If names is a sequence or a single string of comma-separated names, the names will be used to define the field names in a structured dtype. If names is None, the names of the dtype fields will be used, if any.

The encoding: str, optional. The encoding used to decode the input file.

To extract the data and work with it we can:

levels = data['level']
dates = data['date']
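Since the converter returns datetime objects, the extracted values support ordinary date arithmetic. A small sketch, using plain Python lists as stand-ins for the first and last rows of the sample file above rather than the actual NumPy arrays:

```python
from datetime import datetime

# Stand-ins for data['date'] and data['level'] (first and last sample rows only).
dates = [datetime(2015, 3, 2), datetime(2015, 4, 27)]
levels = [232.8, 235.9]

# Subtracting two datetimes gives a timedelta.
elapsed_days = (dates[-1] - dates[0]).days
# Round to one decimal to hide floating-point noise.
level_change = round(levels[-1] - levels[0], 1)

print(elapsed_days, level_change)
```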
-3

This is a very good idea. I had the same problem when I tried to use numpy with Python 3.4. With Python 2.7.10 it is not necessary. Thank you. :-) Here is my sample.

File input:

06-07-2016,95.5300,30877540.0000,94.6000,95.6600,94.3700
05-07-2016,95.0400,27553750.0000,95.3900,95.4000,94.4600
01-07-2016,95.8900,25982080.0000,95.4900,96.4650,95.3300

Code:

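import datetime
import numpy

# Read only the first column and parse it as a date.
# On Python 3 the converter received bytes here, hence the decode.
dates = numpy.loadtxt(
    'data.csv',
    dtype=object,
    converters={0: lambda x: datetime.datetime.strptime(x.decode("utf-8"), "%d-%m-%Y")},
    delimiter=',',
    usecols=(0,),
    unpack=True
)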
Vladimir Vagaytsev