
I have an array, recognised as a `numpy.ndarray` object, which prints the following output when running this code:

import savReaderWriter as sRW

with sRW.SavReaderNp('C:/Users/Sam/Downloads/Data.sav') as reader:
record = reader.all()
print(record)

Output:

[(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', b'Sam', 250000., '2019-08-05T00:00:00.000000')
 (b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', b'James',  250000., '2019-08-05T00:00:00.000000')
 (b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', b'Mark', 250000., '0001-01-01T00:00:00.000000')]

I really want to process empty date variables within a pandas DataFrame, but when I run the following code an error appears (shown below the code):

SPSS_df = pd.DataFrame(record)

Error: "Out of bounds nanosecond timestamp: 1-01-01 00:00:00"

I've read through the source code of SavReader Module Documentation and it says if a Datetime value is not found, the following date is assigned:

datetime.datetime(datetime.MINYEAR, 1, 1, 0, 0, 0)

I wondered how I could process this date without encountering this error, perhaps by changing/manipulating the code above?
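One way to sketch this (using a hypothetical in-memory sample instead of the actual `.sav` file): since `savReaderWriter` assigns the year-1 sentinel `datetime.datetime(datetime.MINYEAR, 1, 1, 0, 0, 0)` to missing dates, and year 1 is out of range for pandas' nanosecond timestamps, parsing the date column with `errors='coerce'` turns the sentinel into `NaT` instead of raising:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the structured array returned by SavReaderNp;
# the year-1 date is the sentinel savReaderWriter uses for missing dates.
record = np.array(
    [(b'61D8894E', b'Sam', 250000.0, '2019-08-05T00:00:00.000000'),
     (b'61D8894E', b'Mark', 250000.0, '0001-01-01T00:00:00.000000')],
    dtype=[('id', 'S40'), ('name', 'S10'), ('salary', 'f8'), ('date', 'U26')],
)

# errors='coerce' maps out-of-bounds values to NaT instead of raising
dates = pd.to_datetime(record['date'], errors='coerce')
print(dates)
```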

  • 2
    Why don't you convert `record` to a pandas dataframe? – PythonSherpa Oct 08 '19 at 07:23
  • Because an error occurs if a date variable is missing from the SPSS file, so I'm trying to change it before I convert it to a pandas dataframe –  Oct 08 '19 at 18:29
  • Can you try this? First, `import numpy as np`, then change the reading to: `with sRW.SavReaderNp('C:/Users/Sam/Downloads/Data.sav', rawMode=False, recodeSysmisTo=np.nan) as reader:` – PythonSherpa Oct 08 '19 at 20:07
  • I've tried; it doesn't work, unfortunately –  Oct 08 '19 at 20:21
  • Can you post an exact copy of the output when dates are missing? Is record a `list` of `tuples`? – PythonSherpa Oct 08 '19 at 20:27
  • value = tslibs.conversion.ensure_datetime64ns(value) File "pandas\_libs\tslibs\conversion.pyx", line 123, in pandas._libs.tslibs.conversion.ensure_datetime64ns File "pandas\_libs\tslibs\np_datetime.pyx", line 118, in pandas._libs.tslibs.np_datetime.check_dts_bounds pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00 –  Oct 08 '19 at 20:31
  • Ok, and also an exact copy of sample data in `record`? Is it a string or a list of tuples? – PythonSherpa Oct 09 '19 at 05:57
  • I've shown this above; the first two columns are strings (hence the `b'`), the next is a float and finally a datetime variable –  Oct 09 '19 at 06:21
  • If it is a list, you can use a list comprehension like `record = [(x[0], x[1], x[2], np.nan) if x[3] == '0001-01-01T00:00:00.000000' else x for x in record]` – PythonSherpa Oct 09 '19 at 10:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200620/discussion-between-sam-and-hoenie). –  Oct 09 '19 at 18:26
  • Ahh, It’s a numpy.ndarray object? So perhaps an extension to this may help? –  Oct 09 '19 at 18:27
  • Have you thought of overriding the builtin function which assigns the default time? That way you can give the nanoseconds the way you want. – Yatish Kadam Oct 10 '19 at 17:50
  • @Sam have you considered [this post](https://stackoverflow.com/questions/39905822/out-of-bounds-nanosecond-timestamp?rq=1)? It seems to suggest that either setting an explicit time format string or setting the `dayfirst` field may have previously solved this problem? I'm sure it's super obvious as it's one of the top search results on this error, but thought I'd mention. – mayosten Oct 10 '19 at 17:53
  • i don't understand your problem, you are not able to create a dataframe using `record`? – Mohsen_Fatemi Oct 10 '19 at 17:56
  • Yes @Yatish Kadam, I'm just not sure how to do this –  Oct 10 '19 at 19:09
  • @Sam this is a good example of overriding built in functions https://stackoverflow.com/questions/58173218/how-can-i-override-shadow-another-modules-function-something-like-a-shim-or-a – Yatish Kadam Oct 10 '19 at 19:18
  • This error occurs when you try to convert date before 1970-01-01. You should google "posix time" for details. In your case, you can read in the datetime as strings, and deal with it. – Ian Oct 11 '19 at 06:04
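The list-comprehension workaround suggested in the comments can be sketched like this (assuming `record` behaves like a list of tuples; the sample tuples here are hypothetical stand-ins for the real data). The sentinel string is replaced with `np.nan` before building the DataFrame, so `pd.to_datetime` yields `NaT` for the missing dates:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows shaped like the question's record array
record = [
    (b'61D8894E', b'Sam', 250000.0, '2019-08-05T00:00:00.000000'),
    (b'61D8894E', b'Mark', 250000.0, '0001-01-01T00:00:00.000000'),
]

# Replace the year-1 sentinel with NaN before conversion
cleaned = [
    (x[0], x[1], x[2], np.nan) if x[3] == '0001-01-01T00:00:00.000000' else x
    for x in record
]

df = pd.DataFrame(cleaned)
df[3] = pd.to_datetime(df[3])  # NaN becomes NaT
print(df)
```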

1 Answer


What you can do is read all the records as strings (`object` dtype) and afterwards convert the columns into the wanted types (float and datetime):

import numpy as np
import pandas as pd

record = [
    (
        b'61D8894E-7FB0-3DE6-E053-6C04A8C01207',
        b'Sam',
        250000.0,
        '2019-08-05T00:00:00.000000',
    ),
    (
        b'61D8894E-7FB0-3DE6-E053-6C04A8C01207',
        b'James',
        250000.0,
        '2019-08-05T00:00:00.000000',
    ),
    (
        b'61D8894E-7FB0-3DE6-E053-6C04A8C01207',
        b'Mark',
        250000.0,
        '0001-01-01T00:00:00.000000',
    ),
]

SPSS_df = pd.DataFrame(record, dtype=object).rename(
    {2: 'some_float', 3: 'dates'}, axis='columns'
).assign(
    some_float=lambda x: x['some_float'].astype(np.float64),
    dates=lambda x: pd.to_datetime(x['dates'], errors='coerce'),
)

This gives:

0  b'61D8894E-7FB0-3DE6-E053-6C04A8C01207'    b'Sam'    250000.0 2019-08-05
1  b'61D8894E-7FB0-3DE6-E053-6C04A8C01207'  b'James'    250000.0 2019-08-05
2  b'61D8894E-7FB0-3DE6-E053-6C04A8C01207'   b'Mark'    250000.0        NaT

and the types:

SPSS_df.dtypes
0                     object
1                     object
some_float           float64
dates         datetime64[ns]
ndclt
  • While this avoids the error, I don't get 'NaT' in the missing date places; I get random dates like '1754-08-30 22:43:41.128654' –  Oct 14 '19 at 06:20
  • Can you give me an example of the missing date values? – ndclt Oct 14 '19 at 20:02
  • Instead of dtype='object', dtype='str' worked? I'm not sure why, and it removed the b' prefix –  Oct 14 '19 at 20:14
  • The `str` dtype converts all your data into strings. Your previous chains of characters were bytes. – ndclt Oct 14 '19 at 20:40
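As the last comments touch on, the `b'...'` prefixes come from bytes values. With the `object`-dtype approach from the answer, those columns can also be decoded explicitly (a sketch on hypothetical sample data, using pandas' `Series.str.decode`):

```python
import pandas as pd

# Hypothetical bytes columns like the question's first two fields
df = pd.DataFrame(
    [(b'61D8894E', b'Sam'), (b'61D8894E', b'Mark')], dtype=object
)

# Decode each bytes column to regular strings, dropping the b'' prefix
for col in df.columns:
    df[col] = df[col].str.decode('utf-8')

print(df)
```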