
I'm trying to cast a pyarrow timestamp column to the time64 type, but it raises a cast error.

import pyarrow as pa
from datetime import datetime

dt = datetime.now()
table = pa.Table.from_pydict({'ts': pa.array([dt, dt])})
new_schema = table.schema.set(0, pa.field('ts', pa.time64('us')))
table.schema
# ts: timestamp[us]
new_schema
# ts: time64[us]

table.cast(new_schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1329, in pyarrow.lib.Table.cast
  File "pyarrow/table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
  File "/home/inspiron/.virtualenvs/par/lib/python3.7/site-packages/pyarrow/compute.py", line 243, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 446, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 275, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from timestamp[us] to time64 using function cast_time64

Is there any way to make this casting possible?

Avinash Raj

1 Answer


time64[us] is a time of day: it represents the number of microseconds since midnight. It is not tied to any specific date, so a timestamp cannot be cast to it directly.

The Arrow docs are a bit sparse, but the Parquet docs explain it better:

TIME

TIME is used for a logical time type without a date, with millisecond, microsecond, or nanosecond precision. The type has two type parameters: UTC adjustment (true or false) and unit (MILLIS, MICROS, or NANOS).

TIME with unit MILLIS is used for millisecond precision. It must annotate an int32 that stores the number of milliseconds after midnight.

TIME with unit MICROS is used for microsecond precision. It must annotate an int64 that stores the number of microseconds after midnight.

TIME with unit NANOS is used for nanosecond precision. It must annotate an int64 that stores the number of nanoseconds after midnight.

The sort order used for TIME is signed.

Pace
  • Is there a way to do something like `cast(int(col.time()), time64('us'))`? – Avinash Raj Aug 13 '21 at 05:36
  • Actually I'm trying to convert a `12:25` string in a CSV to the corresponding time64 object during CSV-to-Parquet conversion – Avinash Raj Aug 13 '21 at 05:37
  • 1
    Well, the good news is that pyarrow 6.0.0 should support [parsing time strings](https://issues.apache.org/jira/browse/ARROW-11243) in CSV to time32. Casting from time32 to time64 should be doable. You can test this with a nightly build if you want. Otherwise you'll have to parse it yourself with something else. What is `int(col.time())`? If you have the microseconds since midnight you can cast from `int64` to `time64`. – Pace Aug 13 '21 at 05:56
  • Maybe [this gist](https://gist.github.com/westonpace/23fc1baee017e2aa5c9d6d5825d34bdf) will help. – Pace Aug 13 '21 at 06:04
  • Directly extracting the time component of a timestamp is not yet supported (see https://issues.apache.org/jira/browse/ARROW-13549 for a tracking issue). But as @Pace said, if you originally had a CSV file, directly parsing as time will be the better option. – joris Aug 13 '21 at 08:28
  • @Pace Even though the other answer looks clumsy, casting with two schemas will work, right? (ts -> int64 -> time64) – Avinash Raj Aug 13 '21 at 11:48
  • could you please undelete the other answer? – Avinash Raj Aug 13 '21 at 11:48
  • It will not work. It will give you an invalid value. `datetime.now` cast to an integer will be the # of microseconds since the epoch. If you interpret that as the # of microseconds since midnight that will not be valid. – Pace Aug 13 '21 at 17:50
  • but I guess since we are converting int64 to a time dt, it automatically subtracts the midnight timestamp from the now timestamp. So the result is only the integer offset from midnight to now, which it then simply converts to a time dt. – Avinash Raj Aug 14 '21 at 05:00
  • ```
    >>> d
    datetime.datetime(2016, 3, 26, 23, 12, 20)
    >>> d1
    datetime.datetime(2016, 3, 27, 1, 12, 20)
    >>> table = pa.Table.from_pydict({'ts': pa.array([d, d1])})
    >>> table.to_pydict()
    {'ts': [datetime.datetime(2016, 3, 26, 23, 12, 20), datetime.datetime(2016, 3, 27, 1, 12, 20)]}
    >>> m
    ts: int64
    >>> n
    ts: time64[us]
    >>> table
    pyarrow.Table
    ts: timestamp[us]
    >>> table.cast(m).cast(n).to_pydict()
    {'ts': [datetime.time(23, 12, 20), datetime.time(1, 12, 20)]}
    ```
    – Avinash Raj Aug 14 '21 at 05:05
  • 1
    Yes, I think you are right, but I'm not sure this is the correct interpretation and it may break in the future (for example, if Arrow added kernels for Time32/Time64 manipulation or started to require that time32/time64 be less than one day). I've raised a [question on the mailing list](https://lists.apache.org/thread.html/r1ca03b758c9c1679b6a90c9c1fe17b12c603215729f012f10c31bc1d%40%3Cdev.arrow.apache.org%3E) and will update my answer based on the results. – Pace Aug 16 '21 at 18:55
  • @Pace also I can't read list types from CSV; it shows `pyarrow.lib.ArrowNotImplementedError: CSV conversion to list is not supported`. Will they support this feature in the future? – Avinash Raj Aug 23 '21 at 13:04
  • @AvinashRaj That probably should be a new question but, no, I'm not aware of anyone currently working on such a feature and I don't see any JIRA ticket. You're welcome to create one to discuss it further. How would you expect such a list to be represented? Would it be sufficient to read the column in as a string and then split it? – Pace Aug 23 '21 at 17:57
  • Also, the discussion ended up in a [vote](https://lists.apache.org/thread.html/r858ee15f396bbdbeaf7a5a1e5b0c7ebf1b76f198ec75158a4f3230d9%40%3Cdev.arrow.apache.org%3E) where it was decided that Arrow will not support the above interpretation (casting timestamp->int64->time) so I will not add it as an answer. – Pace Aug 23 '21 at 17:59
  • @Pace Hi, I'm getting `In CSV column #10: CSV conversion error to string: invalid UTF8 data` while converting a large CSV to Parquet. Since it's a large file with millions of lines, I can't identify the error-causing line. Is there any option in pyarrow's read_csv function to report bad lines? – Avinash Raj Sep 13 '21 at 00:33
  • asked https://stackoverflow.com/questions/69156181/pyarrow-find-bad-lines-in-csv-to-parquet-conversion – Avinash Raj Sep 13 '21 at 00:41