
I found that we can create date-time columns in a Pandas DataFrame by doing this:

>>> import pandas
>>> dt1 = pandas.DatetimeIndex(["2016-03-04 15:01:49",
                                "2016-03-05 23:54:22",
                                "2016-04-03 21:22:08",
                                "2016-04-03 21:22:08",
                                "2016-03-05 23:54:22"])
>>> df1 = pandas.DataFrame([["firefly", 37],
                            ["wood", 47],
                            ["snowflake", 12],
                            ["waterfall", 67],
                            ["wind", 208]],
                           columns = ["what", "count"])
>>> df1['when_last'] = dt1
>>> df1
        what  count           when_last
0    firefly     37 2016-03-04 15:01:49
1       wood     47 2016-03-05 23:54:22
2  snowflake     12 2016-04-03 21:22:08
3  waterfall     67 2016-04-03 21:22:08
4       wind    208 2016-03-05 23:54:22

This is my question: Is this a legal construct? Part of my confusion is this: is DatetimeIndex supposed to be able to accommodate duplicate and unordered dates when it is used as a column rather than as an index?
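A quick way to probe this is to build a DatetimeIndex that is deliberately duplicated and unsorted, inspect it, and then assign it to a column. This is only a minimal sketch; the values and the names `dt` and `df` are made up for illustration, and the behavior is what I would expect from a reasonably recent pandas version.

```python
import pandas as pd

# A DatetimeIndex with a repeated value and out-of-order values.
dt = pd.DatetimeIndex(["2016-03-05 23:54:22",
                       "2016-03-04 15:01:49",
                       "2016-03-05 23:54:22"])

# Neither uniqueness nor sortedness is required at construction time.
print(dt.is_unique)                # False: one value appears twice
print(dt.is_monotonic_increasing)  # False: values are not sorted

# Assigning the index to a column copies its values positionally.
df = pd.DataFrame({"what": ["wood", "firefly", "wind"]})
df["when_last"] = dt
print(df.dtypes)
```

The column ends up with a datetime dtype, so the DatetimeIndex is effectively just a convenient container for building it.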

This is the use case that prompted the experiment above: I have a table that I want to process using Pandas. It has many (but not too many) fields, about 40 of them, and contains tens of thousands of records or more. The original format of this table is CSV text. The processing will basically be SQL-like analytics (filter, join, sort, etc.), for which Pandas has decent capabilities. Three or four of these fields are date-time fields (stored as UNIX timestamps in the CSV file). None of them is suitable as an index for the DataFrame rows: they are dates of several events belonging to a record, and they can contain duplicates, since events can be stamped with exactly the same date-time value.

Several Stack Overflow users have suggested (in posts like this one) that parsing date-times directly in read_csv with the date_parser argument works quite poorly, and that performance is probably mediocre as well, if the dates are parsed one by one. Given that the raw columns contain simply UNIX timestamps, we should be able to get high performance. The other problem is that, as far as I can tell, to_datetime does not let me assign a timezone to the UNIX timestamps. The example above doesn't have a timezone, but I want to include one in my real case.
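For reference, here is a rough sketch of the kind of vectorized conversion I have in mind: read the timestamp columns as plain integers, then convert each whole column in one call. The column names, the sample rows, and the target timezone are hypothetical. It assumes the timestamps are seconds since the epoch in UTC, and it relies on the `unit` and `utc` arguments of `to_datetime` plus `Series.dt.tz_convert`, which may not behave identically in every pandas version.

```python
import io
import pandas as pd

# Hypothetical CSV: "when_last" holds UNIX timestamps in seconds (assumed UTC).
csv_text = """what,count,when_last
firefly,37,1457103709
wood,47,1457222062
wind,208,1457222062
"""

# Read the timestamps as ordinary integers; no per-row date parsing here.
df = pd.read_csv(io.StringIO(csv_text))

# Convert the whole column at once; utc=True makes the result timezone-aware.
df["when_last"] = pd.to_datetime(df["when_last"], unit="s", utc=True)

# Optionally view the same instants in another zone (zone name is an example).
df["when_last_local"] = df["when_last"].dt.tz_convert("America/New_York")

print(df.dtypes)
print(df)
```

Duplicate timestamps, as in the last two rows, are not a problem here because the columns are ordinary data rather than the row index.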

Wirawan Purwanto

1 Answer


Is this a legal construct?

Yes.

ℕʘʘḆḽḘ
    I wonder who gave this answer a -1 score. If you (the downvoter) disagree, please post another answer to refute this one. – Wirawan Purwanto Jul 05 '16 at 16:46