1

I am trying to convert rows with two different values to columns with dates captured under each column.

My Dataframe looks like this. Eventype 1.0 is my start date for a particular Network node & its consecutive EventType value=5 is my end date. Hence i would like to convert Eventype values to columns to find out the start date & end date.

EventID NetworkNode EventTime   EventType
1140085 606.0   2018-09-12 14:11:00 1.0
1140416 606.0   2018-09-12 16:39:00 5.0
1141105 606.0   2018-09-12 22:16:00 1.0
1141109 606.0   2018-09-12 22:19:00 5.0
1141288 421.0   2018-09-12 23:21:00 5.0
1141295 508.0   2018-09-12 23:23:00 5.0
1141568 647.0   2018-10-12 01:09:00 1.0
1141578 647.0   2018-10-12 01:12:00 5.0
1142463 461.0   2018-10-12 05:52:00 1.0
1142467 460.0   2018-10-12 05:53:00 1.0
1142468 502.0   2018-10-12 05:54:00 1.0
1142476 502.0   2018-10-12 05:57:00 5.0
1142493 461.0   2018-10-12 06:00:00 5.0
1142516 460.0   2018-10-12 06:01:00 5.0
1145299 629.0   2018-10-12 21:13:00 1.0
1145411 629.0   2018-10-12 22:16:00 5.0
1145414 629.0   2018-10-12 22:23:00 1.0
1145437 629.0   2018-10-12 22:26:00 5.0
1145437 421.0   2018-10-12 22:26:00 5.0


df = df[['EventID','NetworkNode', 'EventTime', 'EventType']].sort_values(by=['EventID'])

df = df.set_index(['NetworkNode','EventType'])['EventTime'].unstack()

I tried this code but gives error,

"ValueError: Index contains duplicate entries, cannot reshape" as Network Node has duplicate.

My desired Output should be something like this.

Value "1.0" in EventType column represents the start date & time of that event for that NetworkNode and the successive value "5.0" for the same NetworkNode will be the end time. Therfore i would like to convert these 2s row into 1 single row by its start & end time.

NetworkNode   1.0                      5.0
606.0       2018-09-12 14:11:00     2018-09-12 16:39:00
606.0           2018-09-12 22:16:00     2018-09-12 22:19:00
421.0           2018-09-12 23:21:00 2018-10-12 23:26:00
508.0                               2018-09-12 23:23:00
647.0           2018-10-12 01:09:00 2018-10-12 01:12:00
461.0           2018-10-12 05:52:00 2018-10-12 06:00:00
460.0           2018-10-12 05:53:00 2018-10-12 06:01:00
502.0           2018-10-12 05:54:00 2018-10-12 05:57:00
629.0           2018-10-12 21:13:00 2018-10-12 22:16:00
629.0           2018-10-12 22:23:00 2018-10-12 22:26:00

Please advise....

  • I think this is not a simple pivot_table issue, as there are several occurrences for each index. – Aryerez Sep 19 '19 at 11:36
  • You have not said anything on it, and your expected output does not respect it, but I would expect the end time for an event to be greater than the start time. Can you confirm whether this is (or not) a requirement? – Serge Ballesta Sep 19 '19 at 12:25
  • @jezrael: I do not think that the real answer is a pivot, because the underlying question is about start and end time of events. IMHO `merge_asof` is the key here, even if OP has not shown that he was aware of the problem. – Serge Ballesta Sep 19 '19 at 12:53
  • I have edited my actual requirement. I think i have brought clarity to my ask. Thanks for the help. – sudheep radhakrishnan Sep 19 '19 at 14:34

1 Answers1

0

So far that I can answer

The main problem is that: to form Pivot Table like that, you require Unique Index, Index cannot be duplicated, so here I have 2 Options I can share with you

1) Concatenate the EventID and NetworkNode together to make it a unique Index, and form a pivot table

data = pd.read_csv(path, encoding="ISO-8859-1")
data_cp = data.copy()
data["Node_ID"] = ""
for x in range(len(data)):
    data["Node_ID"][x] = str(data["NetworkNode"][x]) + "_" + str(data["EventID"][x])
data.pivot(index='Node_ID', columns='EventType', values='EventTime')

Result be like this when run: Index is NodeID

2) Simply do Groupby using these 2 Key -> NetworkNode and EventType (Don't need to make it an index)

data_cp = data.copy()
data_cp.drop(columns=["EventID"], inplace=True)
view = data_cp.groupby(by=['NetworkNode','EventType'])["EventTime"]
view.first()

Groupby by 2 Columns

Jovan
  • 763
  • 7
  • 26
  • This doesnt help in getting the output that need. Value "1.0" under event type represent the start time of that node and its successive value "5.0" under event type will be the end time of the same node. So the reult for each of those nodes must be in one single row with start time (1.0) and end time (5.0). Thank for the help. – sudheep radhakrishnan Sep 19 '19 at 14:27