0

I have a problem that involves converting time series from one representation to another. Each item in the time series has attributes "time", "id", and "value" (think of it as a measurement at "time" for sensor "id"). I'm storing all the items in a Pandas dataframe with columns named by the attributes.

The set of "time"s is a small set of integers (say, 32), but some of the "id"s are missing "time"s/"value"s. What I want to construct is an output dataframe with the form:

id  time0 time1 ... timeN
     val0  val1 ...  valN

where the missing "value"s are represented by NaNs.

For example, suppose the input looks like the following:

time  id   value
   0   0      13
   2   0      15
   3   0      20
   2   1      10
   3   1      12

Then, assuming the set of possible times is 0, 2, and 3, the desired output is:

  id   time0  time1  time2  time3
   0      13    NaN     15     20
   1      NaN   NaN     10     12

I'm looking for a Pythonic way to do this since there are several million rows in the input and around 1/4 million groups.

Mark Lavin
  • 1,002
  • 1
  • 15
  • 30

1 Answers1

1

You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.

For your example, the simple pivot is sufficient:

>>> df = df.pivot(index="id", columns="time", values="value")

time     0     2     3
id                    
0     13.0  15.0  20.0
1      NaN  10.0  12.0

To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:

# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)

# name them "time#"
df.columns = "time" + df.columns.astype(str)

# remove the column index name "time"
df = df.rename_axis(None, axis=1)

Final df:

    time0  time1  time2  time3
id                            
0    13.0    NaN   15.0   20.0
1     NaN    NaN   10.0   12.0
w-m
  • 10,772
  • 1
  • 42
  • 49