0

I have some input CSV from which I am building a second dataframe based on certain conditions. Part of the CSV looks like this:

Input

│       Date       │ Instrument │  Value  │
├──────────────────┼────────────┼─────────┤
│ 2018-09-01 00:05 │ Foo        │ -0.3552 │
│ 2018-09-01 00:05 │ Bar        │ -0.4423 │
│ 2018-09-01 00:05 │ Foo        │ -0.6511 │
│ 2018-09-01 00:05 │ Bar        │ -0.8773 │
│ 2018-09-01 00:15 │ Foo        │  0.0045 │
│ 2018-09-01 00:15 │ Bar        │  0.0904 │
│ 2018-09-01 00:20 │ Foo        │  0.6691 │
│ 2018-09-01 00:20 │ Bar        │  0.7401 │

In this step, I am trying to create a dataframe indexed by time that should look like this:

Output

│                  │ Column_Foo │ Column_Bar │
├──────────────────┼────────────┼────────────┤
│ 2018-09-01 00:05 │    -1.0063 │    -1.3196 │
│ 2018-09-01 00:15 │     0.0045 │     0.0904 │
│ 2018-09-01 00:20 │     0.6691 │     0.7401 │

Essentially, I want to sum together any values that occur on the same datetime in the CSV input.

I can use newDf.loc[index, column] = value to add value to a new cell in location index, column, but I see no way of appending value to the number in the cell if it already contains a number.

newDf.loc[index, column] += value fails with a KeyError if the index or column doesn't exist yet (even if you put it inside a pd.isnull()!). And if they both do exist but the cell is empty, newDf.loc[index, column] returns NaN which together with += just yields NaN.

My attempt

My best attempt so far, which works but I really hope there's a cleaner way to do this.

def awfulWayOfDoingThings(row, out):
    if row.Date in out.index and "Column_" + row.Instrument in out.columns:
        if np.isnan(out.at[row.Date, "Column_" + row.Instrument]):
            out.at[row.Date, "Column_" + row.Instrument] = 0.0
        out.at[row.Date, "Column_" + row.Instrument] += row.Value
    else:
        out.at[row.Date, "Column_" + row.Instrument] = row.Value

newDf = pd.DataFrame()
inputDf.apply(awfulWayOfDoingThings, axis=1, args=[newDf])

I essentially have to check:

  1. Does the index exist already?
  2. Does the column exist already?
  3. Is the cell NaN?

I really think I must be getting something wrong here and that there must be a significantly simpler way to do this.

What I would like to do

def lovelyWayOfDoingThings(row, out):
    out.at[row.Date, "Column_" + row.Instrument] += row.Value

newDf = pd.DataFrame()
inputDf.apply(lovelyWayOfDoingThings, axis=1, args=[newDf])
Doggie52
  • 153
  • 6

0 Answers0