
I have several large dataframes which are built up from a vehicle log. Only one message can be present on the CAN bus (the vehicle communication protocol) at any time, so each row contains values for just one message.

This is a simplified dataframe without any interpolation:

time    messageA1    messageA2    messageA3    messageB1    messageB2    messageC1     messageC2
0       1            2            1            NaN          NaN          NaN           NaN
1       NaN          NaN          NaN          NaN          NaN          3             2
2       NaN          NaN          NaN          3            7            NaN           NaN

This can continue for millions of rows, with NaN values making up about 95% of the entire dataframe. I have read that when a NaN/Null/None value is stored in a dataframe, it is held as a float64 value.

My questions:

  1. Is a float64 value allocated for every NaN value?
  2. If yes, does it do this memory efficiently?
  3. Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to processing performance?
RMRiver
  • Why not use a sparse structure instead? – cs95 Apr 28 '20 at 08:38
  • Please refer to the [sparse reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html); it describes exactly the scenario you are investigating – Daemon Painter Apr 28 '20 at 08:40
  • If the `dtype` is any numeric type (or structured type) then it is essentially a primitive buffer. So yes, a float value is allocated for every `nan`. It is the same with any `float64` value or numeric dtype. – juanpa.arrivillaga Apr 28 '20 at 08:44

1 Answer


Is a float64 value allocated for every NaN value?

Yes, it is.
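
A quick way to see this is to check the memory footprint of a column that is nothing but NaN (a minimal sketch; the series length is arbitrary):

    import numpy as np
    import pandas as pd

    # one million NaNs stored in an ordinary (dense) Series
    s = pd.Series([np.nan] * 1_000_000)
    print(s.dtype)                      # float64
    print(s.memory_usage(index=False))  # 8000000 -> 8 bytes per NaN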

If yes, does it do this memory efficiently?

No, it does not; instead, you should use a sparse data structure.
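
For example, here is a minimal sketch of converting a dense frame to pandas' sparse dtype, so that only the non-fill values are actually stored (`df` just stands in for your log dataframe; the column names and sizes are made up):

    import numpy as np
    import pandas as pd

    # a mostly-NaN frame standing in for the CAN log
    df = pd.DataFrame(np.nan, index=range(1_000_000),
                      columns=['messageA1', 'messageB1'])
    df.iloc[::20, 0] = 1.0  # roughly 5% of one column carries a reading

    # keep NaN as the fill value, so only the real readings are stored
    sdf = df.astype(pd.SparseDtype("float64", np.nan))
    print(sdf.dtypes)  # Sparse[float64, nan] for each column

The sparse `sdf` built this way is what the memory comparison below refers to.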

Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to process performance?

Yes, it will, for all those operations that are O(f(N)), depending on f(N). Think of averaging the data, for instance: you have to check whether each value is NaN and skip it (or perhaps treat it as 0, depending on the case), and that is pure overhead.

You might want to compare the sheer size of dense (your current implementation) against sparse data structures in your case:

    # total bytes used by each frame, converted to kilobytes
    print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
    print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))

The two numbers should be pretty different.

Daemon Painter
  • Just so I understand - Why do you divide by 1e3? – RMRiver Apr 28 '20 at 11:40
  • I copy/pasted it without thinking, but the [memory_usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html) function returns bytes (for each index, see reference), so dividing by 1k returns KBytes instead. I'll edit my answer to correct the typo, thanks! – Daemon Painter Apr 28 '20 at 12:55