I have a CSV file in which one of the columns contains a range of real numbers with a 0.1 step interval. However, some of the values are missing. I've loaded the data into a pandas DataFrame and set this column as the index. I am having difficulty identifying the missing index values due to the way floating point numbers are represented. I would like to find the missing rows and fill them with np.nan
.
For example, given
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.3 80.7 45.9
0.4 10.9 82.3
0.5 37.7 85.4
0.6 90.4 69.9
0.7 17.0 57.2
0.8 70.0 49.6
1.0 58.6 65.1
I would like to get
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.2 NaN NaN
0.3 80.7 45.9
0.4 10.9 82.3
0.5 37.7 85.4
0.6 90.4 69.9
0.7 17.0 57.2
0.8 70.0 49.6
0.9 NaN NaN
1.0 58.6 65.1
I have tried using pandas.DataFrame.reindex, but it's not working due to floating point representation. The following following is what I've tried so far (my data has a much larger range):
import sys
import numpy as np
import pandas as pd
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
data = """, A, B
0.0, 78.4, 65.9
0.1, 72.9, 29.2
0.3, 80.7, 45.9
0.4, 10.9, 82.3
0.5, 37.7, 85.4
0.6, 90.4, 69.9
0.7, 17.0, 57.2
0.8, 70.0, 49.6
1.0, 58.6, 65.1"""
fp = StringIO(data)
df = pd.read_csv(fp, index_col=0)
print(
df.reindex(
pd.Index(
np.arange(df.index[0], df.index[-1]+0.1, 0.1)
)
)
)
This outputs
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.2 NaN NaN
0.3 NaN NaN
0.4 10.9 82.3
0.5 37.7 85.4
0.6 NaN NaN
0.7 NaN NaN
0.8 70.0 49.6
0.9 NaN NaN
1.0 58.6 65.1