I have a list of Pandas DataFrame objects, df_list
, which I have loaded from csv and parquet files and which have TimeStamp objects as indices. I want to merge all of them into one DataFrame object as follows:
- The indices for each DataFrame object are unique. The indices for the merged DataFrame should be unique as well and it should be sorted by the index column.
- If an index exists in two or more of the original DataFrame objects, the column values for that index must be the same for all of them. Otherwise the script should display an error message (e.g., raise an exception). Some columns have float values and it is possible that some of those values lost precision when they were originally saved as csv, as doing so requires converting the values to text which involves rounding. If this is the case the value comparison needs to take this into account.
- Any of the original DataFrame objects is considered to cover the time period between the index of the first row and the index of the last row. I would like to know which time periods the merged dataframe object covers. This is the union of the time periods of the original DataFrame objects and may contain gaps.
How can I do this efficiently using Pandas commands?
I have tried the following:
intervals = pd.DataFrame(columns=column_name_list).set_index(index_name)
for current_df in df_list:
for index in current_df.index:
if index in intervals.index:
if current_df.loc[index] != intervals.loc[index]:
raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
intervals.loc[index] = current_df.loc[index]
but this is terribly slow, and I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-a3867a7c8aa1> in <module>
78 for index in current_df.index:
79 if index in intervals.index:
---> 80 if current_df.loc[index] != intervals.loc[index]:
81 raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
82 intervals.loc[index] = current_df.loc[index]
D:\ProgramData\Miniconda3\envs\stocks\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
so it seems I cannot compare two rows using the !=
operator. (Besides, my code currently doesn't take possible rounding errors into account and doesn't determine the covered time periods.)