
I have a list of Pandas DataFrame objects, df_list, which I have loaded from CSV and Parquet files and which have Timestamp objects as indices. I want to merge all of them into one DataFrame object as follows:

  • The indices within each DataFrame object are unique. The indices of the merged DataFrame should be unique as well, and it should be sorted by its index.
  • If an index exists in two or more of the original DataFrame objects, the column values for that index must be the same in all of them; otherwise, the script should display an error message (e.g., raise an exception). Some columns have float values, and it is possible that some of those values lost precision when they were originally saved as CSV, since doing so requires converting the values to text, which involves rounding. If this is the case, the value comparison needs to take this into account.
  • Each of the original DataFrame objects is considered to cover the time period between the index of its first row and the index of its last row. I would like to know which time periods the merged DataFrame covers; this is the union of the time periods of the original DataFrame objects and may contain gaps.

How can I do this efficiently using Pandas commands?

I have tried the following:

intervals = pd.DataFrame(columns=column_name_list).set_index(index_name)
for current_df in df_list:
    for index in current_df.index:
        if index in intervals.index:
            if current_df.loc[index] != intervals.loc[index]:
                raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
        intervals.loc[index] = current_df.loc[index]

but this is terribly slow, and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-a3867a7c8aa1> in <module>
     78             for index in current_df.index:
     79                 if index in intervals.index:
---> 80                     if current_df.loc[index] != intervals.loc[index]:
     81                         raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
     82                 intervals.loc[index] = current_df.loc[index]

D:\ProgramData\Miniconda3\envs\stocks\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

so it seems I cannot compare two rows using the != operator. (Besides, my code currently doesn't take possible rounding errors into account and doesn't determine the covered time periods.)

HelloGoodbye

1 Answer


I have tried the following:

This loops over every row of each new dataframe and inserts it into the existing dataframe one row at a time; each insertion has to copy the existing data, so the total work is O(n^2), which is slow. If you can sort the data and deal with this in a single pass, it will be much faster.

How can I do this efficiently using Pandas commands?

You can do this with concat, then sort, then use Index.duplicated() with keep=False to get a boolean mask over your dataframe's index.

Then, use that mask with .loc[] to get the subset of the dataframe which has duplicated index values. Call this the dupe dataframe. Then, use drop_duplicates() with keep=False to drop all copies of rows which are exactly the same, treating the index as a column. If the dupe dataframe has a non-zero number of rows at this point, raise an exception, because you have duplicate index values with different column values, and the dupe dataframe contains the offending data.

This explanation is a little vague, sorry. You didn't include an example dataset, so I don't have anything to test against.
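
Concretely, the approach might look like this (untested for the same reason; merge_frames is just a placeholder name):

import pandas as pd

def merge_frames(df_list):
    # Concatenate all frames and sort by the timestamp index.
    merged = pd.concat(df_list).sort_index()

    # Boolean mask: True for every row whose index value occurs more than once.
    dupes = merged.loc[merged.index.duplicated(keep=False)]

    # drop_duplicates() ignores the index, so turn it into a column first.
    # keep=False removes *all* copies of rows that agree on index and values,
    # so whatever is left is a duplicated index with conflicting values.
    # Note: this comparison is exact; see the float-precision caveat below.
    conflicts = dupes.reset_index().drop_duplicates(keep=False)
    if not conflicts.empty:
        raise RuntimeError("Conflicting entries:\n{}".format(conflicts))

    # Now it is safe to keep one copy of each duplicated index value.
    return merged.loc[~merged.index.duplicated(keep="first")]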

Some columns have float values, and it is possible that some of those values lost precision when they were originally saved as CSV, since doing so requires converting the values to text, which involves rounding.

Here's a simpler approach: you should read the CSV in the same way that it was written out. Look at the round_trip option for the float_precision parameter of read_csv.
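
For example (the file name and the other read options here are hypothetical):

import pandas as pd

# float_precision="round_trip" uses a slower parser that reproduces the
# floats exactly as they appear in the text file.
df = pd.read_csv(
    "prices.csv",          # hypothetical file name
    index_col=0,
    parse_dates=True,
    float_precision="round_trip",
)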

If this is unworkable, see Dropping duplicates with less precision.
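
One way to do a tolerance-based comparison yourself (a sketch, not necessarily what that link does; it assumes all columns are numeric, and the tolerance is an arbitrary placeholder) is to take the dupe dataframe from the sketch above and compare each group of rows sharing an index with numpy.isclose:

import numpy as np

def check_dupes_close(dupes, rtol=1e-12):
    # `dupes` is the duplicated-index subset built in the sketch above.
    for ts, group in dupes.groupby(level=0):
        values = group.to_numpy(dtype=float)
        # Compare every other row in the group against the first row.
        if not np.isclose(values[1:], values[0], rtol=rtol, atol=0.0).all():
            raise RuntimeError("Entries for {} do not match:\n{}".format(ts, group))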

but this is terribly slow, and I get the following error:

If you call df.loc[foo], that will look up foo in the index and give you a Series containing that row's values. If you compare two Series with == or !=, you get another Series of True/False values, one per column. To treat this Series as a single boolean value, you need to decide whether you're interested in whether any value is True, or whether you only want True if all values are True.

e.g.

if (current_df.loc[index] != intervals.loc[index]).any():
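
A quick demonstration of the difference (using made-up Series):

import pandas as pd

a = pd.Series({"open": 1.0, "close": 2.0})
b = pd.Series({"open": 1.0, "close": 2.5})

print(a != b)          # elementwise: open False, close True
print((a != b).any())  # True -- at least one column differs
print((a != b).all())  # False -- not every column differs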
Nick ODell
  • Ok, I wasn't sure whether the DataFrame stored the indices in some form of tree structure (as I have found surprisingly little information about the memory layout of a Pandas DataFrame); in that case, determining whether an index exists could have taken O(log n) time ... but on the other hand, I guess inserting the element in the right position would still take O(n) time, so the code would still run in O(n^2) in total. – HelloGoodbye Dec 28 '20 at 22:28
  • For information about Pandas memory layout, click [here](http://www.jeffreytratner.com/slides/pandas-under-the-hood-pydata-seattle-2015.pdf). For information about time complexity of index lookup, click [here](https://stackoverflow.com/questions/58876676/what-is-the-time-complexity-of-at-and-loc-in-pandas?noredirect=1&lq=1). – Nick ODell Dec 29 '20 at 18:26