I have a problem similar to the one in link.
The solutions there don't seem to work in my particular case, and I suspect this is because of the size of the dataframes.
I have a df of shape (2,018,901 × 6) and a list of length 2,083,656. The purpose of the list (as in the link above) is to insert the missing time slots into the original df.
I tried the accepted solution from the link, but the program keeps running endlessly. The second solution, i.e.,
df.merge(pd.DataFrame({'A': my_list}), how='right', on='A').sort_values('A').reset_index(drop=True)
throws this error:
***MemoryError:
To give some background: I have market clearing prices stored in a df. For every time slot (ts), one per hour, there are 24 open trading sessions, corresponding to the next 24 hours. At some ts the market might not clear, and then I have a gap.
For example, say it's currently ts 1. At this point in time the next 24 ts are open (from ts 2 to ts 25). At ts 2, the open ts's would be from 3 to 26, and so on. So in the data I have a row for each open ts (note that some ts are repeated). When there was no clearing at some ts, there is a jump in the data.
I want to fill in the gaps using my_list, because I am further processing the data together with other sources that have no gaps at any ts.
Here is a piece of df:
A 1 2 bin bin_t1 bin_dum
89 381.0 0.6 153.8 11 7 7.0
90 382.0 0.5 59.4 7 7 7.0
91 383.0 0.5 53.4 7 6 6.0
92 384.0 0.4 45.9 6 6 6.0
93 385.0 0.4 46.5 6 6 6.0
94 386.0 0.3 45.4 6 6 6.0
95 387.0 740.8 34.1 6 5 5.0
96 365.0 0.1 2.4 5 5 5.0
97 366.0 0.1 0.5 5 6 6.0
98 388.0 732.8 38.5 6 5 5.0
99 366.0 1.7 2.7 5 5 5.0
Column A holds the open ts. There is a jump from row 97 to row 98, i.e., from 366 to 388. So my_list contains the complete sequence, including the missing ts, in this case
367 368 369 370 ... 387
The idea is to enlarge df by completing the broken sequences, filling the remaining columns with zeros.
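On a toy frame, the fill I'm after would look like this (the column name price and the small value range are made up purely for illustration):

```python
import pandas as pd

# Toy version of the problem: df has a gap in 'A' (367..387 missing),
# while my_list holds the complete sequence of time slots.
df = pd.DataFrame({'A': [365.0, 366.0, 388.0],
                   'price': [0.1, 0.1, 732.8]})
my_list = [float(t) for t in range(365, 389)]  # 365 .. 388, no gaps

new_index = pd.Index(my_list, name='A')
filled = df.set_index('A').reindex(new_index, fill_value=0).reset_index()
# filled now has 24 rows; the previously missing slots carry price == 0
```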
What I've tried, besides the previous line of code, is
new_index = pd.Index(my_list, name='A')
df = df.set_index('A').reindex(new_index).reset_index()
This approach just keeps running.
The shapes of df and my_list are, respectively,
[2018901 rows x 6 columns]
[1 row x 2083656 columns]
How can I handle this?
Thanks
EDIT
The initial data comes from multiple files. To rule out size as the cause, I tried the solution above on a dataframe built from a single file.
Now I get this error, similar to case:
*** ValueError: cannot reindex from a duplicate axis
I can't afford to drop the duplicated values. Is there a workaround?
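One direction I'm considering, again on a toy frame, is a merge-based fill, since merge tolerates duplicate keys where reindex raises (whether this scales to 2M rows is exactly what I'm unsure about):

```python
import pandas as pd

# Toy frame with a duplicated ts (366 appears twice) plus a gap before 388.
df = pd.DataFrame({'A': [365.0, 366.0, 366.0, 388.0],
                   'price': [0.1, 0.5, 1.7, 732.8]})
my_list = [float(t) for t in range(365, 389)]  # complete sequence 365 .. 388

full = (pd.DataFrame({'A': my_list})
          .merge(df, on='A', how='left')  # keeps both 366 rows; gap slots get NaN
          .fillna(0))                     # replace the NaN rows with zeros
# full has 25 rows: 24 slots plus one extra for the duplicated 366
```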
Thanks.