0

I tried to merge the two columns begin and end to Flag and Timestamp with this piece of code:

print(df_DisponibilityAlarm.shape)

df_DisponibilityAlarm = (df_DisponibilityAlarm.stack()
 .rename_axis([None, 'Flag'])
 .reset_index(level=1, name='Timestamp'))

print(df_DisponibilityAlarm.shape)

The result is :

                         begin                          end
0                          NaN  2019-10-21  07:48:28.272688
1                          NaN  2019-10-21  07:48:28.449916
2  2019-10-21  07:48:26.740378                          NaN
3  2019-10-21  07:48:26.923764                          NaN
4                          NaN  2019-10-21  07:48:41.689466
5  2019-10-21  07:48:37.306045                          NaN
6                          NaN  2019-10-21  07:58:00.774449
7  2019-10-21  07:57:59.223986                          NaN
8                          NaN  2019-10-21  08:32:37.004455
9  2019-10-21  08:32:35.755252                          NaN

(13129, 2)
(13140, 2)

    Flag                    Timestamp
0    end  2019-10-21  07:48:28.272688
1    end  2019-10-21  07:48:28.449916
2  begin  2019-10-21  07:48:26.740378
3  begin  2019-10-21  07:48:26.923764
4    end  2019-10-21  07:48:41.689466
5  begin  2019-10-21  07:48:37.306045
6    end  2019-10-21  07:58:00.774449
7  begin  2019-10-21  07:57:59.223986
8    end  2019-10-21  08:32:37.004455
9  begin  2019-10-21  08:32:35.755252

It works ! But when I looked closely, I see when I use "stack()" the number of the rows increase... I don't understand why, can you explain me please ? I need this to validate my starting hypothesis.

Nadhir
  • 528
  • 3
  • 12
  • 1
    Look at the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html) about how this behaves. `stack()` will "stack" column values into long format as apposed to `unstack()`, which reshapes DataFrame into wide format. In particular, pay attention to the examples in the docs to visually see what happens. – Ukrainian-serge Mar 17 '20 at 00:37
  • @Ukrainian-serge Thanks for your help. So my question is, how to check if I have all the data since the result is different thanks to stack() ? – Nadhir Mar 17 '20 at 12:43
  • Your question is not reproducible, meaning I cant just copy and paste and reprocude your results. Read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to do that. As I copy and paste your examples it is not reproducible. – Ukrainian-serge Mar 17 '20 at 18:13
  • @Ukrainian-serge any idea ? since my new post – Nadhir Mar 27 '20 at 15:55
  • 1
    Read below answer **`.stack()` does not "stack" `begin`/`end` column values that contain `NaN`** – Ukrainian-serge Apr 29 '20 at 23:43

1 Answers1

1

EDITED

Inherently, stack() function makes your dataset longer BUT as you can see in your sample printout:

rows that contained NaN are not included after stacking. <-------------


Consider the test case below, added TEST_VALUE:

begin,end
NaN,2019-10-21  07:48:28.272688
NaN,2019-10-21  07:48:28.449916
2019-10-21  07:48:26.740378,NaN
2019-10-21  07:48:26.923764,NaN
NaN,2019-10-21  07:48:41.689466
2019-10-21  07:48:37.306045,TEST_VALUE
NaN,2019-10-21  07:58:00.774449
2019-10-21  07:57:59.223986,NaN
NaN,2019-10-21  08:32:37.004455
2019-10-21  08:32:35.755252,NaN

df = pd.read_clipboard(sep=',') 

print(df.shape)
print(df.stack().shape)
print(df.stack())

(10, 2)
(11, )
0  end      2019-10-21  07:48:28.272688
1  end      2019-10-21  07:48:28.449916
2  begin    2019-10-21  07:48:26.740378
3  begin    2019-10-21  07:48:26.923764
4  end      2019-10-21  07:48:41.689466
5  begin    2019-10-21  07:48:37.306045
   end                       TEST_VALUE
6  end      2019-10-21  07:58:00.774449
7  begin    2019-10-21  07:57:59.223986
8  end      2019-10-21  08:32:37.004455
9  begin    2019-10-21  08:32:35.755252

# this part is strange
print(df.stack()[5])
begin    2019-10-21  07:48:37.306045
end                       TEST_VALUE
dtype: object

It's pretty strange how index 5 is as compared to the rest and I don't have an answer as to why it is a type of multiindex and the rest in the Series are not, but let us continue...

As you can see this added an extra row for TEST_VALUE and AGAIN any value with NaN is not included in the stacked version.

I would "standardize" or make sure your data contains either NaN or date, or fill your NaN with some other value using fillna().

If you wanted to keep your data the way it is, you could also process your data in chunks to zero in on WHERE(in my sample its TEST_VALUE) the data is not uniformed like so:

# read df in as an iterator with chunks

df1 = pd.read_clipboard(
                    sep=',', 
                    chunksize=3  # change this to whatever chunksize you need
)

def test_chunks(df_iterator):
    """
    Function that compares original df chunk size shape to stacked chunksize shape
    Returns: original chunk where there is a mismatch in shapes
    """
    for df_chunk in df_iterator:
        original = df_chunk.shape[0]
        stacked = df_chunk.stack().shape[0]
        if original != stacked:
            return df_chunk

bad_chunk = test_chunks(df1)
print(bad_chunk)

                         begin                          end
3  2019-10-21  07:48:26.923764                          NaN
4                          NaN  2019-10-21  07:48:41.689466
5  2019-10-21  07:48:37.306045                   TEST_VALUE

So, although I cannot be sure because I don't have access to your full data, my verdict would be that it is a data uniformity issue.

I hope this is helpful.

Ukrainian-serge
  • 854
  • 7
  • 12