1

I have a dataframe as below.

+----------+-------+-----+-------+-------+-----+
|   Date   |   A   |  B  |   C   |   D   |  E  |
+----------+-------+-----+-------+-------+-----+
| 1-Jan-21 | 1,089 |     |       |  195  |     |
+----------+-------+-----+-------+-------+-----+
| 2-Jan-21 |  609  | 547 |  491  | 1,091 |     |
+----------+-------+-----+-------+-------+-----+
| 3-Jan-21 | 1,440 | 824 |  184  |  512  | 665 |
+----------+-------+-----+-------+-------+-----+
| 4-Jan-21 |  704  | 632 |  345  |       | 969 |
+----------+-------+-----+-------+-------+-----+
| 5-Jan-21 |  297  |     | 1,426 |       | 555 |
+----------+-------+-----+-------+-------+-----+

df = pd.DataFrame({
    'Date' : ['1-Jan-21','2-Jan-21','3-Jan-21','4-Jan-21','5-Jan-21'],
    'A': [1089, 609, 1440, 704, 297],
    'B': [np.nan, 547, 824, 632, np.nan],
    'C': [np.nan, 491, 184, 345, 1426],
    'D': [195, 1091, 512, np.nan, np.nan],
    'E': [np.nan, np.nan, 665, 969, 555]
})

I want to ignore the Nan values and bring down all the values to a single base/floor and rename the Date values as t, t-1,t-2 and so on.

my desired output is as below.

+------+-------+-----+-------+-------+-----+
| time |   A   |  B  |   C   |   D   |  E  |
+------+-------+-----+-------+-------+-----+
|  t-4 | 1,089 |     |       |       |     |
+------+-------+-----+-------+-------+-----+
|  t-3 |  609  |     |  491  |       |     |
+------+-------+-----+-------+-------+-----+
|  t-2 | 1,440 | 547 |  184  |  195  | 665 |
+------+-------+-----+-------+-------+-----+
|  t-1 |  704  | 824 |  345  | 1,091 | 969 |
+------+-------+-----+-------+-------+-----+
|   t  |  297  | 632 | 1,426 |  512  | 555 |
+------+-------+-----+-------+-------+-----+

how should I approach to solve this in pandas or numpy?

Tommy
  • 515
  • 4
  • 13

2 Answers2

3

You can sorting values by sorted per columns with keys parameter and then change index by list comprehension:

df = df.set_index('Date').apply(lambda x: pd.Series(sorted(x, key=pd.notna)))
df.index = [f't-{x}' if x!=0 else 't' for x in range(len(df)-1, -1, -1)]

print (df)
        A      B       C       D      E
t-4  1089    NaN     NaN     NaN    NaN
t-3   609    NaN   491.0     NaN    NaN
t-2  1440  547.0   184.0   195.0  665.0
t-1   704  824.0   345.0  1091.0  969.0
t     297  632.0  1426.0   512.0  555.0

In numpy is possible use very nice @Divakar function:

df = df.drop('Date', axis=1)
#https://stackoverflow.com/a/44559180/2901002
df = pd.DataFrame(justify(df.to_numpy(), invalid_val=np.nan, axis=0, side='down'), 
                  columns=df.columns)

df.index = [f't-{x}' if x!=0 else 't' for x in range(len(df)-1, -1, -1)]

print (df)
          A      B       C       D      E
t-4  1089.0    NaN     NaN     NaN    NaN
t-3   609.0    NaN   491.0     NaN    NaN
t-2  1440.0  547.0   184.0   195.0  665.0
t-1   704.0  824.0   345.0  1091.0  969.0
t     297.0  632.0  1426.0   512.0  555.0

If need first column Time use DataFrame.insert:

L = [f't-{x}' if x!=0 else 't' for x in range(len(df)-1, -1, -1)]

df.insert(0, 'Time', L)
print (df)
  Time       A      B       C       D      E
0  t-4  1089.0    NaN     NaN     NaN    NaN
1  t-3   609.0    NaN   491.0     NaN    NaN
2  t-2  1440.0  547.0   184.0   195.0  665.0
3  t-1   704.0  824.0   345.0  1091.0  969.0
4    t   297.0  632.0  1426.0   512.0  555.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This works like a charm, Accepted and upvoted. But, I'm unable to comprehend the `sorted` method in the `pandas` answer. even though we're sorting, the final result is not sorted. understood correct? – Tommy Feb 12 '21 at 08:00
  • 1
    @Tommy - It should be sorted by `key` parameter, which is testing not missing values, so it should working correct, more info [here](https://docs.python.org/3/howto/sorting.html#key-functions) – jezrael Feb 12 '21 at 08:03
  • Hi, what is the advantage of having `pd.Series()` method in the first line? answer is same without it as well. – Tommy Feb 19 '21 at 15:05
  • 1
    @Tommy Be free omit it, I think necesarry only for oldier pandas versions. – jezrael Feb 19 '21 at 17:54
2

Use simple approach

  • per series dropna()
  • concat to start to make series consistent length
df.Date = pd.to_datetime(df.Date)
df = df.assign(**{c:np.concatenate([np.full(df[c].isna().sum(), np.nan) , df[c].dropna()])
             for c in df.select_dtypes(include=['int64',"float64"]).columns})

Date A B C D E
0 2021-01-01 00:00:00 1089 nan nan nan nan
1 2021-01-02 00:00:00 609 nan 491 nan nan
2 2021-01-03 00:00:00 1440 547 184 195 665
3 2021-01-04 00:00:00 704 824 345 1091 969
4 2021-01-05 00:00:00 297 632 1426 512 555
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30