I have a dataframe that contains 5 columns and I am using pandas and numpy to edit and work with the data.

id      calv1      calv2      calv3      calv4 
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29
2         NaT        NaT        NaT        NaT         
3  2006-08-29        NaT        NaT        NaT
4  2006-08-29 2007-08-29 2010-08-29        NaT
5  2006-08-29 2013-08-29        NaT        NaT
6  2006-08-29        NaT 2013-08-29 2013-08-292

I want to create another column that counts the number of "calv" values that occur for each id. However, it matters to me whether there are missing values in between other values; see row 6. In that case I want a NaN, or perhaps some other value, to indicate that the row is not correct.

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2
6  2006-08-29        NaT 2013-08-29 2013-08-292     NaN    #or some other value

Here is my last attempt:

nat = np.datetime64('NaT')

# 0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
# 1 calving
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
# 2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
# 3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
# 4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4

But the result is that the whole "no_calv" column is 4.0

I previously tried things like

..
(df["calv1"] != "NaT")
..

And

..
(df["calv1"] != pd.nat)
..

And the result was always 4.0 for the whole column, or just NaN. I can't seem to find a way of telling Python what the NaT values are.
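A quick check in the console seems to confirm that NaT never compares equal to anything, not even itself, which would explain why every `!=` condition above matches:

>>> import pandas as pd
>>> pd.NaT == pd.NaT
False
>>> pd.NaT != pd.NaT
True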

Any tips and tricks for a new Python user? I've done this in both SAS and Fortran using if and else-if statements, but I am trying to find the best way to do it in Python.

Edit: I'm really curious to know whether this can be done with if or else-if statements.
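Something in the spirit of this row-wise translation of my SAS logic is what I imagine (an untested sketch; `count_calvings` is just a made-up name):

def count_calvings(row):
    dates = [row["calv1"], row["calv2"], row["calv3"], row["calv4"]]
    n = sum(pd.notna(d) for d in dates)       # number of recorded calvings
    if all(pd.notna(d) for d in dates[:n]):   # are all dates packed at the front?
        return n
    else:
        return np.nan                         # a gap in the middle marks a faulty row

df["no_calv"] = df.apply(count_calvings, axis=1)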

And now I'm also thinking I would like to be able to have other columns in the dataframe that contain extra info but are not needed for this exact purpose. An example (an added yx column):

id yx       calv1      calv2      calv3      calv4 no_calv
1  27  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2  34         NaT        NaT        NaT        NaT       0 
3  89  2006-08-29        NaT        NaT        NaT       1
4  23  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  11  2006-08-29 2013-08-29        NaT        NaT       2
6  43  2006-08-29        NaT 2013-08-29 2013-08-292     NaN    #or some other value
– Thordis

4 Answers

Another way of doing it using pd.Series.last_valid_index and pd.DataFrame.count:

>>> df2 = df.copy()
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
         calv1       calv2       calv3        calv4  no_calv
id                                                          
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29      4.0
2          NaN         NaN         NaN          NaN      0.0
3   2006-08-29         NaN         NaN          NaN      1.0
4   2006-08-29  2007-08-29  2010-08-29          NaN      3.0
5   2006-08-29  2013-08-29         NaN          NaN      2.0
6   2006-08-29         NaN  2013-08-29  2013-08-292      NaN

Explanation:

pd.Series.last_valid_index returns the index label of the last valid (non-null) entry in a Series. Applying it to your rows tells you the column position of the last valid value in each row (after which there are only NaNs/NaTs).
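For example, on a single Series (a toy illustration):

>>> pd.Series([1, np.nan, 3, np.nan]).last_valid_index()
2
>>> print(pd.Series([np.nan, np.nan]).last_valid_index())
None

An all-null Series returns None, which is why it gets paired with .fillna(0) here.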

Below I temporarily replaced the column names with integer indices and then applied pd.Series.last_valid_index on each row:

>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> df2
             1           2           3            4
id                                                 
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29
2          NaN         NaN         NaN          NaN
3   2006-08-29         NaN         NaN          NaN
4   2006-08-29  2007-08-29  2010-08-29          NaN
5   2006-08-29  2013-08-29         NaN          NaN
6   2006-08-29         NaN  2013-08-29  2013-08-292

>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0)
id
1    4.0
2    0.0
3    1.0
4    3.0
5    2.0
6    4.0
dtype: float64

So on row 1, last valid data is in column 4, on row 2 there is no valid data, and so on.

Now let's count the number of valid values in each row:

>>> df2.count(axis=1)
id
1    4
2    0
3    1
4    3
5    2
6    3
dtype: int64

So, on row 1, there are 4 valid values; on row 2, no valid values; and so on. Now, if all NaN/NaT values are towards the end of a row, the count should match the last-valid-data position we calculated above:

>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1)
id
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

So as seen, it matches on all rows except the last, because NaT appears in the middle of valid values there. We can use this as a mask and then fill in the sums:

>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
         calv1       calv2       calv3        calv4  no_calv
id                                                          
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29      4.0
2          NaN         NaN         NaN          NaN      0.0
3   2006-08-29         NaN         NaN          NaN      1.0
4   2006-08-29  2007-08-29  2010-08-29          NaN      3.0
5   2006-08-29  2013-08-29         NaN          NaN      2.0
6   2006-08-29         NaN  2013-08-29  2013-08-292      NaN
– Ank
  • Thanks for this. I tried it, but it takes a really long time because the dataframe is 400,000 rows. How would you do this method if the dataframe contained more columns than the ones described and I don't want to include them in the no_calv count? – Thordis Jun 10 '21 at 17:53
  • You can first select the columns you want to operate on in a different df, like `df_temp = df.loc[:, ['calv1','calv2','calv3','calv4']]`. Then perform the above steps using `df_temp` as your main df. In the end, just assign the `no_calv` column in `df_temp` back to the original `df`: `df['no_calv'] = df_temp['no_calv']`. – Ank Jun 10 '21 at 18:04
  • I believe the slowness was due to the df.transpose I was doing when calculating counts. It will be slow on large dfs. Changed it to `df2.count(axis=1)`. Should improve the performance now. – Ank Jun 10 '21 at 18:35

To test whether a value is NaT, use pd.isnull; it matches None, NaN, and NaT.
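For example:

>>> pd.isnull(None), pd.isnull(float("nan")), pd.isnull(pd.NaT)
(True, True, True)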

You can build a function which does this check and sums all of the values until it hits a null value. For example:

import io
import numpy as np
import pandas as pd
df = pd.read_fwf(io.StringIO("""calv1      calv2      calv3      calv4 
2006-08-29 2007-08-29 2008-08-29 2009-08-29
       NaT        NaT        NaT        NaT         
2006-08-29        NaT        NaT        NaT
2006-08-29 2007-08-29 2010-08-29        NaT
2006-08-29 2013-08-29        NaT        NaT
2006-08-29        NaT 2013-08-29 2013-08-292"""))
df = df.replace("NaT", pd.NaT)

def count_non_nat(row):
    count = 0
    for i in row:
        if pd.isnull(i):
            # hitting a null before the last valid value means a gap mid-row
            if count < len(row.dropna()):
                return np.nan
            # otherwise all nulls are at the end: the count so far is the answer
            return count
        count += 1
    return count

# Apply this function row-wise (axis=1)
df['count'] = df.apply(count_non_nat, axis=1)

The output is a new column:

  calv1      calv2      calv3      calv4       count
0 2006-08-29 2007-08-29 2008-08-29 2009-08-29  4
1 NaT        NaT        NaT        NaT         0
2 2006-08-29 NaT        NaT        NaT         1
3 2006-08-29 2007-08-29 2010-08-29 NaT         3
4 2006-08-29 2013-08-29 NaT        NaT         2
5 2006-08-29 NaT        2013-08-29 2013-08-292 NaN
– SNygard
  • Thanks for this. But I don't want the count in the last row (which I consider a faulty line) to be 1, which would make it the same as a "correct" row with only a value for calv1. – Thordis Jun 10 '21 at 16:26
  • Oops, I missed that requirement the first time. I added a check: if the reported count is smaller than `len(row.dropna())`, then there are reported dates after a `NaT`. In that case, we return `np.nan` to the dataframe. – SNygard Jun 10 '21 at 19:08

You can try the following, with df.interpolate:

>>> numeric = df.apply(lambda col: col.dt.day, axis=1)
# convert to something other than datetime

    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0    NaN   29.0   29.0

>>> mask = (
        numeric.isna() != numeric.interpolate(limit_area='inside', axis=1).isna()
    ).any(1)
>>> mask
id
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

>>> df.loc[~mask, 'no_calv'] = df.notna().sum(1)
# Or,
# df['no_calv'] = np.where(mask, np.nan, df.notna().sum(1))
>>> df

        calv1      calv2      calv3      calv4  no_calv
id                                                     
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29      4.0
2         NaT        NaT        NaT        NaT      0.0
3  2006-08-29        NaT        NaT        NaT      1.0
4  2006-08-29 2007-08-29 2010-08-29        NaT      3.0
5  2006-08-29 2013-08-29        NaT        NaT      2.0
6  2006-08-29        NaT 2013-08-29 2013-08-29      NaN

What interpolate(limit_area='inside') does is fill NaNs only when there are valid values on both sides. For example:

>>> numeric
    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0    NaN   29.0   29.0

>>> numeric.interpolate(limit_area='inside', axis=1)
    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0   29.0   29.0   29.0
             ^
   Only this one is filled

So if we check which NaN positions in numeric no longer match those in the interpolated frame, we can find the rows where NaN values sit in between valid values.
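The same idea on a toy Series, for comparison:

>>> s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
>>> s.interpolate(limit_area='inside')
0    NaN
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

Only the NaN between 1.0 and 3.0 gets filled; the leading and trailing ones are left alone, so their isna() status stays unchanged.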

– Sayandip Dutta

# get the columns that has "calv" in their name
calv_like = df.filter(like="calv")

# get a boolean frame of whether NaN or not
nan_mask = calv_like.isna().to_numpy()

# get the rows that has NaN in between
bad_rows = np.any(nan_mask[:, 1:] < nan_mask[:, :-1], axis=1)

# form the new column with numpy's if-else
df["no_calv"] = np.where(bad_rows,
                         -1,
                         len(calv_like.columns) - nan_mask.sum(axis=1))

After getting the calv-like columns, we form a NaN mask over them. Then, to detect "bad" rows, we look for an increase in NaN-ness between consecutive cells: if we see a change from NaN to non-NaN, that row is bad. Finally, np.where (NumPy's if-else) helps us form the new column: if it is a bad row, put -1 there; otherwise, put the number of non-NaNs in that row,

to get

   id  yx       calv1       calv2       calv3        calv4  no_calv
0   1  27  2006-08-29  2007-08-29  2008-08-29   2009-08-29        4
1   2  34         NaT         NaT         NaT          NaT        0
2   3  89  2006-08-29         NaT         NaT          NaT        1
3   4  23  2006-08-29  2007-08-29  2010-08-29          NaT        3
4   5  11  2006-08-29  2013-08-29         NaT          NaT        2
5   6  43  2006-08-29         NaT  2013-08-29  2013-08-292       -1
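If you prefer to stay in pandas, the same NaN-to-value transition can be caught with shift along the columns; a sketch equivalent to the mask above:

notna = calv_like.notna()
# a cell opens a "bad" stretch if it is valid but its left neighbour is missing;
# fill_value=True treats the (nonexistent) neighbour of the first column as valid
bad_rows = (notna & ~notna.shift(axis=1, fill_value=True)).any(axis=1)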
– Mustafa Aydın