Filtering specific data points from a dataframe based on a given conditions

Question

I have a Dataframe like below

+----------+-------+-------+-------+-------+-------+
|   Date   | Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
+----------+-------+-------+-------+-------+-------+
| 1-Jan-19 |    50 |     0 |    40 |    80 |    60 |
| 2-Jan-19 |    60 |    80 |    60 |    80 |    90 |
| 3-Jan-19 |    80 |    20 |     0 |    50 |    30 |
| 4-Jan-19 |    90 |    20 |    10 |    90 |    20 |
| 5-Jan-19 |    80 |     0 |    10 |    10 |     0 |
| 6-Jan-19 |   100 |    90 |   100 |     0 |    10 |
| 7-Jan-19 |    20 |    10 |    30 |    20 |     0 |
+----------+-------+-------+-------+-------+-------+

I want to extract all the data points (row label & column Label) if the value is zero and produce a new dataframe.

my desired output is as below

+--------------+----------------+
| Missing Date | Missing column |
+--------------+----------------+
| 1-Jan-19     | Loc 2          |
| 3-Jan-19     | Loc 3          |
| 5-Jan-19     | Loc 2          |
| 5-Jan-19     | Loc 5          |
| 6-Jan-19     | Loc 4          |
| 7-Jan-19     | Loc 5          |
+--------------+----------------+

Note on 5-Jan-19, there are two entries Loc 2 & Loc 5.

I know how to do this in Excel VBA. But, I'm looking for a more scalable solution with python-pandas.

so far i have attempted with the below code

import pandas as pd

df = pd.read_csv('data.csv')

new_df = pd.DataFrame(columns=['Missing Date','Missing Column'])

for c in df.columns:
    if c != 'Date':
        if df[df[c] == 0]:
            new_df.append(df[c].index, c)

I'm new to pandas. Hence, guide me how to solve this issue.

score 4 · Accepted Answer · answered Aug 22 '19 at 23:59

4

`melt` + `query`

(df.melt(id_vars='Date', var_name='Missing column')
   .query('value == 0')
   .drop(columns='value')
)

        Date Missing column
7   1-Jan-19          Loc 2
11  5-Jan-19          Loc 2
16  3-Jan-19          Loc 3
26  6-Jan-19          Loc 4
32  5-Jan-19          Loc 5
34  7-Jan-19          Loc 5

answered Aug 22 '19 at 23:59

ALollz

57,915
7
66
89

Although I managed to solve it by myself (check the answer above), I'm compelled to accept your code. Elegant & loved it.! – Sunni Aug 23 '19 at 00:30

Alexander · Answer 2 · 2019-08-23T00:12:52.313

Melt the dateframe using the date column as id_vars, then filter where the value is zero (e.g. using .loc[lambda x: x['value'] == 0]). Now it is just clean-up:

sort values on Date and Missing column
drop the value column (they all contain zeros)
rename Date to Missing Date
reset the index, dropping the original

.

df = pd.DataFrame({
    'Date': pd.date_range('2019-1-1', '2019-1-7'),
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

df2 = (
    df
    .melt(id_vars='Date', var_name='Missing column')
    .loc[lambda x: x['value'] == 0]
    .sort_values(['Date', 'Missing column'])
    .drop('value', axis='columns')
    .rename({'Date': 'Missing Date'})
    .reset_index(drop=True)
)
>>> df2
        Date Missing column
0 2019-01-01          Loc 2
1 2019-01-03          Loc 3
2 2019-01-05          Loc 2
3 2019-01-05          Loc 5
4 2019-01-06          Loc 4
5 2019-01-07          Loc 5

Thank you. upvoted your answer. your code is similar to @ALollz answer. but, I loved it. learnt a new method `melt`, btw, i managed to answer by myself. check the code above & give me your comments to further improve. — Sunni, Aug 23 '19 at 00:35

score 3 · Answer 3 · edited Aug 23 '19 at 00:34

I'm the crazy answer,

You can use for the dates :

new_dates = pd.np.repeat(df.index, df.eq(0).sum(axis=1).values)

Replace df.index with df['Date'] if necessary.

And for the values

cols = pd.np.where(df.eq(0), df.columns, pd.np.NaN) 
new_cols = cols[pd.notnull(cols)]

Finally,

new_df = pd.DataFrame(new_cols, index=new_dates, columns =['Missing column'])

alternatively you can create a new column instead of an index.

Now how does that work ?

new_dates takes the series and repeat each value as many times as their are True values in that row. I summed the True values over eachrow since they are equal to 1. Meaning, True when df.eq(0).

Next, I call a filter that gives the column name if the value is zero, NaN otherwise.

Finally, we only get the non NaN values and put them in an array which we end up using to build your answer.

N.B : I used that toy data as example :

df = pd.DataFrame(
    {
        "A":pd.np.random.randint(0,3,20),                                                               
        "B":pd.np.random.randint(0,3,20),
        "C":pd.np.random.randint(0,3,20), 
        "D":pd.np.random.randint(0,3,20)
    }, 
    index = pd.date_range("2019-01-01", periods=20, freq="D")
)

Thank you & upvoted . yes. your answer is also unique. I learnt a lot today. and I managed to solve it by myself. check my code above as well (very lame but i'm proud :)... thanks for the detailed answer) — Sunni, Aug 23 '19 at 00:44

score 3 · Answer 4 · answered Aug 23 '19 at 00:20

3

I managed to solve this with iterrows().

import pandas as pd
df = pd.read_csv('data.csv')

cols = ['Missing Date','Missing Column']
data_points = []

for index, row in df.iterrows():
    for c in df.columns:
        if row[c] == 0:
            data_points.append([row['Date'],c])

df_final = pd.DataFrame(df_final = pd.DataFrame(data_points, columns=cols), columns=cols)

answered Aug 23 '19 at 00:20

Sunni

365
1
11

great question, however your answer is not 'pythonian' per-se. check ALollz answer. The tools he's using are the right tools for this job. – adhg Aug 23 '19 at 00:26
yes. I accepted his answer. But, I feel proud to having solved it by myself, being a new bee in python-pandas. – Sunni Aug 23 '19 at 00:38
2

`iterrows` will be a very slow solution for larger datasets and should be avoided. – Alexander Aug 23 '19 at 00:45
@Alexander OK. well noted. – Sunni Aug 23 '19 at 00:46
@Alexander @adhg . Yes you are right. one of my other codes with `iterrows()` is very slow. Hence, I asked a new [question](https://stackoverflow.com/questions/57629485/speeding-up-pandas-code-by-replacing-iterrows). need your help once again. – Sunni Aug 23 '19 at 15:51

Filtering specific data points from a dataframe based on a given conditions

4 Answers4

melt + query

`melt` + `query`