0

I am trying to get a subset of data from a dataframe, based on two criteria, where Region is like 'nebraska' and where metric1 is not nan

The region filter works, but the nan filter does not and I am unsure why. Would appreciate some pointers on why it does not work.

Thanks!

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
import pyodbc


data = [{ 'Region':'nebraska', 'metric1':50},
        { 'Region':'nebraska', 'metric1':np.nan},
        { 'Region':'nebraska', 'metric1':50},
        { 'Region':'nebraska', 'metric1':50},
        { 'Region':'oaklahoma', 'metric1':np.nan},
        { 'Region':'oaklahoma', 'metric1':50},
        { 'Region':'oaklahoma', 'metric1':np.nan},]

testDataset = pd.DataFrame(data)

testDataset[(testDataset.Region.str.contains('nebraska')) & (testDataset.metric1 != np.nan)]
Aftab H.
  • 1,517
  • 4
  • 13
  • 25
wilson_smyth
  • 1,202
  • 1
  • 14
  • 39
  • did the below solution help? feel free to accept if it did, or ask for clarification. – jpp Mar 10 '18 at 00:07

1 Answers1

0

This should work:

testDataset[testDataset['Region'].str.contains('nebraska', regex=False, na=False) & \
            pd.notnull(testDataset['metric1'])]

np.nan == np.nan is False (see explanation) which is why your method does not work.

I've also added regex=False, which should improve performance, and na=False so that you do not get errors for unexpected types.

jpp
  • 159,742
  • 34
  • 281
  • 339