0

I have a dataset that collects data from text. This is time series data and sometimes it is in error. I need to remove rows from them by condition: when the number is between two numbers that are greater than it. You also need to take into account the date, which may coincide with the value from the string. Here is an example entry and what I want to get (dataframe):

     date        infected_in_all

369  14-01-2021  722487.0
370  14-01-2021  19.0
371  13-01-2021  715438.0
546  29-07-2020  176437.0
547  28-07-2020  1.0
548  28-07-2020  175812.0
549  27-07-2020  1.0
550  27-07-2020  174985.0
551  26-07-2020  174721.0


     date       infected_in_all

369  14-01-2021  722487.0
371  13-01-2021  715438.0
546  29-07-2020  176437.0
548  28-07-2020  175812.0
550  27-07-2020  174985.0
551  26-07-2020  174721.0

Need help writing a condition.

kostya ivanov
  • 613
  • 5
  • 15
  • `df[df['infected_in_all'].ne(1)]` – mozway Feb 08 '22 at 11:33
  • @mozway, This won't help as sometimes the values are not 1 but some other number, updated the question – kostya ivanov Feb 08 '22 at 11:35
  • `threshold = 100 ; df[df['infected_in_all'].ge(threshold)]` ? – mozway Feb 08 '22 at 11:36
  • or `groupby('date')` + select the max, I think this is an already widely covered topic ;) Check the second duplicate – mozway Feb 08 '22 at 11:36
  • @mozway, It’s also not quite right, here is a time series, from the beginning of the spread of the infection, in the first days the values ​​are very small) For good, you need to rewrite the stemming, but the deadlines are running out – kostya ivanov Feb 08 '22 at 11:38
  • so how do you unambiguously define an outlier? can there be many valid dates? – mozway Feb 08 '22 at 11:40
  • *"You also need to take into account the date.."* What happened when *you* took that into account? What is your attempt at this, that is? –  Feb 08 '22 at 11:42
  • 1
    @kostyaivanov OK, I'll provide you an example, but you should clarify the logic for filtering (e.g. providing a threshold) – mozway Feb 08 '22 at 11:45
  • @mozway, i'm bad, everything worked out) Found a solution in similar places, one of them worked, thanks for taking the time – kostya ivanov Feb 08 '22 at 11:49
  • @kostyaivanov then it might be valuable to provide this solution as answer here ;) – mozway Feb 08 '22 at 11:51

1 Answers1

1

You can use a rolling mean to filter the data points relative the days around:

# ensure datetime type
df['date'] = pd.to_datetime(df['date'])

# computing a rolling mean on ± 1 day
roll = (df.set_index('date')['infected_in_all']
          .rolling('3d', center=True,  min_periods=1).mean()
          .bfill()
          .values # as array 
        )

# Is the value within 50% of the values around (arbitrary threshold for example)
df2 = df[df['infected_in_all'].div(roll).gt(0.5)]

output:

          date  infected_in_all
369 2021-01-14         722487.0
371 2021-01-13         715438.0
546 2020-07-29         176437.0
548 2020-07-28         175812.0
550 2020-07-27         174985.0
551 2020-07-26         174721.0
mozway
  • 194,879
  • 13
  • 39
  • 75