Using filter function to find duplicates, None Values, and Values > 0

Question

I have a dataset (a list of lists) where each inner list is a row with two columns (date and sample).

I first parse out the year, month, and day to create a new date object, which helps with gathering all the data for one day.

I then use a dictionary to store the daily samples, with the date as the key and a list of samples as the value. I drop values that are None or not greater than 0.

I then remove duplicate values from each daily list. I couldn't figure out how to do that in the first for loop, so I used a dictionary comprehension. Could someone show me how to check for None values, values > 0, and duplicates in a single for loop? I would think the filter function would help with finding None values, values that are not > 0, and duplicates. I would also like to preserve the original order, but calling list(set()) shuffles it for some reason.

import datetime
from dateutil.parser import parse
from dateutil.tz import gettz

ds = [["Wed Feb 02 22:51:17 CST 2022", 9607377.0],
      ["Wed Feb 02 23:21:17 CST 2022", 9607507.0],
      ["Wed Feb 02 23:51:17 CST 2022", 9607637.0],
      ["Thu Feb 03 00:21:17 CST 2022", 9607766.0],
      ["Thu Feb 03 00:51:17 CST 2022", 9607896.0],
      ["Thu Feb 03 01:21:17 CST 2022", 9608026.0],
      ["Thu Feb 03 01:51:17 CST 2022", 9608158.0],
      ["Thu Feb 03 02:21:17 CST 2022", 9608289.0],
      ["Thu Feb 03 02:51:17 CST 2022", 9608421.0],
      ["Thu Feb 06 10:21:18 CST 2022", 0.0],
      ["Thu Feb 03 03:21:17 CST 2022", 9608556.0],
      ["Thu Feb 03 03:51:17 CST 2022", 9608691.0],
      ["Thu Feb 04 04:21:17 CST 2022", 9608822.0],
      ["Thu Feb 04 04:51:17 CST 2022", 9608956.0],
      ["Thu Feb 04 05:21:18 CST 2022", 9609092.0],
      ["Thu Feb 04 05:51:18 CST 2022", 9609228.0],
      ["Thu Feb 05 06:21:18 CST 2022", 9609363.0],
      ["Thu Feb 05 06:21:18 CST 2022", 9609363.0],
      ["Thu Feb 05 06:51:18 CST 2022", 9609504.0],
      ["Thu Feb 05 07:21:18 CST 2022", 9609645.0],
      ["Thu Feb 05 07:51:18 CST 2022", 9609787.0],
      ["Thu Feb 05 08:21:18 CST 2022", 9609925.0],
      ["Thu Feb 05 08:51:18 CST 2022", 9610068.0],
      ["Thu Feb 06 09:51:18 CST 2022", 9610358.0],
      ["Thu Feb 06 10:21:18 CST 2022", 9610503.0],
      ["Thu Feb 06 10:21:18 CST 2022", None],
      ["Thu Feb 06 10:51:18 CST 2022", 9610646.0]]

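# Map the "CST" abbreviation to a real time zone so dateutil can interpret it
tz_dict = {"CST": gettz('America/Chicago')}
dict1 = {}
for row in ds:
    date = parse(row[0], tzinfos=tz_dict)
    # Truncate to midnight so all samples from one day share a key
    new_date = datetime.datetime(date.year, date.month, date.day)
    # Keep only samples that are present and strictly positive
    if row[1] is not None and row[1] > 0:
        dict1.setdefault(new_date, []).append(row[1])

# Second pass: drop duplicates per day (set() does not preserve order)
dict1 = {k: list(set(v)) for k, v in dict1.items()}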
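
The order changes because a set is an unordered collection, so list(set(v)) gives no guarantee about element order. One possible way to fold all three checks into the first loop is to keep a per-day set of values already stored and skip anything found in it. The following is only a minimal sketch: the names day_of and group_unique_positive are made up for illustration, it parses the stamps with the standard library instead of dateutil (assuming the zone is always the literal "CST"), and it uses plain date objects as keys rather than midnight datetimes.

```python
from datetime import date, datetime


def day_of(stamp: str) -> date:
    # Assumption: every stamp carries the literal "CST" abbreviation,
    # as in the sample data, so it can be matched as fixed text.
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S CST %Y").date()


def group_unique_positive(rows):
    daily = {}  # day -> samples in first-seen order
    seen = {}   # day -> set of samples already stored for that day
    for stamp, sample in rows:
        # Drop missing and non-positive samples
        if sample is None or sample <= 0:
            continue
        day = day_of(stamp)
        day_seen = seen.setdefault(day, set())
        # Drop exact duplicates within the same day
        if sample in day_seen:
            continue
        day_seen.add(sample)
        daily.setdefault(day, []).append(sample)
    return daily


rows = [
    ["Thu Feb 05 06:21:18 CST 2022", 9609363.0],
    ["Thu Feb 05 06:21:18 CST 2022", 9609363.0],
    ["Thu Feb 05 06:51:18 CST 2022", 9609504.0],
    ["Thu Feb 06 10:21:18 CST 2022", 0.0],
    ["Thu Feb 06 10:21:18 CST 2022", None],
    ["Thu Feb 06 10:51:18 CST 2022", 9610646.0],
]
result = group_unique_positive(rows)
```

If a second pass is acceptable, list(dict.fromkeys(v)) is an order-preserving alternative to list(set(v)), since dictionaries keep insertion order in Python 3.7 and later.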

Do your samples have to be floating point numbers? If you could use integers, duplicate detection would be easier ;-) - FObersteiner, Feb 11 '22 at 14:18
@FObersteiner Yes, they are floating point numbers. Can you explain why it matters whether they are int or float? If the number is the same, shouldn't the duplicate be removed either way? - Musclemania05, Feb 11 '22 at 16:58
Because for a duplicate check, you need to determine equality. With integers, you could just put them in a set and do an "is in" check. For floats, you need to check for "almost-equality", see e.g. https://stackoverflow.com/q/5595425/10197418 - FObersteiner, Feb 11 '22 at 17:21
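
The "almost-equality" idea from the last comment could look roughly like the sketch below, which uses math.isclose from the standard library. The helper name dedupe_close and the tolerance defaults are assumptions chosen for illustration, not something given in the thread. It compares each value against every value already kept, so it is quadratic in the list length, which should be acceptable for short daily lists.

```python
import math


def dedupe_close(values, rel_tol=1e-9, abs_tol=0.0):
    """Keep the first of any run of nearly equal floats, preserving order."""
    kept = []
    for v in values:
        # Treat v as a duplicate if it is close to anything already kept
        if not any(math.isclose(v, k, rel_tol=rel_tol, abs_tol=abs_tol) for k in kept):
            kept.append(v)
    return kept


# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so a set treats the two values as different.
samples = [0.1 + 0.2, 0.3, 9609363.0, 9609363.0]
unique = dedupe_close(samples)
```

For samples that are exactly repeated readings, as in the dataset above, a plain set membership test already catches them; the tolerance only matters when values that should be equal differ by rounding error.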
