Python 3.x - dataframe comprehension based on subset of defaultdict

Question

I have a defaultdict whose items look like this:

dict_items([('62007459', 0), ('68092193', 0), ('40224646', 120.92999999999999), ('68078141', 0), ('68061506', 0), ('60000216', 123.84), ...])

However, I don't need all of the entries, only a subset. The keys I want are stored in a plain Python list of strings:

lcodes = ['40224646','60000216', ... ]

The following gives me a dataframe for the whole dict (z_lookup is just a dict that maps one kind of id to another):

df_all = pd.DataFrame([(k, z_lookup.get(k, 'N/A'), v) for k, v in 
                          output.items()], columns=['id','zcode','values'])

I can access values one at a time, for instance: output['62007459'] gives me 0, which is correct. But I'm not sure how to scale up/iterate this process so that the pandas dataframe is only created from the keys in output that are present in lcodes.

Question

Given a defaultdict and a plain Python list of keys to keep, how do I build a pandas dataframe from only those entries of the defaultdict whose keys appear in that list?

`df_all = pd.DataFrame([(k, z_lookup.get(k, 'N/A'), v) for k, v in output.items() if k in lcodes], columns=['id','zcode','values'])` perhaps? — Nick, Apr 10 '23 at 07:51
Or maybe `df_all = pd.DataFrame([(k, z_lookup.get(k, 'N/A'), output.get(k)) for k in set(output) & set(lcodes)], columns=['id','zcode','values'])` — Nick, Apr 10 '23 at 07:54

score 1 · Answer 1 · answered Apr 10 '23 at 08:22

If it is OK to remove the unwanted data, the easiest solution might be to just filter out items from your default dict before using it to initialise the dataframe.

After you filter out the unwanted data, you can create the dataframe exactly as you do now. For example:

# drop every key that is NOT in lcodes, assuming 'output' is your default dict
# (set(output) is a copy of the keys, so popping inside the loop is safe)
for key in set(output) - set(lcodes):
    output.pop(key)

# create dataframe from only desired data
df = pd.DataFrame([(k, z_lookup.get(k, 'N/A'), v) for k, v in 
                          output.items()], columns=['id','zcode','values'])

While it adds two extra lines of code for filtering, imho doing it in two steps is slightly more readable.
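
If the original dict has to stay intact, a non-destructive variant is to build the rows from lcodes directly instead of popping anything. This is only a minimal sketch: the sample values are modelled on the question, while the contents of z_lookup and the '99999999' key are made up for illustration. The `k in output` test matters with a defaultdict, because a plain `output[k]` on a missing key would silently insert a default entry.

```python
from collections import defaultdict

import pandas as pd

# Sample data modelled on the question; z_lookup's contents are hypothetical.
output = defaultdict(int, {'62007459': 0, '40224646': 120.93, '60000216': 123.84})
z_lookup = {'40224646': 'Z1'}
lcodes = ['40224646', '60000216', '99999999']  # last key is absent from output

# Walk the keys to keep and skip any that output does not contain.
# The membership test runs first, so output[k] never creates a new entry.
rows = [(k, z_lookup.get(k, 'N/A'), output[k]) for k in lcodes if k in output]
df = pd.DataFrame(rows, columns=['id', 'zcode', 'values'])
```

The rows come out in the order of lcodes, and output still holds all of its original entries afterwards.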
