2

I have a project including a big dataset and I am trying to extract some values, and this values depend on the input which is given by the user, that means it differs from one to another. My dataframe has 16 columns including species, weight, population, color, locations etc. and looks like (I made a small example for this question)

df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial', 'polar bear', 'bear', 'polar bear'],
                  'weights': [350, 350, 140, 450, 350, 540],
                  'population': [1864, 22000, 80000, 3000, 7864, 5000],
                  'color': ['brown', 'black', 'gray', 'white', 'brown', 'brown'],
                  'locations': ['US', 'Canada', 'Canada', 'Russia', 'US', 'Canada']})
output: 
   color locations  population species  weights
0  brown        US        1864    bear      350
1  black    Canada       22000    bear      350
4  brown        US        7864    bear      350

I ask from the user which features do you want to see, and I return the name/s from the dataframe. I collect the user input as a dictionary:

dict = {
  species: bear,
  weights: 350
}

and I can get the names manually by using loc

df_loc = df.loc[(df['weights'] == 350) & (df['species'] == 'bear')]

output:
color locations  population species  weights
0  brown        US        1864    bear      350
1  black    Canada       22000    bear      350
4  brown        US        7864    bear      350

Til now, it is fine, but I couldn't figure out a way to put df.loc[] in a loop through the dictionary and which can automatize the search process, since the size and the keys-values in this dictionary can change all the time.

Any ideas?

cancan
  • 131
  • 1
  • 3
  • 11

4 Answers4

2

Dont use variable dict, because python code word (builtin).

d = {
  'species': 'bear',
  'weights': 350
}

Solution with DataFrame.merge and DataFrame constructor, if no on parameter it is merge by intersection of both df columns names:

df1 = pd.DataFrame([d]).merge(df)

Or using query:

df1 = df.query(' & '.join(['{}=={}'.format(i,repr(j)) for i, j in d.items()]))

Solution with list comprehension and np.logical_and + reduce :

df1 = df[np.logical_and.reduce([df[k] == v for k, v in d.items()])]

print (df1)
  species  weights  population  color locations
0    bear      350        1864  brown        US
1    bear      350       22000  black    Canada
4    bear      350        7864  brown        US
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

You can define a function which accepts the features dictionary provided by the user and returns the required dataframe.

CODE:

def get_df(features):
    mask = True
    for k, v in features.items():
        mask = mask & (df[k] == v)
    return df.loc[mask]

Input Features:

features = {
  "species": "bear",
  "weights": 350
}

Output:

>>> print(get_df(features))

  species  weights  population  color locations
0    bear      350        1864  brown        US
1    bear      350       22000  black    Canada
4    bear      350        7864  brown        US
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 2
    liked the idea, really! I mean I was thinking completely different, thanks! – cancan Apr 01 '20 at 08:14
  • Btw, I need to add something, I checked all of the answers given by other users in terms of elapsed time on a big dataframe, and this is the quickest option. – cancan Apr 01 '20 at 09:16
1

One option could be to build a DataFrame from the dictionary and use merge:

pd.Series(d).to_frame().T.merge(df)

  species weights  population  color locations
0    bear     350        1864  brown        US
1    bear     350       22000  black    Canada
2    bear     350        7864  brown        US

d being:

d = {
  'species': 'bear',
  'weights': 350
}
yatu
  • 86,083
  • 12
  • 84
  • 139
0

an addition to the solutions already offered:
you can loop through the keys in the dict that are in the columns, zip the booleans and pass the final output to ur loc :

dicts = {
         'species': 'bear',
         'weights': 350
         }

#loop through the keys in the dictionary that are in the columns:
loop = zip(*(df[col].eq(dicts[col]).tolist()
            for col in dicts.keys())
           )

 #get single values of True or False from the zipped loop
 cond = [all(bools) for bools in loop]

print(cond)
[True, True, False, False, True, False]

#pass the cond into ur dataframe
df.loc[cond]


  species   weights population  color   locations
0   bear      350   1864        brown   US
1   bear      350   22000       black   Canada
4   bear      350   7864        brown   US
sammywemmy
  • 27,093
  • 4
  • 17
  • 31