
So here is an example of what my dataset would look like below. It would probably be in the form of a pandas dataframe, but it could be anything that is accessible in Python.

[example dataset posted as an image]

Let's say, for example, my 'Thing' of interest was 'Egg'. I would like to search through all of the rows of the Thing1 and Thing2 columns for wherever 'Egg' appears in one of these columns, and if the value in the PairScore column in that same row is below 3, I would like to return the 'Thing' that occupies the other column.

So for this example, the first 'Thing' that would be returned would be 'Football' from row 9. I would then want to store all of the 'Things' that get returned in something like a list or maybe a new dataframe.
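
Roughly, the logic I have in mind is something like the sketch below (just an illustration, assuming the data is in a pandas dataframe called df with the columns Thing1, Thing2 and PairScore):

matches = []
for _, row in df.iterrows():
    if row['PairScore'] < 3:
        if row['Thing1'] == 'Egg':
            matches.append(row['Thing2'])  # 'Egg' is in Thing1, so keep the Thing2 value
        elif row['Thing2'] == 'Egg':
            matches.append(row['Thing1'])  # 'Egg' is in Thing2, so keep the Thing1 value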

I hope this makes sense, if I need to expand on anything please let me know! :)

BanAckerman
  • It's much preferred to post data as text rather than images per **DO NOT post images of code, data, error messages** from [How to Ask a Good Question](https://stackoverflow.com/help/how-to-ask) – DarrylG Aug 03 '22 at 00:38
  • @DarrylG Apologies, I didn't realise that was the case. I will quickly try to figure out how to now and edit it to change that :) Thanks for the heads up – BanAckerman Aug 03 '22 at 00:41
  • To add your data as code, simply run `df.head(20).to_dict()` and copy and paste the output here – MoRe Aug 03 '22 at 00:44
  • Cf. [How to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/15873043) – fsimonjetz Aug 03 '22 at 08:37

2 Answers

prevs = []  # Things returned by previous searches, plus the previous search terms
mask = ((~df[['Thing1', 'Thing2']].isin(prevs)).all(axis=1)
        & (df[['Thing1', 'Thing2']] == 'Egg').any(axis=1)
        & (df['PairScore'] < 3))
data = df[mask]
found = pd.Series(data[['Thing1', 'Thing2']].values.flatten())
prevs += found.loc[found != 'Egg'].tolist() + ['Egg']
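
A rough usage sketch for repeated searches (the helper name find_pairs and the second query 'Dog' are only illustrations taken from the comments; this assumes pandas is imported as pd and df is the original dataframe):

def find_pairs(df, target, prevs):
    # rows where `target` appears in either Thing column, the PairScore is below 3,
    # and neither Thing has been returned (or searched for) in an earlier run
    mask = ((~df[['Thing1', 'Thing2']].isin(prevs)).all(axis=1)
            & (df[['Thing1', 'Thing2']] == target).any(axis=1)
            & (df['PairScore'] < 3))
    found = pd.Series(df.loc[mask, ['Thing1', 'Thing2']].values.flatten())
    result = found.loc[found != target].tolist()
    prevs += result + [target]  # remember everything seen so far for the next run
    return result

prevs = []
first = find_pairs(df, 'Egg', prevs)   # partners of 'Egg'
second = find_pairs(df, 'Dog', prevs)  # 'Egg' and its partners are now excluded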
MoRe
  • @ScottBoston Thank you guys, this is amazing :) Would there be a way to repeat the search for a new instance of 'Thing', BUT, in this new search, disregard/ignore/not include any of the elements that were returned from the previous search? So that the new search will only return new instances and not any of the instances from the dataframe that was created from the first search? – BanAckerman Aug 03 '22 at 01:18
  • @BanAckerman in this code, each run produces a new dataframe that is independent; there is no relationship between two runs – MoRe Aug 03 '22 at 01:31
  • If I don't understand you, please add your desired output as a table or something similar – MoRe Aug 03 '22 at 01:33
  • @MoRe Sorry let me explain again. So I would want to do a second search on a new 'Thing' but I want the second dataframe to not include any of the 'Things' from the first search. e.g. if my initial 'Egg' search from the original example returns a dataframe containing 'Football', 'Rome', and 'Elephant'. Now I want to do a second search, let's say for 'Dog'. However, the new dataframe that is returned will NOT be allowed to include 'Football', 'Rome', 'Elephant' or 'Egg' and only include new instances – BanAckerman Aug 03 '22 at 01:41
  • @MoRe I am thinking it would just be df.loc[df != [list of elements from the first dataframe]]. Does this make sense? – BanAckerman Aug 03 '22 at 01:44
  • @BanAckerman code was updated, please check it – MoRe Aug 03 '22 at 01:48
  • Hi, thanks for the updated code but it gives me a key error when I run it - KeyError: "None of [Index(['Thing1', 'Thing2'], dtype='object')] are in the [index]". Any idea why this is? Also, is there an accidental extra parenthesis after isin.prevs)) or .all(1)) in the first line? I got some sort of parentheses error initially. – BanAckerman Aug 04 '22 at 18:07

For two columns, I'd just use a combination of boolean indexes:

import pandas as pd

data = {'Thing1': ['Egg', 'Apple', 'Dog', 'Cat', 'Banana', 'Man', 'Elephant', 'Football', 'Egg', 'Elephant', 
                   'Rome', 'Pakistan', 'Egg', 'Dog', 'Banana', 'Rome', 'Apple', 'Lime', 'Album'], 
        'Thing2': ['Lemon', 'Birmingham', 'Egg', 'Dog', 'Germany', 'Flower', 'Banana', 'Egg', 'Birmingham', 
                   'School', 'Lake', 'Swimmer', 'Woman', 'Flower', 'Football', 'Egg', 'Waterfall', 'Egg', 'Bat'], 
        'PairScore': [3, 2, 5, 6, 4, 1, 8, 2, 6, 4, 3, 2, 4, 5, 6, 7, 3, 2, 1]}

df = pd.DataFrame(data)
>>> df.head()
   Thing1      Thing2  PairScore
0     Egg       Lemon          3
1   Apple  Birmingham          2
2     Dog         Egg          5
3     Cat         Dog          6
4  Banana     Germany          4

Selecting as per your criteria:

m = df.Thing1.eq("Egg") | df.Thing2.eq("Egg")
# or even m = df.filter(like="Thing").eq("Egg").any(axis=1)
df2 = df[m & df.PairScore.lt(3)]

Result:

      Thing1 Thing2  PairScore
7   Football    Egg          2
17      Lime    Egg          2

From there, one option is to .stack() the Thing columns into a single column (df3 below) and then use df3[~df3.eq("Egg")] to drop the entries that are equal to "Egg".

>>> df3 = df2.filter(like="Thing").stack()
>>> df3
7   Thing1    Football
    Thing2         Egg
17  Thing1        Lime
    Thing2         Egg
dtype: object
>>> df3[~df3.eq("Egg")]
7   Thing1    Football
17  Thing1        Lime
dtype: object

In order to do subsequent analyses on df, excluding the rows from df2 above, you can proceed with

>>> df4 = pd.concat([df, df2]).drop_duplicates(keep=False)

df4 now contains everything from df except the previous matches:

>>> df4.tail()
    Thing1     Thing2  PairScore
13     Dog     Flower          5
14  Banana   Football          6
15    Rome        Egg          7
16   Apple  Waterfall          3
18   Album        Bat          1 # 17 is gone
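
A follow-up search can then reuse the same pattern on df4. As an illustration (the next query 'Dog' is just an assumption taken from the comments):

m2 = df4.filter(like="Thing").eq("Dog").any(axis=1)
df5 = df4[m2 & df4.PairScore.lt(3)]
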
fsimonjetz
  • This is great thank you! Worked perfectly. For the final part, when wanting to conduct further searches on the df, I'd also want the new df to no longer include any instances of the initial query thing. I suppose this would just be as simple as removing the rows from df4 containing 'Egg'. I think I answered my own question whilst writing this response haha. I shall just go ahead and do that. Thank you! – BanAckerman Aug 04 '22 at 18:44
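
A minimal sketch of that last step (dropping the rows of df4 that still contain the original query, here 'Egg'; the variable names are just illustrative):

egg_rows = df4.filter(like="Thing").eq("Egg").any(axis=1)
df4_no_egg = df4[~egg_rows]  # df4 without any remaining 'Egg' rows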