I have multiple dataframes stored in a dictionary. Each key is a blocklist name and each value is the dataframe for that blocklist. The dictionary is built like this:

import pandas as pd
import requests

ADWARE_MALWARE="https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist.txt"
FAKENEWS="https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-f.txt"
GAMBLING="https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-g.txt"
PORN="https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-p.txt"
SOCIAL="https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-s.txt"

class Blocklist:
    def req(self, url: str) -> list:
        req = requests.get(url)
        lst = []
        
        if req.status_code == 200:
            read_data = req.content
            read_data = read_data.decode('utf-8')
        
            for line in read_data.splitlines():
                if not line.startswith("#"):
                    lst.append(line)

            return lst
        else:
            raise Exception(f'Website not available: Error {req.status_code}')

    def create_df(self, blocklist: list, blocklist_name: str) -> pd.DataFrame:
        df = pd.DataFrame({'Domain': blocklist, 'Blocklist Name': blocklist_name})
        return df
    
    def insert(self) -> dict:
        # Map each source URL to the name of its blocklist.
        dic = {
            ADWARE_MALWARE: "ads_malware",
            FAKENEWS: "fakenews",
            GAMBLING: "gambling",
            PORN: "porn",
            SOCIAL: "social"
        }
        d = {}
        for key, value in dic.items():
            # Reuse this instance instead of creating a new Blocklist per call.
            blocklist = self.req(key)
            d[value] = self.create_df(blocklist, value)
            print(d[value])
        return d
# Current dataframes
                                   Domain Blocklist Name
0                    ck.getcookiestxt.com    ads_malware
1                  eu1.clevertap-prod.com    ads_malware
2                        wizhumpgyros.com    ads_malware

                                   Domain Blocklist Name
0                    ck.getcookiestxt.com       fakenews
1                  eu1.clevertap-prod.com       fakenews
2                        wizhumpgyros.com       fakenews
3                      yournationnews.com       fakenews
4                        yournewswire.com       fakenews

                                   Domain Blocklist Name
0                    ck.getcookiestxt.com       gambling
1                  eu1.clevertap-prod.com       gambling
2                        wizhumpgyros.com       gambling
3                         zebrabet.com.au       gambling
4                            zenitbet.com       gambling

                                   Domain Blocklist Name
0                    ck.getcookiestxt.com           porn
1                  eu1.clevertap-prod.com           porn
2                        wizhumpgyros.com           porn
3                       www.zetton-av.com           porn
4                        www.zeus-web.net           porn

                                   Domain Blocklist Name
0                    ck.getcookiestxt.com         social
1                  eu1.clevertap-prod.com         social
2                        wizhumpgyros.com         social
3                               match.com         social
4                                 mbga.jp         social
# Expected dataframes
                                   Domain Blocklist Name
0                    ck.getcookiestxt.com    ads_malware
1                  eu1.clevertap-prod.com    ads_malware
2                        wizhumpgyros.com    ads_malware

                                   Domain Blocklist Name
0                      yournationnews.com       fakenews
1                        yournewswire.com       fakenews

                                   Domain Blocklist Name
0                         zebrabet.com.au       gambling
1                            zenitbet.com       gambling

                                   Domain Blocklist Name
0                       www.zetton-av.com           porn
1                        www.zeus-web.net           porn

                                   Domain Blocklist Name
0                               match.com         social
1                                 mbga.jp         social

Every dataframe after ads_malware also contains the ads_malware rows. This is not because I append new data to the existing data; the source lists themselves are built that way. I want to remove those rows from the other dataframes to prevent duplicates, but I'm not sure how to implement this with pandas, for example with pd.merge.

Jan
  • Please provide a **real, meaningful, reproducible** example, and the matching expected output. Clarity is important to ensure we're not wasting time trying to answer something that would be different from your real data. – mozway Aug 04 '23 at 08:20
  • Thanks for your update, but those are **not** DataFrames. Can you provide the output of `{k: df.head().to_dict('tight') for k, df in dic.items()}` where `dic` is your dictionary? – mozway Aug 04 '23 at 08:36
  • OK, thanks but now I don't understand what you mean by `[(ck.getcookiestxt.com, fakenews), (abc.com, fakenews)]` should "*look like this*" `[(ck.getcookiestxt.com, fakenews)]`. – mozway Aug 04 '23 at 08:48
  • The goal is to remove the data from `fakenews, gambling, porn, social` which are already in `ads_malware`. – Jan Aug 04 '23 at 08:51
  • You must provide the exact expected output, currently your question is unclear. You can start by providing the output for a single DataFrame. – mozway Aug 04 '23 at 08:53
  • I updated the first example – Jan Aug 04 '23 at 08:58
  • To be clear, what you showed is not an example as it doesn't show the real data. What I requested with `{k: df.head().to_dict('tight') for k, df in dic.items()}` shows a completely different format and no overlapping data. Please **craft** a meaningful example otherwise it's **impossible** to answer your question without ambiguity. Please read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Without that I'll just give up and vote to close the question. – mozway Aug 04 '23 at 09:02
  • I edited it and now you should see, that the first 5 entries are all the same. – Jan Aug 04 '23 at 09:05
  • Ok but please make the example minimal: only 4-5 meaningful rows. Not just the first 4-5 but a well crafted example that would demonstrate cases to remove and cases to keep. And importantly provide the **matching** expected output. Asking a correct question is not easy, but you have to do the work. – mozway Aug 04 '23 at 09:07
  • I edited the output and added an expected output. I also added my code. – Jan Aug 04 '23 at 09:15
  • OK, see if the below answer is what you want – mozway Aug 04 '23 at 09:32
  • Yes, that worked! Thanks a lot! – Jan Aug 04 '23 at 09:35
  • I actually updated the answer, depending on whether you want to consider only "ads_malware" or all previous DataFrames to remove the duplicates. – mozway Aug 04 '23 at 09:37

1 Answer

If you only want to consider ads_malware to remove the duplicates, use:

out = {k: g[~g['Domain'].isin(dct['ads_malware']['Domain'])] 
          if k != 'ads_malware' else g
       for k, g in dct.items()}
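
Since the question asks about pd.merge, the same filter can also be sketched as an anti-join: left-merge each DataFrame against the ads_malware domains with indicator=True and keep only the rows marked "left_only". This is a minimal sketch, not part of the original answer; the domain names and the helper anti_join are made up for illustration.

```python
import pandas as pd

# Stand-in data shaped like the question's dictionary (domains are invented).
dct = {
    "ads_malware": pd.DataFrame({"Domain": ["a.com", "b.com"],
                                 "Blocklist Name": "ads_malware"}),
    "fakenews": pd.DataFrame({"Domain": ["a.com", "b.com", "news.com"],
                              "Blocklist Name": "fakenews"}),
}


def anti_join(df: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep rows of df whose Domain is absent from ref."""
    merged = df.merge(
        ref[["Domain"]].drop_duplicates(),  # one row per reference domain
        on="Domain",
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


ref = dct["ads_malware"]
out = {k: g if k == "ads_malware" else anti_join(g, ref)
       for k, g in dct.items()}
print(out["fakenews"])
```

Unlike the boolean mask above, merge builds a new index, so the original row labels are not guaranteed to survive. For a single key column, isin is usually the shorter option.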

If you want to remove duplicates based on all previous DataFrames, use concat, drop_duplicates, and a dictionary comprehension:

out = {k: g.droplevel(0) for k, g in
       pd.concat(dct).drop_duplicates('Domain').groupby(level=0)}
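
To make the one-liner above easier to follow, here is the same chain split into its three steps on a small made-up dictionary (the domains are placeholders, not real blocklist entries). It is only an illustrative sketch of what each call contributes.

```python
import pandas as pd

# Invented stand-in data; "news.com" appears in two lists on purpose.
dct = {
    "ads_malware": pd.DataFrame({"Domain": ["a.com", "b.com"],
                                 "Blocklist Name": "ads_malware"}),
    "fakenews": pd.DataFrame({"Domain": ["a.com", "news.com"],
                              "Blocklist Name": "fakenews"}),
    "gambling": pd.DataFrame({"Domain": ["a.com", "news.com", "bet.com"],
                              "Blocklist Name": "gambling"}),
}

# 1. Stack all frames; the dict keys become the outer level of a MultiIndex.
stacked = pd.concat(dct)

# 2. Keep only the first occurrence of each domain (keep="first" is the default),
#    so a domain stays in the earliest DataFrame that contains it.
deduped = stacked.drop_duplicates("Domain")

# 3. Split back into one frame per key and drop the outer index level.
out = {k: g.droplevel(0) for k, g in deduped.groupby(level=0)}

print(out["gambling"])
```

Two caveats with this variant: groupby sorts the keys by default, so the resulting dictionary follows sorted key order rather than the original insertion order, and a key whose rows are all duplicates of earlier frames does not appear in the result at all.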

Output:

{'ads_malware':                    Domain Blocklist Name
                0    ck.getcookiestxt.com    ads_malware
                1  eu1.clevertap-prod.com    ads_malware
                2        wizhumpgyros.com    ads_malware,
 'fakenews':                   Domain Blocklist Name
                3  yournationnews.com       fakenews
                4    yournewswire.com       fakenews,
 'gambling':                Domain Blocklist Name
                3  zebrabet.com.au       gambling
                4     zenitbet.com       gambling,
 'porn':                      Domain Blocklist Name
                3  www.zetton-av.com           porn
                4   www.zeus-web.net           porn,
 'social':            Domain Blocklist Name
                3  match.com         social
                4    mbga.jp         social
}
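
The output above keeps the original row labels (3 and 4), while the expected output in the question starts at 0. If the 0-based index matters, one possible adjustment is to add reset_index(drop=True) after filtering. The sketch below uses made-up domains and the isin variant; it is an assumption about what is wanted, not part of the original answer.

```python
import pandas as pd

# Invented stand-in data shaped like the question's dictionary.
dct = {
    "ads_malware": pd.DataFrame({"Domain": ["a.com", "b.com"],
                                 "Blocklist Name": "ads_malware"}),
    "fakenews": pd.DataFrame({"Domain": ["a.com", "b.com", "news.com"],
                              "Blocklist Name": "fakenews"}),
}

ref = dct["ads_malware"]["Domain"]

# Filter out the reference domains, then renumber the remaining rows from 0.
out = {
    k: g if k == "ads_malware"
    else g[~g["Domain"].isin(ref)].reset_index(drop=True)
    for k, g in dct.items()
}

print(out["fakenews"])
```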

Reproducible input:

dct = {'ads_malware': pd.DataFrame({'Domain': ['ck.getcookiestxt.com', 'eu1.clevertap-prod.com', 'wizhumpgyros.com'],
                                    'Blocklist Name': ['ads_malware', 'ads_malware', 'ads_malware']}),
       'fakenews':    pd.DataFrame({'Domain': ['ck.getcookiestxt.com', 'eu1.clevertap-prod.com', 'wizhumpgyros.com', 'yournationnews.com', 'yournewswire.com'],
                                    'Blocklist Name': ['fakenews', 'fakenews', 'fakenews', 'fakenews', 'fakenews']}),
       'gambling':    pd.DataFrame({'Domain': ['ck.getcookiestxt.com', 'eu1.clevertap-prod.com', 'wizhumpgyros.com', 'zebrabet.com.au', 'zenitbet.com'],
                                    'Blocklist Name': ['gambling', 'gambling', 'gambling', 'gambling', 'gambling']}),
       'porn':        pd.DataFrame({'Domain': ['ck.getcookiestxt.com', 'eu1.clevertap-prod.com', 'wizhumpgyros.com', 'www.zetton-av.com', 'www.zeus-web.net'],
                                    'Blocklist Name': ['porn', 'porn', 'porn', 'porn', 'porn']}),
       'social':      pd.DataFrame({'Domain': ['ck.getcookiestxt.com', 'eu1.clevertap-prod.com', 'wizhumpgyros.com', 'match.com', 'mbga.jp'],
                                    'Blocklist Name': ['social', 'social', 'social', 'social', 'social']})
      }
mozway