I have multiple dataframes in a dictionary, built by the code below (each dictionary value is a dataframe):
import pandas as pd
import requests

ADWARE_MALWARE = "https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist.txt"
FAKENEWS = "https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-f.txt"
GAMBLING = "https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-g.txt"
PORN = "https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-p.txt"
SOCIAL = "https://raw.githubusercontent.com/ScriptTiger/scripttiger.github.io/master/alts/domains/blacklist-s.txt"


class Blocklist:
    def req(self, url: str) -> list:
        # Download one blocklist and return its non-comment lines
        req = requests.get(url)
        lst = []
        if req.status_code == 200:
            read_data = req.content
            read_data = read_data.decode('utf-8')
            for line in read_data.splitlines():
                if not line.startswith("#"):
                    lst.append(line)
            return lst
        else:
            raise Exception(f'Website not available: Error {req.status_code}')

    def create_df(self, blocklist: list, blocklist_name: str) -> pd.DataFrame:
        # One row per domain, labelled with the name of its blocklist
        df = pd.DataFrame({'Domain': blocklist, 'Blocklist Name': blocklist_name})
        return df

    def insert(self):
        # Map each source URL to the name used as dictionary key
        dic = {
            ADWARE_MALWARE: "ads_malware",
            FAKENEWS: "fakenews",
            GAMBLING: "gambling",
            PORN: "porn",
            SOCIAL: "social"
        }
        d = {}
        for key, value in dic.items():
            blocklist = self.req(key)
            d[value] = self.create_df(blocklist, value)
            print(d[value])
        return d
# Current dataframes
                   Domain  Blocklist Name
0    ck.getcookiestxt.com     ads_malware
1  eu1.clevertap-prod.com     ads_malware
2        wizhumpgyros.com     ads_malware

                   Domain  Blocklist Name
0    ck.getcookiestxt.com        fakenews
1  eu1.clevertap-prod.com        fakenews
2        wizhumpgyros.com        fakenews
3      yournationnews.com        fakenews
4        yournewswire.com        fakenews

                   Domain  Blocklist Name
0    ck.getcookiestxt.com        gambling
1  eu1.clevertap-prod.com        gambling
2        wizhumpgyros.com        gambling
3         zebrabet.com.au        gambling
4            zenitbet.com        gambling

                   Domain  Blocklist Name
0    ck.getcookiestxt.com            porn
1  eu1.clevertap-prod.com            porn
2        wizhumpgyros.com            porn
3       www.zetton-av.com            porn
4        www.zeus-web.net            porn

                   Domain  Blocklist Name
0    ck.getcookiestxt.com          social
1  eu1.clevertap-prod.com          social
2        wizhumpgyros.com          social
3               match.com          social
4                 mbga.jp          social
# Expected dataframes
                   Domain  Blocklist Name
0    ck.getcookiestxt.com     ads_malware
1  eu1.clevertap-prod.com     ads_malware
2        wizhumpgyros.com     ads_malware

               Domain  Blocklist Name
0  yournationnews.com        fakenews
1    yournewswire.com        fakenews

            Domain  Blocklist Name
0  zebrabet.com.au        gambling
1     zenitbet.com        gambling

              Domain  Blocklist Name
0  www.zetton-av.com            porn
1   www.zeus-web.net            porn

      Domain  Blocklist Name
0  match.com          social
1    mbga.jp          social
Every dataframe after ads_malware also contains the ads_malware rows. This is not because I append the new data to the existing data; the source lists themselves are built that way. I want to remove those rows from the other dataframes to prevent duplicates, but I'm not sure how to implement this with pandas pd.merge.
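To make the goal concrete, here is a minimal sketch of one way this could be done with a left merge and `indicator=True` (an anti-join). It assumes the dictionary is keyed by blocklist name as above and that matching on the `Domain` column alone is enough. The function name `drop_shared_domains` and the small sample frames are made up for illustration; they are not part of my real code.

```python
import pandas as pd


def drop_shared_domains(frames: dict, base_key: str = "ads_malware") -> dict:
    """Remove from every other dataframe the rows whose Domain also
    appears in the base dataframe (hypothetical helper, illustration only)."""
    base_domains = frames[base_key][["Domain"]].drop_duplicates()
    cleaned = {}
    for name, df in frames.items():
        if name == base_key:
            cleaned[name] = df
            continue
        # indicator=True adds a "_merge" column: "both" for rows also found
        # in the base dataframe, "left_only" for rows unique to this one.
        merged = df.merge(base_domains, on="Domain", how="left", indicator=True)
        kept = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
        cleaned[name] = kept.reset_index(drop=True)
    return cleaned


# Small sample taken from the output shown above
shared = ["ck.getcookiestxt.com", "eu1.clevertap-prod.com", "wizhumpgyros.com"]
frames = {
    "ads_malware": pd.DataFrame({"Domain": shared, "Blocklist Name": "ads_malware"}),
    "fakenews": pd.DataFrame(
        {"Domain": shared + ["yournationnews.com", "yournewswire.com"],
         "Blocklist Name": "fakenews"}
    ),
}

cleaned = drop_shared_domains(frames)
print(cleaned["fakenews"])
```

If the merge is not strictly required, filtering with `df[~df["Domain"].isin(base_domains["Domain"])]` should give the same rows without the intermediate `_merge` column.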