
I have a Discord bot written in Python and I wanted to add a feature that would make it immediately delete any phishing links it finds.

I looked for a list of known phishing domains and I found this on GitHub.

However, the issue is that this is a JS file containing one big array, and my bot is 100% Python.

I could just make a copy of this list, but then I lose the advantage of it being constantly updated, so I would like to read the domains directly from GitHub, if possible.

I am not sure how to get and parse this into a Python list.

Looking around on StackOverflow, people suggest parsing the data as JSON or using regex, but unfortunately I haven't understood it all yet.

Guidance would help, or maybe you have a better way of doing this altogether! Thank you

Sidewinder

2 Answers

1

Here is one approach (prone to failure and definitely not the recommended way to do this):

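import requests

RAW_DATA_LINK = "https://raw.githubusercontent.com/nikolaischunk/discord-phishing-links/main/domain-list.js"


def get_data():
    # Download the raw JS file and decode the bytes to text.
    response = requests.get(RAW_DATA_LINK)
    data = response.content.decode()
    # Strip the JS declaration and the trailing semicolon so only the array literal is left.
    data = data.replace("const suspiciousDomains = ", "").replace(";", "")  # or just data[26:-2]
    # WARNING: eval() executes whatever the downloaded text contains.
    return eval(data)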

get_data() will give you a list of all the links in that file. You could additionally try using sessions while making the request...

Again, if you are in control of that file, just store it as JSON; if you are not, you'd probably be better off with regular expressions.
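For what it's worth, the regular-expression route could look roughly like the sketch below. It assumes the file holds a single flat array of quoted strings with no JS comments inside it; `parse_domain_array`, `ARRAY_PATTERN` and `SAMPLE` are made-up names for illustration, not anything from the repository. It swaps `eval()` for `ast.literal_eval()`, which only accepts Python literals and does not run code.

```python
import ast
import re

# Matches from the first "[" to the last "]"; DOTALL lets "." span newlines,
# which matters because the array is spread over many lines.
ARRAY_PATTERN = re.compile(r"\[.*\]", re.DOTALL)


def parse_domain_array(js_source):
    """Extract the array literal from JS source and parse it without eval()."""
    match = ARRAY_PATTERN.search(js_source)
    if match is None:
        raise ValueError("no array literal found in source")
    # literal_eval only builds literals (lists, strings, numbers, ...).
    domains = ast.literal_eval(match.group(0))
    if not isinstance(domains, list) or not all(isinstance(d, str) for d in domains):
        raise ValueError("expected a flat list of strings")
    return domains


# Hypothetical sample in the shape the file is assumed to have.
SAMPLE = 'const suspiciousDomains = [\n  "token-bit.com",\n  "tinyurl.com/yyw8sy9b",\n];'
print(parse_domain_array(SAMPLE))
```

Because the pattern does not depend on the variable name, a rename from `suspiciousDomains` to something else would not break it. In the bot you would pass the downloaded text (for example `response.text`) instead of `SAMPLE`.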

Sujal Singh
  • Thank you! I converted this to [use aiohttp instead](https://pastebin.com/iUKmELy9) to avoid the extra import of `requests`. Could you elaborate on why this would be prone to failure/why it would be a problem to implement this? As long as the repo and this file exists, this will always work, right? – Sidewinder Dec 20 '21 at 07:01
  • 1
    Even if the file exists, its contents can still change. Suppose the name was changed from `suspiciousDomains` to just `Domains`: that would break this program (although you could accommodate this type of change by using split instead of replace). I don't think you'll face these types of issues with regular expressions; you could use the expression `\[.*\]` to extract the required data. Also, using eval on data downloaded from the internet has security risks. – Sujal Singh Dec 20 '21 at 13:12
  • 1
    @SujalSingh Using eval is a [huge security risk](https://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html), especially since the author is scraping the data off the internet – ytan11 Dec 28 '21 at 11:39
1

Disclaimer: I was not able to see the original JS file, so there might be some inaccuracy. This answer was written to provide an alternative to using eval(), which is a huge security risk. Read "Eval really is dangerous".

I assume the Javascript file is something like this:

const suspiciousDomains = {
  "domains": [
    "tinyurl.com/yyw8sy9b",
    "tinyurl.com/yyyz9xdg",
    "token-bit.com"
  ]
};

import json

import requests

RAW_DATA_LINK = "https://raw.githubusercontent.com/nikolaischunk/discord-phishing-links/main/domain-list.js"  # the now dead link


def get_data():
    # credit to @Sujal Singh
    response = requests.get(RAW_DATA_LINK)
    data = response.content.decode().replace("const suspiciousDomains = ", "").replace(
        ";", "")  # or just data[26:-2]
    # use json.loads() instead of eval(): it parses the text and never executes it
    return json.loads(data)

json.loads() does not evaluate the string as code; it only parses it as JSON.
To see what json.loads() does, you can read this.
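Since I could not see the file, here is a small sketch of the same idea for the case where the file holds a bare array rather than an object. That shape is an assumption, and `parse_json_array`, its `prefix` parameter and `SAMPLE` are names I made up for the example. The second call shows the point of using a parser: text that eval() would happily execute is simply rejected.

```python
import json


def parse_json_array(js_source, prefix="const suspiciousDomains = "):
    """Strip the assumed JS declaration and parse the remainder as JSON."""
    text = js_source.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    text = text.rstrip(";").strip()
    # json.loads raises json.JSONDecodeError for anything that is not valid JSON.
    domains = json.loads(text)
    if not isinstance(domains, list):
        raise ValueError("expected a JSON array at the top level")
    return domains


# Hypothetical sample; the real file may be laid out differently.
SAMPLE = 'const suspiciousDomains = ["tinyurl.com/yyw8sy9b", "token-bit.com"];'
print(parse_json_array(SAMPLE))

# Code-like input is refused by the parser instead of being run.
try:
    parse_json_array('const suspiciousDomains = __import__("os").getcwd();')
except json.JSONDecodeError as error:
    print("rejected:", error.msg)
```

One caveat: JSON is stricter than JavaScript, so single-quoted strings or a trailing comma in the array would make json.loads() fail even though the JS file itself is valid.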

ytan11
  • 1
    The actual file contains just a list, not a dictionary, but I suppose this would work perfectly with a bit of modification. For anyone reading this in the future: prefer this answer over mine below because of the security risks... – Sujal Singh Jan 03 '22 at 01:13