I've been trying to write a script that checks Instagram usernames and tells me whether they're available. Originally I used a status-code check, but it turns out Instagram doesn't actually return a 404 on its empty pages; it still returns 200. So I then imported Beautiful Soup to try to parse the HTML for the page title text, but for whatever reason the HTML being parsed looks very different from the actual page, and is missing the title text as well as many other elements entirely. I'm also pretty sure the logic at the bottom is wrong.
This is my code (sorry if it's horrible, I'm new):
```python
# random word selector stuff
r = requests.get(url)
text = r.text
words = text.split()
rng = randint(0, len(words))
# random proxy selector stuff
ipreq = requests.get(url2)
nums = ipreq.text
ips = nums.split()
randomcheck = randint(0, len(ips))
# this one worked: 103.149.130.38:80
proxies = {
    'http': 'http://' + '103.149.130.38:80',
    'https': 'https://' + ips[randomcheck],
}
print(proxies)
while True:
    user = ""
    for words[rng] in random.choices(words):
        user = user + words[rng]
    response = requests.get("http://www.instagram.com/{user}/", proxies=proxies, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}, timeout=100)
    ref = requests.get("http://www.instagram.com/{user}/")
    ref.text
    src = 'Page not found • Instagram'
    soup = BeautifulSoup(ref.text, "html.parser")
    print(soup.prettify())
    if (soup.find_all(response) == src):
        print(Fore.LIGHTBLUE_EX + f"NOT FOUND: {user}" + Fore.RESET)
    elif (soup.find_all(response) != src):
        print(Fore.GREEN + f"USER FOUND: {user}" + Fore.RESET)
    else:
        print("BLOCKED FROM INSTAGRAM")
    time.sleep(15)
```

I think most of the issue is happening in this block:
```python
response = requests.get("http://www.instagram.com/{user}/", proxies=proxies, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}, timeout=100)
ref = requests.get("http://www.instagram.com/{user}/")
ref.text
src = 'Page not found • Instagram'
soup = BeautifulSoup(ref.text, "html.parser")
print(soup.prettify())
```

And I think that last bit of logic probably needs fixing too, but at the moment the main issue seems to be the incorrect or incomplete parsing.
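One thing I noticed while writing this up: neither URL string has an `f` prefix, so `{user}` is never substituted and every request goes to the literal `/{user}/` path. A quick demonstration of the difference (with a made-up username):

```python
user = "someword"
plain = "http://www.instagram.com/{user}/"       # no f prefix: braces stay literal
formatted = f"http://www.instagram.com/{user}/"  # f-string: substitutes the variable
print(plain)      # http://www.instagram.com/{user}/
print(formatted)  # http://www.instagram.com/someword/
```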
For the proper URLs and the rest of the setup, here's a pastebin (the site thought this post was spam when I included them directly): https://pastebin.com/BStK839i
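To be clear about the check I'm trying to express at the end, it's "does the page title equal Instagram's not-found text". Here's a self-contained sketch of just that comparison, run against canned HTML with only the stdlib parser (the `TitleGrabber` class, the canned pages, and the assumption that the title even appears in the raw HTML are all mine — the last one may be exactly what's failing, since the real page seems to be rendered by JavaScript):

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

NOT_FOUND = "Page not found • Instagram"

def looks_available(html: str) -> bool:
    # Hypothetical check: a username is "available" if the page title
    # matches Instagram's not-found text. Assumes the title is present
    # in the raw HTML at all, which may not hold for a JS-rendered page.
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip() == NOT_FOUND

# Canned pages standing in for real responses:
missing = "<html><head><title>Page not found • Instagram</title></head></html>"
taken = "<html><head><title>SomeUser (@someuser) • Instagram</title></head></html>"
print(looks_available(missing))  # True
print(looks_available(taken))    # False
```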