My end goal here is to create a primitive plagiarism checker for a given text file. I plan to do this by first splitting the data into sentences, searching Google for each sentence, and finally scanning each of the first few URLs returned by Google for occurrences of the sentence or its substrings. This last step is the one I'm having trouble with.
When running through each URL in a for loop, I first read the contents of the URL using urllib.request.urlopen(), but I'm not sure what to do after that. Code is attached below, with some solutions I've tried commented out. I've imported the googlesearch, urllib.request, and re libraries.
from googlesearch import search
from urllib.request import urlopen

def plagCheck():
    global inpFile
    with open(inpFile) as data:
        sentences = data.read().split(".")
    for sentence in sentences:
        for url in search(sentence, tld='com', lang='en', num=5, start=0, stop=5, pause=2.0):
            content = urlopen(url).read()
            # Attempt 1: substring check
            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")
            # Attempt 2: regex search
            # matches = findall(sentence, content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")
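For context, here is a minimal sketch of the matching step I'm trying to get working. The helper name and the utf-8 default are my own choices, not from any library. The two things it accounts for: urlopen(url).read() returns bytes, which can't be compared against a str sentence directly, and sentences may contain regex metacharacters (like '.'), so the pattern needs escaping before a regex search:

```python
import re

def contains_sentence(page_bytes, sentence, encoding="utf-8"):
    # Hypothetical helper: check whether `sentence` occurs in a fetched page.
    # urlopen(url).read() returns bytes, so decode to str first;
    # errors="ignore" skips bytes that don't fit the assumed encoding.
    text = page_bytes.decode(encoding, errors="ignore")
    # re.escape because the sentence is literal text, not a regex pattern.
    return re.search(re.escape(sentence.strip()), text) is not None
```

Inside the loop this would replace the commented-out attempts, e.g. `print("yes" if contains_sentence(content, sentence) else "no")`.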