
My end goal here is to create a primitive plagiarism checker given a text file. I plan to do this by first splitting the data into sentences, searching for each sentence on Google, and finally searching each of the first few URLs returned by Google for occurrences of the sentence or its substrings. This last step is the one I'm having trouble with.

When running through each URL in a for loop, I first read the contents of the URL using urlopen(), but I'm not sure what to do after that. Code is attached below, with some solutions I've tried commented out. I've imported the googlesearch, urllib.request, and re libraries.

from googlesearch import search
from urllib.request import urlopen
from re import findall

def plagCheck():

    global inpFile

    with open(inpFile) as data:
        sentences = data.read().split(".")

    for sentence in sentences:
        for url in search(sentence, tld='com', lang='en', num=5, start=0, stop=5, pause=2.0):
            # urlopen() returns bytes, so decode before comparing against a str
            content = urlopen(url).read().decode("utf-8", errors="ignore")

            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")

            # matches = findall(sentence, content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")

martineau

1 Answer


If I understand your code correctly, you now have a Python list of sentences, split on periods. Splitting only on periods will merge sentences that end with other punctuation (? or !) into fairly large run-on chunks.

I would consider using a similarity-checker library. Python's built-in difflib module has a SequenceMatcher class. Then decide on some percentage at which to flag a sentence, e.g. if it's 40% the same. This reduces the amount of content you have to check manually.
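A minimal sketch of that comparison, assuming two plain sentence strings (the 0.4 cutoff is just the 40% figure mentioned above, and the sample sentences are invented):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio of matching subsequences between the two strings, 0.0 to 1.0
    return SequenceMatcher(None, a, b).ratio()

original = "The quick brown fox jumps over the lazy dog"
candidate = "A quick brown fox jumped over a lazy dog"

score = similarity(original, candidate)
if score >= 0.4:  # flag anything 40% similar or more
    print(f"Possible match ({score:.0%})")
```

SequenceMatcher also has `quick_ratio()` and `real_quick_ratio()` if the full comparison turns out to be too slow over many sentences.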

Expanding the set of punctuation you split on might look something like this:

with open(inpFile) as data:
    # Replace all ! and ? with . so a single split catches every sentence boundary
    sentences = data.read().replace("!", ".").replace("?", ".").split(".")
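An alternative sketch using re.split, which handles all three punctuation marks in one pass and drops empty pieces (the helper name is my own, not from the question):

```python
import re

def split_sentences(text):
    # Split on runs of ., !, or ? and strip out empty/whitespace-only pieces
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

print(split_sentences("Is this copied? Maybe! We will see."))
# → ['Is this copied', 'Maybe', 'We will see']
```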

Then I would write your results for this file back to a new output file, something like this:

# Loop over each sentence and run it through Google
# Compare the two sentences with the SequenceMatcher mentioned above (difflib)
# Add them to a dictionary with the percent, url, and sentence in question
# Sample result
results = {
    "sentence_num": 0,
    "percent": 0.8,
    "url": "the google url found on",
    "original_sentence": "Red green fox over the wall",
}
outputStr = "<html>"
# Loop over the results and format the dictionary in a way that you can read.
# Ideally an HTML table with columns representing the keys above
outputStr += "<table>"  # etc.
with open(outputFile, "w") as outFile:  # "w" mode, so the file is opened for writing
    outFile.write(outputStr)
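Filling in those comments, a rough end-to-end sketch of the report step (the results list, column order, and output file name are all assumptions, reusing the sample dictionary above):

```python
# A list of result dictionaries, one per flagged sentence
results = [
    {"sentence_num": 0, "percent": 0.8, "url": "https://example.com/page",
     "original_sentence": "Red green fox over the wall"},
]

# One <tr> per result, columns matching the dictionary keys
rows = "".join(
    "<tr><td>{sentence_num}</td><td>{percent:.0%}</td>"
    "<td>{url}</td><td>{original_sentence}</td></tr>".format(**r)
    for r in results
)
outputStr = (
    "<html><body><table>"
    "<tr><th>#</th><th>Match</th><th>URL</th><th>Sentence</th></tr>"
    + rows + "</table></body></html>"
)

with open("report.html", "w") as outFile:  # "w" mode for writing
    outFile.write(outputStr)
```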




You could even go as far as highlighting table rows based on the percentage, e.g.:

  • 80% and above: red
  • 61–79%: orange
  • 40–60%: yellow
  • 39% and below: green
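Those bands could be a small helper function (the cutoffs are just the ones listed above):

```python
def row_color(percent):
    # Map a similarity ratio (0.0 to 1.0) to a highlight colour for the table row
    if percent >= 0.80:
        return "red"
    if percent >= 0.61:
        return "orange"
    if percent >= 0.40:
        return "yellow"
    return "green"

print(row_color(0.85), row_color(0.70), row_color(0.50), row_color(0.10))
# → red orange yellow green
```

Each row's `<tr>` could then get a `style="background-color: ..."` attribute from this function when the table is built.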

Krowvin
  • Thanks, the sequence matcher suggestion really helps! That was one aspect I was having trouble with, the other is actually getting the text pulled from the URL. Do you have any other tips for this? – anirudhc1229 May 13 '21 at 18:38
  • I was not sure what `search()` did in your code. I don't see a function or an import for it, so I assumed you had that figured out. I would use a Google API for search results. I found [this](https://developers.google.com/custom-search/v1/overview) one doing a quick search on the subject. The idea is that you query the API with a "search string" (the sentence you want to check), and the result is returned as JSON. You could use `import json` to parse the JSON in Python into a sentence you can check with the sequence matcher. Look at the other algorithms in the link I sent. – Krowvin May 13 '21 at 18:51
  • I'm realizing now that the Google API I linked is paid beyond 100 free searches. There are most likely other "free" Google APIs that you can use without cost. Here's a post on [plagiarism APIs](https://www.quora.com/What-is-the-best-Plagiarism-Checker-with-free-API) and costs. And perhaps the URL you are using already lets you check against some source for free. If you need to parse HTML from a page to check against your sentences, that's another story. – Krowvin May 13 '21 at 18:56
  • Just letting you know I got it working, thanks for your help! I used the BeautifulSoup library for parsing the HTML into actual text – anirudhc1229 May 13 '21 at 21:11
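For anyone landing here later, the BeautifulSoup step mentioned in the last comment might look roughly like this (requires the third-party beautifulsoup4 package; the HTML snippet here is a stand-in for the bytes that `urlopen(url).read()` returns):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = b"<html><body><p>Red green fox over the wall.</p></body></html>"
# get_text() strips the tags and returns only the visible text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
print("Red green fox" in text)
# → True
```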