I am running a script to scrape a website for textbook information, and the script works. However, when it writes to a JSON file it gives me duplicate results. I am trying to figure out how to remove the duplicates from the JSON file. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
       item = {}
       item['type'] = "Textbook"
       item['title'] = container.parent.a.text
       item['author'] = container.nextSibling.findNextSibling(text=True)
       item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
       item['source'] = "BC Campus"
       data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output

{
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}
brawlins4
  • **@brawlins4**, your question is excellent and interesting. Please add the Python version (**Python3**) and the dependency (**beautifulsoup4**) so that answerers and readers know the required environment before trying the code. I first tried **Python2.7** and it didn't work. I then searched and found that this **urllib** syntax belongs to **Python3**, so I created a new **conda** environment with **Python3.6**, activated it and installed **beautifulsoup4** using **pip**. Anyway, this was very interesting to solve. – hygull May 05 '18 at 03:00
  • Dear **@brawlins4**, please also mention that a **json** folder must be created first, since your code saves the list of dictionaries as **JSON** in a file called **bc.json** inside the **./json** directory. If someone (like me) copies and runs the code without reading it, it will fail. It would be better to pass a path like **./bc.json** to **open()**. I only suggest this to improve your question. Thanks. – hygull May 05 '18 at 03:12

3 Answers

Figured it out. Here is the solution in case anyone else runs into this issue:

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
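
The `item not in textbook_list` check above compares each dictionary against every one already kept, which is fine for 40 items. For larger lists, a minimal sketch of the same idea using a set of already-seen keys is shown here. It assumes the `link` field identifies a book uniquely; the helper name `dedupe_by_key` and the sample records are hypothetical and only for illustration.

```python
import json


def dedupe_by_key(items, key="link"):
    """Keep the first item for each distinct value of `key`, in original order."""
    seen = set()
    unique = []
    for item in items:
        marker = item[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


# Made-up sample records standing in for the scraped `data` list.
data = [
    {"title": "Book A", "link": "?uuid=a"},
    {"title": "Book A", "link": "?uuid=a"},
    {"title": "Book B", "link": "?uuid=b"},
]

textbook_list = dedupe_by_key(data)
print(json.dumps(textbook_list, ensure_ascii=False))
```

The result can be passed to `json.dump` exactly as in the snippet above.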
brawlins4

You do not need to remove duplicates at all.

You only need to update the code.

Please keep reading. I have provided a detailed description of the problem. Also, don't forget to check this gist https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c which I wrote to debug your code.

» WHERE WAS THE PROBLEM?

I know you want this because you're getting duplicated dictionaries.

This is because you're selecting h4 elements as containers, and on the specified pages https://open.bccampus.ca/find-open-textbooks/ and https://open.bccampus.ca/find-open-textbooks/?start=10 the details of each book contain 2 h4 elements.

That's why, instead of getting a containers list of 20 items (10 from each page), you're getting double that, i.e. a list of 40 items where each item is an h4 element.

Each of these 40 items is a distinct element, but the problem appears when selecting the parent. Both h4 elements of a book share the same parent, so container.parent.a gives the same element and therefore the same text.

Let's clarify the problem using the following dummy HTML.

Note: You can also visit https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c as it has the Python code which I wrote to debug and solve this problem. It may give you some ideas.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>
<li> <!-- 2nd book -->
    <h4>
        <a> Text 3 </a>
    </h4>
    <h4>
        <a> Text 4 </a>
    </h4>
</li>
...
...
<li> <!-- 20th book -->
    <h4>
        <a> Text 39 </a>
    </h4>
    <h4>
        <a> Text 40 </a>
    </h4>
</li>

»» containers = page_soup.find_all("h4"); will give the below list of h4 elements.

[
    <h4>
        <a> Text 1 </a>
    </h4>,
    <h4>
        <a> Text 2 </a>
    </h4>,
    <h4>
        <a> Text 3 </a>
    </h4>,
    <h4>
        <a> Text 4 </a>
    </h4>,
    ...
    ...
    ...
    <h4>
        <a> Text 39 </a>
    </h4>,
    <h4>
        <a> Text 40 </a>
    </h4>
]

»» In case of your code, 1st iteration of inner for loop will refer below element as container variable.

<h4>
    <a> Text 1 </a>
</h4>

»» 2nd iteration will refer below element as container variable.

<h4>
    <a> Text 2 </a>
</h4>

»» In both the above (1st, 2nd) iterations of inner for loop, container.parent; will give the below element.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>

»» And container.parent.a will give the below element.

<a> Text 1 </a>

»» Finally, container.parent.a.text gives the below text as our book title for first 2 books.

Text 1

That's why we are getting duplicated dictionaries: the dynamic title and author come out the same for both h4 elements of a book.
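
The walk-through above can be reproduced on the dummy markup. This is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup, so the calls differ from the ones in the question. `parent_of` is a hypothetical helper mapping, needed because ElementTree elements have no `.parent` attribute.

```python
import xml.etree.ElementTree as ET

# Two "books", each with two h4 elements, as in the dummy markup above.
html = """
<ul>
  <li><h4><a>Text 1</a></h4><h4><a>Text 2</a></h4></li>
  <li><h4><a>Text 3</a></h4><h4><a>Text 4</a></h4></li>
</ul>
""".strip()

root = ET.fromstring(html)

# Build a child -> parent lookup for every element in the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

titles = []
for h4 in root.iter("h4"):
    li = parent_of[h4]          # same <li> for both h4 elements of a book
    first_a = li.find(".//a")   # first <a> under that <li>, in document order
    titles.append(first_a.text)

print(titles)  # ['Text 1', 'Text 1', 'Text 3', 'Text 3']
```

Four h4 elements produce only two distinct titles, each repeated, which matches the duplicated dictionaries in the JSON output.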

Let's get rid of this problem 1 by 1.

» WEB PAGE DETAILS:

  1. We have links of 2 web pages.

  2. Each web page has details of 10 textbooks.

  3. Each book's details contain 2 h4 elements.

  4. So in total, 2x10x2 = 40 h4 elements.

» OUR GOAL:

  1. Our goal is to only get an array/list of 20 dictionaries not 40.

  2. So there's a need to iterate the containers list by 2 items i.e. by just skipping 1 item in each iteration.
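
Stepping through a list two items at a time can be sketched as follows. The strings in `containers` are placeholders for the h4 elements, not real scraped values.

```python
# Placeholder list: every book contributes two consecutive entries.
containers = [
    "h4 #1 (title)", "h4 #2 (categories)",
    "h4 #3 (title)", "h4 #4 (categories)",
]

# Option 1: index with a step of 2, visiting positions 0, 2, 4, ...
by_index = [containers[i] for i in range(0, len(containers), 2)]

# Option 2: an extended slice gives the same elements without indices.
by_slice = containers[::2]

# Option 3: pair each first h4 with the second h4 of the same book.
pairs = list(zip(containers[::2], containers[1::2]))

print(by_index)
print(pairs)
```

The index form is handy when the second h4 of each book is also needed as `containers[index + 1]`.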

» MODIFIED WORKING CODE:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
  'https://open.bccampus.ca/find-open-textbooks/', 
  'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}
        item['type'] = "Textbook"
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['title'] = containers[index].parent.a.text
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True)

        data.append(item) # add the item to the list (inside the inner loop, once per book)

with open("./json/bc-modified-final.json", "w") as writeJSON:
  json.dump(data, writeJSON, ensure_ascii=False)

» OUTPUT:

[
    {
        "type": "Textbook",
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Exploring Movie Construction and Production",
        "authors": " John Reich, SUNY Genesee Community College",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Project Management",
        "authors": " Adrienne Watt",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    ...
    ...
    ...
    {
        "type": "Textbook",
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "authors": " Michelle Bonczek Evory. Western Michigan University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus"
    }
]

Finally, I modified your code further and added more details (description, date and categories) to each dictionary.

Python version: 3.6

Dependency: pip install beautifulsoup4

» MODIFIED WORKING CODE (ENHANCED VERSION):

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/', 
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}

        # Store book's information as per given the web page (all 5 are dynamic)
        item['title'] = containers[index].parent.a.text
        item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
        item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
        item["description"] = containers[index].parent.p.text.strip()

        # Store extra information (1st is dynamic, last 2 are static)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['type'] = "Textbook"

        data.append(item) # add the item to the list

with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» OUTPUT (ENHANCED VERSION):

[
    {
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "catagories": [
            "Ancillary Resources"
        ],
        "authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "date": "May 3, 2018",
        "description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    {
        "title": "Exploring Movie Construction and Production",
        "catagories": [
            "Adopted"
        ],
        "authors": "John Reich, SUNY Genesee Community College",
        "date": "May 2, 2018",
        "description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    ...
    ...
    ...
    {
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "catagories": [],
        "authors": "Michelle Bonczek Evory. Western Michigan University",
        "date": "Apr 27, 2018",
        "description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations  offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    }
]

That's it. Thanks.

hygull

It is better to use a set instead of a list. A set does not preserve order, but unlike a list it does not store duplicates.

Change this in your code

data = []

To

data = set()

And

data.append(item)

To

data.add(item)
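
One caveat when trying this: a Python `dict` is not hashable, so adding `item` to a set directly raises `TypeError`, and `json.dump` cannot serialize a set either. A minimal sketch of one workaround is to store each item as a sorted tuple of its key/value pairs and convert back to dictionaries before writing. The sample records are made up, and the approach assumes every value is itself hashable (plain strings here, not lists).

```python
import json

# Made-up sample records standing in for the scraped items.
items = [
    {"title": "Project Management", "author": "Adrienne Watt"},
    {"title": "Project Management", "author": "Adrienne Watt"},
]

# A dict cannot be a set member.
try:
    {items[0]}
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False

# Workaround: a sorted tuple of (key, value) pairs is hashable.
unique = {tuple(sorted(item.items())) for item in items}

# Convert back to a list of dicts so json can serialize it.
data = [dict(pairs) for pairs in sorted(unique)]
print(json.dumps(data, ensure_ascii=False))
```

As the answer notes, the original order is lost; the `sorted` call only makes the output order deterministic.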
anjaneyulubatta505