0

I'm trying to remove duplicates from a list before I write it to a JSON file. I marked with comments the lines where I implemented the deduplication, and I added extra print statements for debugging. Based on my debugging, the code never reaches the print statements and does not write to the JSON file either. My error lies within the function trendingBot(). As the code currently stands, without uncommenting anything, the duplicates will be written to the JSON file.

    convertToJson(quote_name, quote_price, quote_volume, url)

    quotesArr = []
    # Convert to a JSON  file


    def convertToJson(quote_name, quote_price, quote_volume, url):

        quoteObject = {
            "url": url,
            "Name": quote_name,
            "Price": quote_price,
            "Volume": quote_volume
        }
        quotesArr.append(quoteObject)


    def trendingBot(url, browser):
        browser.get(url)
        trending = getTrendingQuotes(browser)
        for trend in trending:
            getStockDetails(trend, browser)
        # requests finished, write json to file

        # REMOVE ANY DUPLICATE url from the list, then write json to file.
        quotesArr_dict = {quote['url']: quote for quote in quotesArr}
        # print(quotesArr_dict)
        quotesArr = list(quotesArr_dict.values())
        # print(quotesArr)
        with open('trendingQuoteData.json', 'w') as outfile:
            json.dump(quotesArr, outfile)

JSON file with duplicated entries:

    [
      {
        "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
        "Volume": "Volume:\n12,915,903",
        "Price": "$ 7.67",
        "Name": "Aurora Cannabis Inc."
      },
      {
        "url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN",
        "Volume": "Volume:\n548,038",
        "Price": "$ 1.60",
        "Name": "Horizon North Logistics Inc."
      },
      {
        "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
        "Volume": "Volume:\n12,915,903",
        "Price": "$ 7.67",
        "Name": "Aurora Cannabis Inc."
      }
    ]
pennyBoy
  • 2
    Do you think the entire code was necessary to reproduce the example? Please look at [mcve] for tips on how to ask a reproducible question so you can get an answer. – d_kennetz Dec 14 '18 at 23:05
  • 1
    @d_kennetz Sometimes I do it and I get flagged for it. Thanks tho. I updated my post. – pennyBoy Dec 14 '18 at 23:08
  • What is being duplicated? The URL? – OneCricketeer Dec 14 '18 at 23:17
  • @cricket_007 I updated the question with the updated JSON file. I would be filtering based on URLs. – pennyBoy Dec 14 '18 at 23:19
  • Have you looked up "how to remove duplicates from a python list"? – juanpa.arrivillaga Dec 14 '18 at 23:20
  • Why not `json.dump(quotesArr_dict, outfile)`? – OneCricketeer Dec 14 '18 at 23:20
  • @juanpa.arrivillaga yes sir. I tried it a bunch of different ways but for some reason it still does not remove the duplicates. – pennyBoy Dec 14 '18 at 23:20
  • Still does not work. It's supposed to be a list of dictionaries, though. – pennyBoy Dec 14 '18 at 23:21
  • 1
    I just tested out my loop below using your list of dictionaries. It works. –  Dec 14 '18 at 23:25
  • @Jeremiah for some reason it does not write to the JSON file. I'm so confused. – pennyBoy Dec 14 '18 at 23:26
  • 1
    Really? I just tried that and it worked as well. `with open(r'C:\users\jeremiah\trendingQuoteData.json', 'w') as outfile: json.dump((newlist), outfile)` Are you sure you are checking the right spot for the file? –  Dec 14 '18 at 23:32
  • @Jeremiah `newlist = list(quotesArr_dict.values())` that works? – pennyBoy Dec 14 '18 at 23:38
  • @Jeremiah it works now. Maybe it's because I used the same variable `quotesArr` smh when I placed them into the first list. I thought it would have updated the list. – pennyBoy Dec 14 '18 at 23:42
  • 1
    Ok, I just updated the question to use your list, rather than my made up list, and show you what comes out on the other end. –  Dec 14 '18 at 23:44
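
One plausible explanation for the symptoms described in the question (no output from the prints, no file written) is Python's scoping rule: assigning to a name anywhere inside a function makes that name local to the whole function, so reading it before the assignment raises `UnboundLocalError`. The sketch below assumes `quotesArr` is a module-level list, as the posted snippet suggests; `quotes`, `dedupe_broken`, and `dedupe_fixed` are hypothetical names used only for illustration.

```python
quotes = []  # module-level list, standing in for quotesArr (assumption)


def dedupe_broken():
    # Because `quotes` is assigned below, Python treats it as a local name
    # for the entire function, so this read fails before anything is assigned.
    by_url = {q["url"]: q for q in quotes}
    quotes = list(by_url.values())
    return quotes


def dedupe_fixed():
    # Binding the result to a new name leaves the module-level list readable.
    by_url = {q["url"]: q for q in quotes}
    unique = list(by_url.values())
    return unique


quotes.append({"url": "ACB", "Price": "$ 7.67"})
quotes.append({"url": "HNL", "Price": "$ 1.60"})
quotes.append({"url": "ACB", "Price": "$ 7.67"})

try:
    dedupe_broken()
    broken_error = None
except UnboundLocalError as exc:
    broken_error = exc

unique_quotes = dedupe_fixed()
print(type(broken_error).__name__)
print(len(unique_quotes))
```

This matches the observation in the comments that using a different variable name (`newlist`) made the code work. Declaring `global quotesArr` at the top of the function would be another way to avoid the error, at the cost of mutating module state.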

3 Answers

2

If you just want to remove duplicates from a list, you can do that like this:

    import json

    firstlist = [
        {
            "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
            "Volume": "Volume:\n12,915,903",
            "Price": "$ 7.67",
            "Name": "Aurora Cannabis Inc."
        },
        {
            "url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN",
            "Volume": "Volume:\n548,038",
            "Price": "$ 1.60",
            "Name": "Horizon North Logistics Inc."
        },
        {
            "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
            "Volume": "Volume:\n12,915,903",
            "Price": "$ 7.67",
            "Name": "Aurora Cannabis Inc."
        }
    ]

    # Keep only the first occurrence of each dictionary.
    newlist = []
    for i in firstlist:
        if i not in newlist:
            newlist.append(i)

    print(json.dumps(newlist))
    # Output:
    # [{"url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN", "Volume": "Volume:\n12,915,903", "Price": "$ 7.67", "Name": "Aurora Cannabis Inc."}, {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN", "Volume": "Volume:\n548,038", "Price": "$ 1.60", "Name": "Horizon North Logistics Inc."}]

I used json.dumps to show you the resulting string, but json.dump works too if you want to write the list to a file. I tested that as well. It just doesn't return anything you can inspect directly.

  • I updated my post. The two lines `quotesArr_dict = {quote['url']: quote for quote in quotesArr} quotesArr = list(quotesArr_dict.values())` are supposed to remove the duplicates based on the URL. – pennyBoy Dec 14 '18 at 23:12
  • Ok, well the above code should work. Just use `quotesArr` in place of `firstlist`. –  Dec 14 '18 at 23:15
  • 3
    Isn't this O(n^2) runtime? You can do the same with `newlist = list(set(firstlist))` – OneCricketeer Dec 14 '18 at 23:17
  • It's a list of dictionaries. If you try to use `set` it doesn't work because dictionaries are unhashable. –  Dec 14 '18 at 23:24
  • For the question, yes, but your answer isn't using dictionaries, though. – OneCricketeer Dec 14 '18 at 23:24
  • Well, his json file wasn't posted yet, and if I have to create my own list, I'd rather use numbers than dictionaries. It's easier and this code works the same either way. –  Dec 14 '18 at 23:26
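
The two concerns raised in these comments (quadratic runtime, and dictionaries being unhashable) can both be sidestepped by tracking a hashable key for each dictionary in a set. The following is a minimal sketch, not part of the original answer. It assumes every value in the dictionaries is itself hashable (strings, as in the question's data); `dedupe_dicts` and the shortened `url` values are hypothetical.

```python
import json

quotes = [
    {"url": "ACB", "Price": "$ 7.67"},
    {"url": "HNL", "Price": "$ 1.60"},
    {"url": "ACB", "Price": "$ 7.67"},
]


def dedupe_dicts(items):
    """Return items without exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for item in items:
        # A sorted tuple of (key, value) pairs is hashable and ignores key order.
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


unique_quotes = dedupe_dicts(quotes)
json_text = json.dumps(unique_quotes, indent=2)
print(json_text)
```

Each membership test against the set is constant time on average, so the loop as a whole is roughly linear in the number of items rather than quadratic.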
1

I would try with an actual loop rather than a dict-comprehension

    quote_dict = {}
    for quote in quotesArr:
        url = quote['url']
        if url not in quote_dict:
            quote_dict[url] = quote  # Only add if url is not already in dict

    with open('trendingQuoteData.json', 'w') as outfile:
        json.dump(list(quote_dict.values()), outfile)

And rather than dictionaries, I would create a Quote class that implements at least __eq__ so that you can determine equality.
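
A minimal sketch of what such a class might look like follows. It is an illustration under stated assumptions, not the answerer's code: equality is defined by `url` alone (an assumption based on the question filtering on URLs), `__hash__` is added so instances can serve as dictionary keys, and `to_dict` is a hypothetical helper for serialization.

```python
import json


class Quote:
    """Hypothetical quote record; two quotes are equal when their URLs match."""

    def __init__(self, url, name, price, volume):
        self.url = url
        self.name = name
        self.price = price
        self.volume = volume

    def __eq__(self, other):
        if not isinstance(other, Quote):
            return NotImplemented
        return self.url == other.url

    def __hash__(self):
        # Must agree with __eq__: equal quotes need equal hashes.
        return hash(self.url)

    def to_dict(self):
        return {"url": self.url, "Name": self.name,
                "Price": self.price, "Volume": self.volume}


quotes = [
    Quote("ACB", "Aurora Cannabis Inc.", "$ 7.67", "12,915,903"),
    Quote("HNL", "Horizon North Logistics Inc.", "$ 1.60", "548,038"),
    Quote("ACB", "Aurora Cannabis Inc.", "$ 7.67", "12,915,903"),
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_quotes = list(dict.fromkeys(quotes))
json_text = json.dumps([q.to_dict() for q in unique_quotes])
print(json_text)
```

With only `__eq__` defined, `quote in some_list` checks would work, but Python would set `__hash__` to `None`, so the hash-based shortcut used here needs both methods.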

OneCricketeer
  • What I had initially does actually work, but your way is also feasible. Thank you tho. – pennyBoy Dec 14 '18 at 23:44
  • My question was how to remove the duplicates. I thought the way I did initially was not the best way. – pennyBoy Dec 15 '18 at 03:02
  • Your way will always overwrite the matching url, thus getting the final duplicate. My answer only gets the first one... Therefore, you're not preventing duplicates during the code execution, you're overwriting matching events – OneCricketeer Dec 15 '18 at 13:09
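
The difference described in the last comment can be seen with a small example in which the duplicated URL carries different prices. This is an illustrative sketch; the shortened `url` values and the second price are made up for the demonstration.

```python
quotes = [
    {"url": "ACB", "Price": "$ 7.67"},
    {"url": "HNL", "Price": "$ 1.60"},
    {"url": "ACB", "Price": "$ 7.70"},  # same URL, later (hypothetical) price
]

# Dict comprehension: a later entry overwrites an earlier one with the same key.
last_wins = {q["url"]: q for q in quotes}

# Explicit loop: setdefault stores a value only if the key is not present yet.
first_wins = {}
for q in quotes:
    first_wins.setdefault(q["url"], q)

print(last_wins["ACB"]["Price"])   # $ 7.70
print(first_wins["ACB"]["Price"])  # $ 7.67
```

When the duplicates are identical, as in the question's JSON file, both approaches produce the same result. The choice only matters when entries sharing a URL differ.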
0

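If the list holds hashable items such as numbers or strings, the easiest way to do this is to convert it to a set, then convert that back to a list (note that a set does not guarantee the original order):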

mylist = [1,2,3,1,2,3]
mylist2 = list(set(mylist))

print(mylist)
print(mylist2)

This will be the output:

[1, 2, 3, 1, 2, 3]
[1, 2, 3]
Pika Supports Ukraine
  • It's a list of dictionaries. You can't use `set` on a list of dictionaries. –  Dec 14 '18 at 23:39
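
The limitation noted in this comment can be checked directly: building a set from a list of dictionaries raises `TypeError`. The sketch below also shows one order-preserving alternative for the hashable part of the data (the URLs), using `dict.fromkeys`. The `rows` list is a made-up stand-in for the question's data.

```python
rows = [{"url": "ACB"}, {"url": "HNL"}, {"url": "ACB"}]

try:
    set(rows)
    error = None
except TypeError as exc:
    # Dictionaries are mutable and therefore unhashable.
    error = exc

# Strings are hashable, so the URLs themselves can be deduplicated.
urls = [row["url"] for row in rows]
unique_urls = list(dict.fromkeys(urls))  # keeps first-seen order

print(type(error).__name__)
print(unique_urls)
```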