How to remove duplicate values from a Beautiful Soup result set whilst preserving order?

Question

I am looping over the values in a Beautiful Soup result set and treating each one differently depending on its contents, e.g.:

for i in bs_result_set:
    if 'this unique string' in i.text:
        print('aaaa')
    else:
        print('bbbb')

However, I have realised that the element matching the unique condition actually occurs twice in the result set. I do not need the second, duplicate value, so I want to remove it from the result set before looping.

I have tried approaches for removing duplicate values from a list (whilst preserving order), but these do not seem to work on a Beautiful Soup result set. E.g. I used the logic from this post to try:

from collections import OrderedDict 
OrderedDict.fromkeys(bs_result_set).keys()

But that didn't seem to remove the duplicate values.

So my question is: how do I remove duplicate values from a Beautiful Soup result set whilst preserving order?

What defines a duplicate though? Are the attribute values equal? Or just the attribute names? Should the textual content match exactly or just both have the same substring? What about nested elements? — Martijn Pieters, May 02 '13 at 10:56
good questions, the values are exact duplicates, they are both a div containing lots of text, html tags and comments. — user1063287, May 02 '13 at 11:01
It is interesting then that the `OrderedDict.fromkeys()` trick does not work for you; BS4 `Tag` elements define equality just like that; same name, same attributes (names and values) and same contents (tested recursively). Can you test if `elemA == elemB` is `True` for the elements that you think are duplicates? — Martijn Pieters, May 02 '13 at 11:52
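
One possible explanation, offered as an assumption rather than something confirmed in this thread: a dict drops a duplicate key only when both the hash and the equality check agree. If a class defines `==` by content but hashes by identity, `elemA == elemB` can be `True` while `OrderedDict.fromkeys()` still keeps both. The sketch below shows that effect with `FakeTag`, a hypothetical stand-in class that is not part of bs4, and then keys on the string form instead.

```python
from collections import OrderedDict


class FakeTag(object):
    """Hypothetical stand-in for a parsed element; not a real bs4 class."""

    def __init__(self, markup):
        self.markup = markup

    def __eq__(self, other):
        # Equal when the markup matches, i.e. comparison by content.
        return isinstance(other, FakeTag) and self.markup == other.markup

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Identity-based hash: equal objects can still hash differently.
        return id(self)

    def __str__(self):
        return self.markup


results = [FakeTag("<div>a</div>"), FakeTag("<div>b</div>"), FakeTag("<div>a</div>")]

# The first and third elements compare equal...
same = results[0] == results[2]

# ...yet fromkeys() keeps all three, because their hashes differ.
by_object = list(OrderedDict.fromkeys(results).keys())

# Keying on the string form gives equal markup an equal hash.
unique = OrderedDict()
for tag in results:
    unique.setdefault(str(tag), tag)  # keep the first occurrence only
by_markup = list(unique.values())
```

Here `by_object` still holds three elements while `by_markup` holds two, in their original order. Whether real result-set elements behave like `FakeTag` depends on the installed Beautiful Soup version, so it is worth running the `elemA == elemB` test suggested above first.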

score 0 · Answer 1 · 2013-05-02T11:14:49.920

What about:

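h = {}  # elements already handled
for i in bs_result_set:
    if i not in h:
        # first time this element is seen: process it
        if 'this unique string' in i.text:
            print('aaaa')
        else:
            print('bbbb')
        h[i] = 1  # mark as seen so later duplicates are skipped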

If the key is not i itself but something derived from i (a computed value, a field, etc.), you can do:

h = {}  # keys already handled
for i in bs_result_set:
    key = str(i)  # example only: substitute any value derived from i
    if key not in h:
        if 'this unique string' in i.text:
            print('aaaa')
        else:
            print('bbbb')
        h[key] = 1  # mark the key as seen
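
The same keyed check can be wrapped in a small generator that uses a set for the bookkeeping. This is only a sketch: `unique_in_order` is a hypothetical helper name, the default `key=str` is an assumption about what counts as a duplicate, and the list of strings merely stands in for `bs_result_set` so the example runs without Beautiful Soup installed.

```python
def unique_in_order(items, key=str):
    """Yield each item whose key has not been seen yet, in original order."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


# Plain strings standing in for the elements of bs_result_set.
divs = [
    "<div>this unique string</div>",
    "<div>something else</div>",
    "<div>this unique string</div>",
]

labels = []
for d in unique_in_order(divs):
    if 'this unique string' in d:
        labels.append('aaaa')
    else:
        labels.append('bbbb')
```

With real elements the loop body would test `i.text` as in the code above, and `key=str` would compare the rendered markup of each element. Pass a different `key` function if duplicates should be decided some other way.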

edited May 02 '13 at 11:14

answered May 02 '13 at 11:01

what does `h[i] = 1` do? is that somehow saying `i` in `h` is limited to one occurrence only? – user1063287 May 02 '13 at 11:04
*h[i] = 1* simply adds *i* to the hash table *h* (actually, you can do the same with a set). – May 02 '13 at 11:13
