I have a scenario where I am searching through values in a Beautiful Soup result set and treating them differently depending on their contents, e.g.:

for i in bs_result_set:
    if 'this unique string' in i.text:
        print('aaaa')
    else:
        print('bbbb')

However, I have realised that the unique condition actually occurs twice in the result set. I do not need the second, duplicate value and therefore want to remove it from the result set in the first place.

I have tried approaches for removing duplicate values from a list (whilst preserving order), but these do not seem to work on an object that is a Beautiful Soup result set. E.g. I used the logic from this post to try:

from collections import OrderedDict 
OrderedDict.fromkeys(bs_result_set).keys()

But that didn't seem to remove the duplicate values.

So my question is: how do I remove duplicate values from a Beautiful Soup result set whilst preserving order?

user1063287
  • What defines a duplicate though? Are the attribute values equal? Or just the attribute names? Should the textual content match exactly or just both have the same substring? What about nested elements? – Martijn Pieters May 02 '13 at 10:56
  • good questions, the values are exact duplicates, they are both a div containing lots of text, html tags and comments. – user1063287 May 02 '13 at 11:01
  • It is interesting then that the `OrderedDict.fromkeys()` trick does not work for you; BS4 `Tag` elements define equality just like that; same name, same attributes (names and values) and same contents (tested recursively). Can you test if `elemA == elemB` is `True` for the elements that you think are duplicates? – Martijn Pieters May 02 '13 at 11:52

1 Answer

What about:

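h = {}  # elements already handled
for i in bs_result_set:
    if i not in h:
        # first time this element is seen: process it
        if 'this unique string' in i.text:
            print('aaaa')
        else:
            print('bbbb')
        h[i] = 1  # record i so that later duplicates are skipped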

If the key is not i itself but something derived from i (a computed value, a field, etc.), you can do:

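h = {}  # keys already handled
for i in bs_result_set:
    # str(i) is only an example; use any hashable value derived from i
    key = str(i)
    if key not in h:
        if 'this unique string' in i.text:
            print('aaaa')
        else:
            print('bbbb')
        h[key] = 1  # record the key so that later duplicates are skipped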
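
If you would rather remove the duplicates up front and keep the processing loop unchanged, the same idea can be wrapped in a small helper that uses a set in place of h. This is only a sketch: the name dedupe_preserving_order and its key parameter are made up for illustration, and plain strings stand in for the tags. For a real result set, passing key=str would compare elements by their serialized markup, which is an assumption about what counts as a duplicate in your data.

```python
def dedupe_preserving_order(items, key=lambda item: item):
    """Return a list of items with later duplicates dropped, in first-seen order.

    `key` maps each item to the hashable value used to decide equality.
    """
    seen = set()      # keys encountered so far
    unique = []       # first occurrence of each key, in original order
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Plain strings stand in for the elements of a result set (illustrative data).
sample = ["<div>a</div>", "<p>b</p>", "<div>a</div>", "<p>c</p>"]
deduped = dedupe_preserving_order(sample, key=str)
```

You can then iterate over deduped exactly as in the original loop from the question.
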
  • what does `h[i] = 1` do? is that somehow saying `i` in `h` is limited to one occurrence only? – user1063287 May 02 '13 at 11:04
  • *h[i] = 1* simply adds *i* to the hash table *h* (actually, you can do the same with a set). –  May 02 '13 at 11:13