Consider:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> result = requests.get('http://dotancohen.com')
>>> soup = BeautifulSoup(result.text)
>>> a = soup.find('a')
>>> for k,v in a.__dict__.items():
...     print(str(k)+": "+str(v))
... 
can_be_empty_element: False
previous_element: <h1><a class="title" href="/">Dotan Cohen</a></h1>
next_sibling: None
name: a
parent: <h1><a class="title" href="/">Dotan Cohen</a></h1>
namespace: None
prefix: None
previous_sibling: None
attrs: {'href': '/', 'class': ['title']}
next_element: Dotan Cohen
parser_class: <class 'bs4.BeautifulSoup'>
hidden: False
contents: ['Dotan Cohen']
>>> pprint(a)
<a class="title" href="/">Dotan Cohen</a>
>>>

What pprint prints is not the value of any of the attributes that __dict__.items() returns. To me, that means a has attributes that __dict__.items() does not return. How might I access those attributes?

dotancohen
  • Why are you assuming the `str()` representation should match the instance attributes? `attrs` is there, as well as `contents` and `name`, so everything you see in the string representation can be found in the instance attributes as well. – Martijn Pieters Jun 13 '13 at 09:22
  • @MartijnPieters: `repr` rather than `str`, but your point stands! – Tom Anderson Jun 13 '13 at 09:24
  • @MartijnPieters: I agree that everything seen in the string representation can be found in the instance attributes. However notice that the information is in the attributes `previous_element` and `parent`. The actual content of the tag itself is not shown. However, it must be stored _somewhere_ as `pprint()` finds it! So why isn't it returned in `__dict__.items()`? – dotancohen Jun 13 '13 at 10:22
  • Yes, it is shown; the contents of the tag is `Dotan Cohen` and is in the `.contents` attribute. The `parent` and `previous_element` tags are representations of *those* elements, so shown as HTML strings as well. – Martijn Pieters Jun 13 '13 at 10:22
  • Why the downvote? How could I improve the question? – dotancohen Jun 13 '13 at 10:56

1 Answer

There are no attributes missing in the instance dictionary. Let's take a look at the representation of the element:

<a class="title" href="/">Dotan Cohen</a>

We have a tag name (a), attributes (class and href, with values) and textual content (Dotan Cohen). These are all present in the instance attributes you listed:

  • name: a
  • attrs: {'href': '/', 'class': ['title']}
  • contents: ['Dotan Cohen']

contents is a list of direct descendants of this element; there is only one, a text object (NavigableString instances use a representation that looks just like a regular string).

You could use the built-in vars() function to list instance attributes. You are already using pprint(), so rather than looping over .items(), just use pprint(vars(a)) and save yourself typing a full loop. As a bonus, pprint() sorts the keys first:

>>> pprint(vars(a))
{'attrs': {'class': ['title'], 'href': '/'},
 'can_be_empty_element': False,
 'contents': [u'Dotan Cohen'],
 'hidden': False,
 'name': 'a',
 'namespace': None,
 'next_element': u'Dotan Cohen',
 'next_sibling': None,
 'parent': <h1><a class="title" href="/">Dotan Cohen</a></h1>,
 'parser_class': <class 'bs4.BeautifulSoup'>,
 'prefix': None,
 'previous_element': <h1><a class="title" href="/">Dotan Cohen</a></h1>,
 'previous_sibling': None}
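
As a minimal sketch of why vars() is interchangeable with __dict__ here, the following uses a hypothetical Element class (a stand-in for illustration, not part of bs4) and pformat(), the string-returning sibling of pprint():

```python
from pprint import pformat


class Element(object):
    """Hypothetical stand-in for a parsed tag; not part of bs4."""

    def __init__(self, name, attrs, contents):
        # Assigned in a deliberately non-alphabetical order
        self.name = name
        self.attrs = attrs
        self.contents = contents


el = Element('a', {'href': '/'}, ['Dotan Cohen'])

# vars(obj) hands back the very same dictionary object as obj.__dict__
same_dict = vars(el) is el.__dict__
print(same_dict)

# pformat() builds the text that pprint() would print; keys come out sorted
formatted = pformat(vars(el))
print(formatted)
```

The keys appear as attrs, contents, name even though they were assigned in a different order, which is the sorting mentioned above.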

The string you are looking at is built by the .__repr__() hook of the element class:

>>> a.__repr__()
'<a class="title" href="/">Dotan Cohen</a>'

which normally is called when repr() is used on an object:

>>> repr(a)
'<a class="title" href="/">Dotan Cohen</a>'

The string is built up from the parsed element information you see in the object's attributes.
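
To illustrate the idea, here is a rough sketch using a hypothetical ToyTag class. It is heavily simplified and is not how bs4 implements its own Tag; it only shows a __repr__ that assembles markup on demand from name, attrs and contents, so the finished string never needs to be stored on the instance:

```python
class ToyTag(object):
    """Hypothetical, heavily simplified tag; bs4's real Tag does much more."""

    def __init__(self, name, attrs, contents):
        self.name = name
        self.attrs = attrs
        self.contents = contents

    def __repr__(self):
        # Build the attribute text; list values (such as class) are space-joined
        parts = []
        for key, value in sorted(self.attrs.items()):
            if isinstance(value, list):
                value = ' '.join(value)
            parts.append(' %s="%s"' % (key, value))
        inner = ''.join(str(child) for child in self.contents)
        return '<%s%s>%s</%s>' % (self.name, ''.join(parts), inner, self.name)


tag = ToyTag('a', {'href': '/', 'class': ['title']}, ['Dotan Cohen'])

markup = repr(tag)
print(markup)

# None of the stored instance attributes equals the finished string
is_stored = any(value == markup for value in vars(tag).values())
print(is_stored)
```

Sorting the attribute names in this sketch is an arbitrary choice made to keep the output stable; it also shows why the attribute order in the printed markup need not match the order in the original HTML.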

Martijn Pieters
  • From what I understand by your explanation, the actual returned value for the tag is not stored anywhere? That means that the value returned by `pprint()` must be made by some sort of `ToString()` method? I can confirm by looking at the source code of the website parsed that the order of the attributes in the HTML are not the same order as the attributes in the string returned by `pprint()`. – dotancohen Jun 13 '13 at 10:28
  • Yes, you are looking at the `repr()` result of the object. The `__repr__` method takes care of building this from the attribute data. HTML attributes are not ordered (like Python dictionaries). – Martijn Pieters Jun 13 '13 at 10:32
  • I see, thank you Martijn. From googling a bit I see that there exists a `dir()` function that will return all the 'names' of `a`, one of which is `__repr__`. `a.__repr__()` does return `<a class="title" href="/">Dotan Cohen</a>`. However, I am having a hard time finding a definition for the word 'names' other than "names are: variables, modules, functions, etc.". – dotancohen Jun 13 '13 at 11:04
  • @dotancohen: see [what's the biggest difference between dir and \_\_dict\_\_ in python](http://stackoverflow.com/a/14361362) – Martijn Pieters Jun 13 '13 at 11:05