
I'm a Python beginner; I hope my question is not too lengthy. Please tell me if I should be more concise in future questions, thank you!

I'm opening a .XHTML file which contains financial data as XML (iXBRL standard). Right now I'm parsing the file with BeautifulSoup4 ("html.parser").

from bs4 import BeautifulSoup

url = r"tk2021.xhtml"
data = open(url, encoding="utf8")

soup = BeautifulSoup(data, "html.parser")

Then I'm creating different lists which contain all matching tags. I'm using those lists later to iterate and pull all relevant data out of each tag and load it into a pd.DataFrame (a rough sketch of that step follows the next code block).

ix_nonfraction = soup.find_all("ix:nonfraction")
xbrli_unit = soup.find_all("xbrli:unit")
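
For context, the later extraction step looks roughly like this sketch (the exact attributes pulled here are only illustrative):

import pandas as pd

# Rough sketch: collect one row per ix:nonfraction tag
# (html.parser lowercases attribute names, hence "contextref")
rows = []
for tag in ix_nonfraction:
    rows.append({
        "name": tag.get("name"),
        "contextref": tag.get("contextref"),
        "value": tag.get_text(strip=True),
    })
df = pd.DataFrame(rows)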

This works as expected. What I'm struggling with is the next step.

I'm trying to create another list containing all <xbrli:context> tags. They have <xbrli:entity> child tags, which I need to remove before I create the list. This is how I'm doing that:

for tag in soup("xbrli:entity"):
    tag.decompose()

xbrli_context = soup.find_all("xbrli:context")

This also works fine, but then I can't access the original soup later in my script (all <xbrli:entity> tags are missing). Also, I read in the BS4 documentation that "the behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything". So I thought it would be cleaner to create a new soup2 for this operation, so that the original soup can still be used later on.

And here's where I don't understand what's happening: when I create a second soup with a different name, soup2 = BeautifulSoup(data, "html.parser"), and use print(soup2.prettify()), it prints nothing. Doing the same with soup works just fine.

Why does soup2 seem to be empty? How do I handle multiple versions of one soup, so that I can always start from the original soup if I want to?

  • `data` was exhausted after you read it once. – Unmitigated Mar 30 '23 at 14:27
  • `data` is an open file object. Once you've read it, there's nothing more to read. You either need to reopen the file, or rewind it to the beginning with `data.seek(0)` (see the sketch after these comments). – jasonharper Mar 30 '23 at 14:28
  • Ah, thanks to all of you! I need to read about what exactly open file objects are and how they work -- thank you! Regarding my question about best practice, would you agree that creating different soups for different operations is a good idea? – henri.haiti Mar 30 '23 at 14:32
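
To illustrate the rewind suggested in the comments, a minimal sketch (reusing the open `data` file object from the question):

soup = BeautifulSoup(data, "html.parser")   # first parse exhausts the file object

data.seek(0)                                # rewind to the start of the file
soup2 = BeautifulSoup(data, "html.parser")  # second parse now sees the full content

data.close()                                # close the file once both parses are done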

2 Answers


As already mentioned in the comments, since `data` is a file object, after it's been read by BeautifulSoup the first time it needs to be re-opened before being read again. You probably wouldn't have had that issue if you had used

with open(url, encoding="utf8") as f:
    data = f.read()

since .read() returns a string; `data` would then just be a string.
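
With `data` as a string, you can build as many independent soups from it as you like; for example, a small sketch based on the tags from the question:

soup = BeautifulSoup(data, "html.parser")
soup2 = BeautifulSoup(data, "html.parser")  # a string can be re-read any number of times

# decompose only in the copy; the original soup keeps its <xbrli:entity> tags
for tag in soup2("xbrli:entity"):
    tag.decompose()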


You can also just do away with data entirely and use

# soup = BeautifulSoup(open(url, encoding="utf8"), "html.parser")  ## less safe
with open(url, encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")

Btw, it's better to use `with open`, since a bare `open` should be followed by `.close()` later [but you can't do that if you do it like in the commented line].

Driftr95
  • Hmm, I tested it but it doesn't work, or I'm missing something else. Creating `soup` and `soup2` by opening the file like you propose and calling `data` for each returns empty when I `print(soup2)`. When I create a separate `data2` for `soup2`, both `print(soup)` and `print(soup2)` contain XHTML code. So this is what I'm doing for now; it doesn't look nice but it works – henri.haiti Mar 31 '23 at 15:16
  • @henri.haiti How strange - I [can't replicate](https://i.stack.imgur.com/4kPKs.png) such behavior at all, which I rather expected, since a string object *shouldn't* empty out just like that... Are you sure you're not re-defining `data` anywhere in your code before trying to create `soup2`? Also, it's not *that* bad to have `data2`, but you could do away with `data` entirely if it only works the 1st time and just pass `open...` to BeautifulSoup [I added an edit to my answer]. – Driftr95 Mar 31 '23 at 17:15
  • Ah, nice! This is a cleaner looking way to create `soup` without multiple `data` and does the same. This gets an entry in my Python cheat sheet document, thanks! – henri.haiti Mar 31 '23 at 17:24

I do not recommend reading an Inline XBRL file at the level of XML or XHTML. Rather, it is highly recommended to use an XBRL processor, which will provide the XBRL semantics at the right level of abstraction.

The XBRL data model is based on data cubes, and by reading the data directly as XML, you are essentially re-building an XBRL processor from scratch.

For example, there is an open-source processor called Arelle, which is available in Python:

https://pypi.org/project/arelle/

Main project page: https://arelle.org/arelle/
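
A rough, untested sketch of loading a filing with Arelle's scripting entry point (the file name is the one from the question; check the Arelle documentation for the current API):

from arelle import Cntlr

# Assumption: the classic Cntlr/modelManager entry point from Arelle's examples
cntlr = Cntlr.Cntlr()
model_xbrl = cntlr.modelManager.load("tk2021.xhtml")

# Each fact arrives with its concept, context and unit already resolved
for fact in model_xbrl.facts:
    print(fact.qname, fact.value, fact.contextID, fact.unitID)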

Ghislain Fourny
  • Thanks for the warning, you're absolutely right! But I tried Arelle and other XBRL parsers and couldn't get them to work. So I figured that if I have to understand XBRL better first to make those parsers work, I could just try to get as far as possible writing my own parser, learning Python and XBRL while I'm doing that. Values, units, contextrefs, decimals and scales are already working without errors (tested with 40 reports). Still struggling with label mapping, but I'm getting there ;) But I'm sure there will be a point where I just switch to Arelle and call it a day – henri.haiti Mar 31 '23 at 15:24