I'm a Python beginner. I hope my question is not too lengthy; please tell me if I should be more concise in future questions. Thank you!
I'm opening a .XHTML file which contains financial data as XML (iXBRL standard). Right now I'm parsing the file with BeautifulSoup4 ("html.parser").
from bs4 import BeautifulSoup

url = r"tk2021.xhtml"
data = open(url, encoding="utf8")
soup = BeautifulSoup(data, "html.parser")
Then I'm creating different lists, each containing all matching tags. Later I iterate over those lists to pull the relevant data out of each tag and load it into a pd.DataFrame.
ix_nonfraction = soup.find_all({"ix:nonfraction"})
xbrli_unit = soup.find_all({"xbrli:unit"})
This works as expected. What I'm struggling with is the next step.
I'm trying to create another list containing all <xbrli:context> tags. They have <xbrli:entity> child tags, which I need to remove before I create the list. This is how I'm doing that:
for tag in soup("xbrli:entity"):
    tag.decompose()
xbrli_context = soup.find_all({"xbrli:context"})
This also works fine, but I can't access the original soup later in my script (all <xbrli:entity> tags are missing). I also read in the BS4 documentation that "the behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything". So I thought it would be cleaner to create a new soup2 for this operation, so that the original soup can be used later on.
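To make the side effect concrete, here is a minimal sketch on a made-up snippet (the markup and values are invented for illustration, not taken from my real file). It assumes bs4 is installed and shows that decompose() changes soup in place, so the entity tags are gone for everything that runs afterwards:

```python
from bs4 import BeautifulSoup

# Made-up miniature of the structure described above (not real filing data).
sample = (
    "<xbrli:context id='c1'>"
    "<xbrli:entity><xbrli:identifier>123</xbrli:identifier></xbrli:entity>"
    "<xbrli:period><xbrli:instant>2021-12-31</xbrli:instant></xbrli:period>"
    "</xbrli:context>"
)

soup = BeautifulSoup(sample, "html.parser")
entities_before = len(soup.find_all("xbrli:entity"))

# Same removal step as in my script: this mutates soup itself.
for tag in soup("xbrli:entity"):
    tag.decompose()

entities_after = len(soup.find_all("xbrli:entity"))
contexts = soup.find_all("xbrli:context")

print(entities_before, entities_after, len(contexts))
```

The context tag survives, but any later lookup for <xbrli:entity> on the same soup comes back empty, which matches what I see in my script.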
And here's where I don't understand what's happening: when I create a second soup with a different name, soup2 = BeautifulSoup(data, "html.parser"), and use print(soup2.prettify()), it prints nothing. Doing the same with soup works just fine.
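Here is a stripped-down sketch that seems to reproduce the same symptom without my real file. It assumes bs4 is installed and uses an in-memory io.StringIO object with made-up markup as a stand-in for the handle returned by open():

```python
import io

from bs4 import BeautifulSoup

# Stand-in for open("tk2021.xhtml", encoding="utf8"); the markup is made up.
data = io.StringIO("<html><body><p>hello</p></body></html>")

# Both soups are built from the very same file-like object, as in my script.
soup = BeautifulSoup(data, "html.parser")
soup2 = BeautifulSoup(data, "html.parser")

first_has_p = soup.find("p") is not None
second_has_p = soup2.find("p") is not None

print(first_has_p, second_has_p)
```

If this behaves the same way on your machine, the file name and the iXBRL content are probably not the cause; the only thing the two calls share is the same data object being passed in twice.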
Why does soup2 seem to be empty? How do I handle multiple versions of one soup, so that I can always start with the original soup if I want to?