Parsing TEI-XML with Beautiful Soup

Question

I am trying to parse metadata from GROBID output (GROBID parses academic papers in PDF format).

The raw TEI-XML file looks like this (read via soup = read_tei('paper1.tei.xml')):

<?xml version="1.0" encoding="UTF-8"?><html><body><tei xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
<teiheader xml:lang="en">
<filedesc>
<titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt>
<publicationstmt>
<publisher></publisher>
<availability status="unknown"><licence></licence></availability>
<date type="published" when="2022-09-05">September 5, 2022</date>
</publicationstmt>
<sourcedesc>
<biblstruct>
<analytic>
<author role="corresp">
<persname><forename type="first">Titus</forename><surname>Barik</surname></persname>
<email>titus@barik.net</email>
<affiliation key="aff0">
<orgname type="institution">Georgia Institute of Technology</orgname>
</affiliation>
</author>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</analytic>
<monogr>
<imprint>
<date type="published" when="2022-09-05">September 5, 2022</date>
</imprint>
</monogr>
<idno type="MD5">2E695CAEA5E3B30D896FE14E59153667</idno>
</biblstruct>
</sourcedesc>
</filedesc>
<encodingdesc>
<appinfo>
<application ident="GROBID" version="0.7.1" when="2022-09-08T11:25+0000">
<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
<ref target="https://github.com/kermitt2/grobid"></ref>
</application>
</appinfo>
</encodingdesc>
<profiledesc>
<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here is a bit of text in the middle of the document.</p></div>
</abstract>
</profiledesc>
</teiheader>
<text xml:lang="en">
</text>
<back>
<div type="references">
<listbibl>
<biblstruct xml:id="b0">
<analytic>
<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>
<author>
<persname><forename type="first">S</forename><surname>Feiner</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Macintyre</surname></persname>
</author>
<author>
<persname><forename type="first">M</forename><surname>Haupt</surname></persname>
</author>
<author>
<persname><forename type="first">E</forename><surname>Solomon</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. UIST'93</title>
<meeting>UIST'93</meeting>
<imprint>
<date type="published" when="1993">1993</date>
<biblscope from="145" to="155" unit="page"></biblscope>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b1">
<analytic>
<title level="a" type="main">What's real about virtual reality</title>
<author>
<persname><forename type="first">F</forename><forename type="middle">P B</forename><genname>Jr</genname></persname>
</author>
</analytic>
<monogr>
<title level="j">IEEE Computer Graphics and Applications</title>
<imprint>
<biblscope unit="volume">19</biblscope>
<biblscope unit="issue">6</biblscope>
<biblscope from="16" to="27" unit="page"></biblscope>
<date type="published" when="1999-12">Nov.-Dec. 1999</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b2">
<analytic>
<title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>
<author>
<persname><forename type="first">D</forename><surname>Rémy</surname></persname>
</author>
<author>
<persname><forename type="first">J</forename><surname>Vouillon</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">Theory And Practice of Objects Systems</title>
<imprint>
<biblscope unit="volume">4</biblscope>
<biblscope unit="issue">1</biblscope>
<biblscope from="27" to="50" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b3">
<analytic>
<title level="a" type="main">Visualizing data mining models</title>
<author>
<persname><forename type="first">K</forename><surname>Thearling</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Becker</surname></persname>
</author>
<author>
<persname><forename type="first">D</forename><surname>Decosta</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. Integration of Data Mining and Data Visualization Workshop</title>
<meeting>Integration of Data Mining and Data Visualization Workshop</meeting>
<imprint>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic>
<title level="a" type="main">Why no one uses functional languages</title>
<author>
<persname><forename type="first">P</forename><surname>Wadler</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">ACM SIGPLAN Notices</title>
<imprint>
<biblscope unit="volume">33</biblscope>
<biblscope from="23" to="27" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b5">
<analytic>
<title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>
<author>
<persname><forename type="first">Y</forename><surname>Wang</surname></persname>
</author>
<author>
<persname><forename type="first">C</forename><surname>Mackenzie</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. CHI'99</title>
<meeting>CHI'99</meeting>
<imprint>
<date type="published" when="1999-05">May 1999</date>
</imprint>
</monogr>
</biblstruct>
</listbibl>
</div>
</back>
</tei>
</body></html>

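I have written a class that tries to extract the titles of the references:

class TEIFile(object):
    # self.soup is assumed to hold the BeautifulSoup object from read_tei()
    @property
    def reference_titles(self):
        # Find every main title inside the reference list
        reference_data = self.soup.listbibl.find_all('title', type="main")

        result = []

        for reference in reference_data:
            # Each item is still a bs4 Tag, not a plain string
            result.append(reference)

        return result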

which returns

'\'[<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>, <title level="a" type="main">What\'s real about virtual reality</title>, <title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>, <title level="a" type="main">Visualizing data mining models</title>, <title level="a" type="main">Why no one uses functional languages</title>, <title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>]\' '

I am now having difficulty extracting the titles into a list of strings. How can I improve this so that I get the title text instead of the tags?

Do you have an actual source for such a file (not pictures)? — Barry the Platipus, Sep 13 '22 at 13:44
Don't forget to validate/upvote the answer to close your question...SO — Frenchy, Sep 29 '22 at 05:44
This article can help you: https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/ — lucazav, Feb 08 '23 at 16:50

score 2 · Answer 1 · answered Sep 13 '22 at 14:07

Adapt this sample:

from bs4 import BeautifulSoup as bs

# Read the whole XML file into a single string
with open("sample.xml", "r", encoding="utf-8") as file:
    content = file.read()

# Parse with the lxml HTML parser, as in the original sample
bs_content = bs(content, "lxml")

# Collect every <title> element in the document and print its text
result = bs_content.find_all("title")
for t in result:
    print(t.text)

Result:

Fuel Cell Technology An Annotated Bibliography
Fuel Cell Technology An Annotated Bibliography
Windows on the world: 2D windows for 3D augmented reality
Proc. UIST'93
What's real about virtual reality
IEEE Computer Graphics and Applications
Objective ML: An effective object-oriented extension to ML
Theory And Practice of Objects Systems
Visualizing data mining models
Proc. Integration of Data Mining and Data Visualization Workshop
Why no one uses functional languages
ACM SIGPLAN Notices
Object manipulation in virtual environments: Relative size matters
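
Note that the sample above prints every title in the document, including the header title and the journal or proceedings titles. If only the main title of each reference is wanted as a list of strings, as the question asks, something like the following may work. This is a minimal sketch, not a definitive implementation: TEI_SAMPLE is a shortened, hypothetical stand-in for the GROBID output, reference_titles is a standalone helper rather than the asker's class property, and "html.parser" is used only so the sketch has no extra dependency (the "lxml" parser from the sample above should be usable in the same way).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the GROBID output shown in the question
TEI_SAMPLE = """
<tei>
<teiheader><title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title></teiheader>
<back><div type="references"><listbibl>
<biblstruct xml:id="b0">
<analytic><title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title></analytic>
<monogr><title level="m">Proc. UIST'93</title></monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic><title level="a" type="main">Why no one uses functional languages</title></analytic>
<monogr><title level="j">ACM SIGPLAN Notices</title></monogr>
</biblstruct>
</listbibl></div></back>
</tei>
"""


def reference_titles(soup):
    """Return the main title of each reference as a list of plain strings."""
    listbibl = soup.find("listbibl")
    if listbibl is None:  # the document has no reference list
        return []
    # Search only inside <listbibl> and keep only titles with type="main"
    tags = listbibl.find_all("title", attrs={"type": "main"})
    # get_text() converts each Tag into its string content
    return [tag.get_text(strip=True) for tag in tags]


soup = BeautifulSoup(TEI_SAMPLE, "html.parser")
titles = reference_titles(soup)
print(titles)
```

Searching from the listbibl element skips the paper's own title in the header, and the type="main" filter skips the journal and proceedings titles under monogr. In the question's class, the same change would amount to appending reference.get_text() instead of the tag itself.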
