I am trying to parse metadata from GROBID output (GROBID parses academic papers in PDF format). The raw TEI-XML file looks like this (read via soup = read_tei('paper1.tei.xml')):
<?xml version="1.0" encoding="UTF-8"?><html><body><tei xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
<teiheader xml:lang="en">
<filedesc>
<titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt>
<publicationstmt>
<publisher></publisher>
<availability status="unknown"><licence></licence></availability>
<date type="published" when="2022-09-05">September 5, 2022</date>
</publicationstmt>
<sourcedesc>
<biblstruct>
<analytic>
<author role="corresp">
<persname><forename type="first">Titus</forename><surname>Barik</surname></persname>
<email>titus@barik.net</email>
<affiliation key="aff0">
<orgname type="institution">Georgia Institute of Technology</orgname>
</affiliation>
</author>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</analytic>
<monogr>
<imprint>
<date type="published" when="2022-09-05">September 5, 2022</date>
</imprint>
</monogr>
<idno type="MD5">2E695CAEA5E3B30D896FE14E59153667</idno>
</biblstruct>
</sourcedesc>
</filedesc>
<encodingdesc>
<appinfo>
<application ident="GROBID" version="0.7.1" when="2022-09-08T11:25+0000">
<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
<ref target="https://github.com/kermitt2/grobid"></ref>
</application>
</appinfo>
</encodingdesc>
<profiledesc>
<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here is a bit of text in the middle of the document.</p></div>
</abstract>
</profiledesc>
</teiheader>
<text xml:lang="en">
</text>
<back>
<div type="references">
<listbibl>
<biblstruct xml:id="b0">
<analytic>
<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>
<author>
<persname><forename type="first">S</forename><surname>Feiner</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Macintyre</surname></persname>
</author>
<author>
<persname><forename type="first">M</forename><surname>Haupt</surname></persname>
</author>
<author>
<persname><forename type="first">E</forename><surname>Solomon</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. UIST'93</title>
<meeting>UIST'93</meeting>
<imprint>
<date type="published" when="1993">1993</date>
<biblscope from="145" to="155" unit="page"></biblscope>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b1">
<analytic>
<title level="a" type="main">What's real about virtual reality</title>
<author>
<persname><forename type="first">F</forename><forename type="middle">P B</forename><genname>Jr</genname></persname>
</author>
</analytic>
<monogr>
<title level="j">IEEE Computer Graphics and Applications</title>
<imprint>
<biblscope unit="volume">19</biblscope>
<biblscope unit="issue">6</biblscope>
<biblscope from="16" to="27" unit="page"></biblscope>
<date type="published" when="1999-12">Nov.-Dec. 1999</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b2">
<analytic>
<title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>
<author>
<persname><forename type="first">D</forename><surname>Rémy</surname></persname>
</author>
<author>
<persname><forename type="first">J</forename><surname>Vouillon</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">Theory And Practice of Objects Systems</title>
<imprint>
<biblscope unit="volume">4</biblscope>
<biblscope unit="issue">1</biblscope>
<biblscope from="27" to="50" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b3">
<analytic>
<title level="a" type="main">Visualizing data mining models</title>
<author>
<persname><forename type="first">K</forename><surname>Thearling</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Becker</surname></persname>
</author>
<author>
<persname><forename type="first">D</forename><surname>Decosta</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. Integration of Data Mining and Data Visualization Workshop</title>
<meeting>Integration of Data Mining and Data Visualization Workshop</meeting>
<imprint>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic>
<title level="a" type="main">Why no one uses functional languages</title>
<author>
<persname><forename type="first">P</forename><surname>Wadler</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">ACM SIGPLAN Notices</title>
<imprint>
<biblscope unit="volume">33</biblscope>
<biblscope from="23" to="27" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b5">
<analytic>
<title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>
<author>
<persname><forename type="first">Y</forename><surname>Wang</surname></persname>
</author>
<author>
<persname><forename type="first">C</forename><surname>Mackenzie</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. CHI'99</title>
<meeting>CHI'99</meeting>
<imprint>
<date type="published" when="1999-05">May 1999</date>
</imprint>
</monogr>
</biblstruct>
</listbibl>
</div>
</back>
</tei>
</body></html>
I have written a class that tries to extract the titles of the references:
class TEIFile(object):
    @property
    def reference_titles(self):
        reference_data = self.soup.listbibl.find_all('title', type="main")
        result = []
        for reference in reference_data:
            layer1 = reference
            result.append(layer1)
        return result
which returns
'\'[<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>, <title level="a" type="main">What\'s real about virtual reality</title>, <title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>, <title level="a" type="main">Visualizing data mining models</title>, <title level="a" type="main">Why no one uses functional languages</title>, <title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>]\' '
I am having difficulty extracting the titles into a list of plain strings. How can I change this so that it returns just the title text?
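For reference, here is a minimal sketch of the result I am after. It assumes soup is a BeautifulSoup object (which the tag output above suggests) and calls get_text() on each matched tag instead of appending the tag itself. The inline SAMPLE string, the "html.parser" choice, and the standalone reference_titles function are stand-ins for illustration only; they are not my actual read_tei helper or TEIFile class.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the reference section of the TEI file.
SAMPLE = """
<listbibl>
<biblstruct xml:id="b0"><analytic>
<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>
</analytic><monogr><title level="m">Proc. UIST'93</title></monogr></biblstruct>
<biblstruct xml:id="b4"><analytic>
<title level="a" type="main">Why no one uses functional languages</title>
</analytic><monogr><title level="j">ACM SIGPLAN Notices</title></monogr></biblstruct>
</listbibl>
"""


def reference_titles(soup):
    """Return the main title of each reference as a plain string."""
    # find_all returns Tag objects; get_text() pulls out only the text inside each tag.
    main_titles = soup.listbibl.find_all("title", type="main")
    return [tag.get_text(strip=True) for tag in main_titles]


# Assumption: the built-in "html.parser" is used here so no extra parser is required.
soup = BeautifulSoup(SAMPLE, "html.parser")
titles = reference_titles(soup)
print(titles)
# ['Windows on the world: 2D windows for 3D augmented reality', 'Why no one uses functional languages']
```

The monograph-level titles (level="m" or level="j") are skipped because they carry no type="main" attribute, so only the article titles end up in the list.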