I am trying to capture all of the claim text in a set of XML patent files, but I am having trouble with tags nested inside <claim-text>. Sometimes there is another <claim-text> inside it, and sometimes a <claim-ref> interrupts the text. In my output, the claim text gets cut off. There are usually more than 10 claims per file, and I only want the text inside the claim text.
I've already looked at and tried the following, but they don't work: "xml elementree missing elements python" and "How to get all sub-elements of an element tree with Python ElementTree?". I've included only a snippet here, as the whole thing gets quite long.
My code is below (fullname is the file name including its directory):
    for _, elem in iterparse(fullname):
        description = ''  # reset to empty string at beginning of each loop
        abtext = ''  # reset to empty string at beginning of each loop
        claimtext = ''  # reset to empty string
        if elem.tag == 'claims':
            for node4 in tree.findall('.//claims/claim/claim-text'):
                claimtext = claimtext + node4.text
            f.write('\n\nCLAIMTEXT\n\n\n')
            f.write(smart_str(claimtext) + '\n\n')
        # put row in df
        row = dict(zip(['PATENT_ID', 'CLASS', 'ABSTRACT', 'DESCRIPTION', 'CLAIMS'], [data, cat, abtext, description, claimtext]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
So the resulting problem is twofold: (a) only one piece of the text gets written to the file, and (b) nothing ends up in the DataFrame at all. I'm not sure whether these are parts of the same problem or two separate problems. I can get the claims written to a file, and that works, but it skips some of the text.
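For reference, here is a minimal sketch of one direction that might address the truncation, assuming the standard library `xml.etree.ElementTree`. The sample XML and the `extract_claims` helper are hypothetical, written only for illustration. It uses `itertext()`, which walks the text and tail of every nested element in document order, instead of reading `.text` on each <claim-text>:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration; real patent files differ.
SAMPLE = b"""<claims>
<claim id="c1"><claim-text>A device comprising:
<claim-text>a housing.</claim-text>
</claim-text></claim>
<claim id="c2"><claim-text>The device of <claim-ref>claim 1</claim-ref>, wherein the housing is metal.</claim-text></claim>
</claims>"""


def extract_claims(source):
    """Return the full text of each <claim>, including nested tags.

    `source` is a file name or a binary file-like object.
    """
    claims = []
    # With the default events, iterparse yields each element once it is complete.
    for _, elem in ET.iterparse(source):
        if elem.tag == "claim":
            # Join every text/tail fragment, then collapse runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            claims.append(text)
            elem.clear()  # release the finished claim's children
    return claims


claims = extract_claims(io.BytesIO(SAMPLE))
for number, text in enumerate(claims, start=1):
    print(number, text)
```

For the DataFrame half of the problem, a common pattern is to collect one dict per patent in a plain list and build the DataFrame once at the end, since `DataFrame.append` was deprecated and then removed in pandas 2.0. Whether that explains the empty DataFrame here is only a guess.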