In this answer I am going to use the bigxml library for parsing big XML files.
Disclaimer: I am the author of this library.
I don't know pandas, nor exactly what you are trying to do, but since you mentioned CSV and searching, here is an example of how to get some data out of the XML file in a Pythonic way. It should be a good starting point for using that data however you want.
For future reference, here is what the XML file looks like:
<?xml version="1.0" encoding="utf-8"?>
<!-- redacted -->
<trademark-assignments>
  <!-- redacted -->
  <assignment-information>
    <assignment-entry>
      <assignment>
        <reel-no>1</reel-no>
        <frame-no>0001</frame-no>
        <last-update-date>19910716</last-update-date>
        <purge-indicator>N</purge-indicator>
        <date-recorded>19550103</date-recorded>
        <page-count>0</page-count>
        <correspondent>
          <person-or-organization-name/>
        </correspondent>
        <conveyance-text>CHANGE OF NAME 19530513</conveyance-text>
      </assignment>
      <assignors>
        <assignor>
          <person-or-organization-name>HAWK AND BUCK COMPANY, INC., THE</person-or-organization-name>
          <city>DALLAS</city>
          <state>TEXAS</state>
          <execution-date>19530513</execution-date>
          <legal-entity-text>UNKNOWN</legal-entity-text>
        </assignor>
      </assignors>
      <assignees>
        <assignee>
          <person-or-organization-name>GRIFFIN, C. C., MANUFACTURING COMPANY</person-or-organization-name>
          <city>FORT WORTH</city>
          <state>TEXAS</state>
          <legal-entity-text>UNKNOWN</legal-entity-text>
        </assignee>
      </assignees>
      <properties>
        <property>
          <serial-no>71231446</serial-no>
          <registration-no>218184</registration-no>
        </property>
        <property>
          <serial-no>71538408</serial-no>
          <registration-no>506247</registration-no>
        </property>
        <property>
          <serial-no>71510081</serial-no>
          <registration-no>509215</registration-no>
        </property>
      </properties>
    </assignment-entry>
    <!-- redacted -->
  </assignment-information>
</trademark-assignments>
Let's say that, for each assignment entry, you are interested in getting the city and state of the assignor, as well as all properties (serial-no and registration-no).
from dataclasses import dataclass, field
from pprint import pprint
from typing import List, Optional
from zipfile import ZipFile

from bigxml import Parser, xml_handle_element


@xml_handle_element("assignor")
@dataclass
class Assignor:
    city: str = "N/A"
    state: str = "N/A"

    @xml_handle_element("city")
    def handle_city(self, node):
        self.city = node.text

    @xml_handle_element("state")
    def handle_state(self, node):
        self.state = node.text


@xml_handle_element("property")
@dataclass
class Property:
    serial: Optional[int] = None
    registration: Optional[int] = None

    @xml_handle_element("serial-no")
    def handle_serial(self, node):
        self.serial = int(node.text)

    @xml_handle_element("registration-no")
    def handle_registration(self, node):
        self.registration = int(node.text)


@xml_handle_element("trademark-assignments", "assignment-information", "assignment-entry")
@dataclass
class AssignmentEntry:
    assignor: Optional[Assignor] = None
    properties: List[Property] = field(default_factory=list)

    @xml_handle_element("assignors")
    def handle_assignors(self, node):
        self.assignor = node.return_from(Assignor)

    @xml_handle_element("properties")
    def handle_properties(self, node):
        self.properties.extend(node.iter_from(Property))


with ZipFile("asb19550103-20211231-01.zip") as zip_file:
    with zip_file.open("asb19550103-20211231-01.xml") as xml_file:
        for assignment_entry in Parser(xml_file).iter_from(AssignmentEntry):
            pprint(assignment_entry)
            # do whatever you want with assignment_entry here
Running the above code will output all AssignmentEntry instances:
AssignmentEntry(assignor=Assignor(city='DALLAS', state='TEXAS'),
                properties=[Property(serial=71231446, registration=218184),
                            Property(serial=71538408, registration=506247),
                            Property(serial=71510081, registration=509215)])
AssignmentEntry(assignor=Assignor(city='JERSEY CITY', state='NEW JERSEY'),
                properties=[Property(serial=71230951, registration=217985),
                            Property(serial=71224781, registration=212380),
                            Property(serial=71255202, registration=243916),
                            Property(serial=71486386, registration=420259),
                            Property(serial=71515236, registration=434974),
                            Property(serial=71620823, registration=572309)])
AssignmentEntry(assignor=Assignor(city='SANTA BARBARA', state='CALIFORNIA'),
                properties=[Property(serial=71564699, registration=542581),
                            Property(serial=71564406, registration=578399)])
... etc. ...
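Since CSV was mentioned, here is a minimal sketch of how such entries could be flattened into CSV rows with the standard csv module, one row per property. The dataclasses below are plain stand-ins for the ones above (without the bigxml decorators) so that the snippet runs on its own; `entry_to_rows`, `write_csv` and the column names are hypothetical choices for illustration, not part of bigxml.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


# Stand-ins mirroring the dataclasses above, without the bigxml decorators,
# so that this sketch is runnable on its own.
@dataclass
class Assignor:
    city: str = "N/A"
    state: str = "N/A"


@dataclass
class Property:
    serial: Optional[int] = None
    registration: Optional[int] = None


@dataclass
class AssignmentEntry:
    assignor: Optional[Assignor] = None
    properties: List[Property] = field(default_factory=list)


def entry_to_rows(entry: AssignmentEntry) -> Iterator[list]:
    """Yield one CSV row per property of the entry (hypothetical helper)."""
    assignor = entry.assignor or Assignor()  # fall back to the "N/A" defaults
    for prop in entry.properties:
        yield [assignor.city, assignor.state, prop.serial, prop.registration]


def write_csv(entries: Iterable[AssignmentEntry], out_file) -> None:
    """Write a header row, then the rows of each entry, one entry at a time."""
    writer = csv.writer(out_file)
    writer.writerow(["city", "state", "serial_no", "registration_no"])
    for entry in entries:
        writer.writerows(entry_to_rows(entry))


# Small hand-written sample taken from the output above.
sample = [
    AssignmentEntry(
        assignor=Assignor(city="DALLAS", state="TEXAS"),
        properties=[
            Property(serial=71231446, registration=218184),
            Property(serial=71538408, registration=506247),
        ],
    )
]
buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

With the real data, the same function could presumably be fed `Parser(xml_file).iter_from(AssignmentEntry)` directly, together with a file opened with `newline=""`, so that entries are written as they are parsed rather than collected in memory first.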
I chose to use dataclasses to hold the data, but feel free to use any other data representation.
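If plain dictionaries are more convenient for the next step, one possible approach is the standard `dataclasses.asdict` function, which converts nested dataclasses recursively. This is only a sketch: the classes below are stand-ins for the ones above, stripped of the bigxml decorators so the snippet is self-contained.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


# Stand-ins mirroring the dataclasses above, without the bigxml decorators.
@dataclass
class Assignor:
    city: str = "N/A"
    state: str = "N/A"


@dataclass
class Property:
    serial: Optional[int] = None
    registration: Optional[int] = None


@dataclass
class AssignmentEntry:
    assignor: Optional[Assignor] = None
    properties: List[Property] = field(default_factory=list)


entry = AssignmentEntry(
    assignor=Assignor(city="DALLAS", state="TEXAS"),
    properties=[Property(serial=71231446, registration=218184)],
)

# asdict() recurses into nested dataclasses and into lists of dataclasses.
record = asdict(entry)
print(record)
```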
For more information on the usage of the bigxml library, please refer to its documentation.