File Link: https://bulkdata.uspto.gov/data/trademark/dailyxml/assignments/asb19550103-20211231-01.zip

I need to search for some data inside this file, but I can't read it directly into my Python program because of its large size.

I have tried extracting the data from this XML file using various methods, but it seems that either the file is not well-formed XML or there is some other issue. Also, I don't know exactly which columns are in this file because I can't open it. Here is the code I have tried:

     import xml.etree.ElementTree as ET
     import pandas as pd

     xml_data = open("/content/apc220822.xml", 'r').read()  # Read file
     root = ET.XML(xml_data)  # Parse XML

     data = []
     cols = []
     for i, child in enumerate(root):
         data.append([subchild.text for subchild in child])
         cols.append(child.tag)

     df = pd.DataFrame(data).T  # Write in DF and transpose it
     df.columns = cols  # Update column names
     print(df)

Required solution: I want to extract the data from this XML file and save it to a pandas DataFrame or a CSV file so that I can search it from Python. I have 64 GB of RAM available. Please suggest any other method, such as loading the XML data into a database, from which I can read it directly in Python and find the data I need.

1 Answer

In this answer I am going to use the bigxml library for parsing big XML files.

Disclaimer: I am the author of this library.

I don't know pandas, nor exactly what you are trying to do, but since you mentioned CSV and searching, I will show an example of how to get some data out of the XML file in a Pythonic way. It should be a good starting point for using that data however you want.


For future reference, here is what the XML file looks like:

<?xml version="1.0" encoding="utf-8"?>
<!-- redacted -->
<trademark-assignments>
    <!-- redacted -->
    <assignment-information>
        <assignment-entry>
            <assignment>
                <reel-no>1</reel-no>
                <frame-no>0001</frame-no>
                <last-update-date>19910716</last-update-date>
                <purge-indicator>N</purge-indicator>
                <date-recorded>19550103</date-recorded>
                <page-count>0</page-count>
                <correspondent>
                    <person-or-organization-name/>
                </correspondent>
                <conveyance-text>CHANGE OF NAME 19530513</conveyance-text>
            </assignment>
            <assignors>
                <assignor>
                    <person-or-organization-name>HAWK AND BUCK COMPANY, INC., THE</person-or-organization-name>
                    <city>DALLAS</city>
                    <state>TEXAS</state>
                    <execution-date>19530513</execution-date>
                    <legal-entity-text>UNKNOWN</legal-entity-text>
                </assignor>
            </assignors>
            <assignees>
                <assignee>
                    <person-or-organization-name>GRIFFIN, C. C., MANUFACTURING COMPANY</person-or-organization-name>
                    <city>FORT WORTH</city>
                    <state>TEXAS</state>
                    <legal-entity-text>UNKNOWN</legal-entity-text>
                </assignee>
            </assignees>
            <properties>
                <property>
                    <serial-no>71231446</serial-no>
                    <registration-no>218184</registration-no>
                </property>
                <property>
                    <serial-no>71538408</serial-no>
                    <registration-no>506247</registration-no>
                </property>
                <property>
                    <serial-no>71510081</serial-no>
                    <registration-no>509215</registration-no>
                </property>
            </properties>
        </assignment-entry>
        <!-- redacted -->
    </assignment-information>
</trademark-assignments>

Let's say that, for each assignment entry, you are interested in getting the city and state of the assignor, as well as all properties (serial-no and registration-no).

from dataclasses import dataclass, field
from pprint import pprint
from typing import List, Optional
from zipfile import ZipFile

from bigxml import Parser, xml_handle_element


@xml_handle_element("assignor")
@dataclass
class Assignor:
    city: str = "N/A"
    state: str = "N/A"

    @xml_handle_element("city")
    def handle_city(self, node):
        self.city = node.text

    @xml_handle_element("state")
    def handle_state(self, node):
        self.state = node.text


@xml_handle_element("property")
@dataclass
class Property:
    serial: Optional[int] = None
    registration: Optional[int] = None

    @xml_handle_element("serial-no")
    def handle_serial(self, node):
        self.serial = int(node.text)

    @xml_handle_element("registration-no")
    def handle_registration(self, node):
        self.registration = int(node.text)


@xml_handle_element("trademark-assignments", "assignment-information", "assignment-entry")
@dataclass
class AssignmentEntry:
    assignor: Optional[Assignor] = None
    properties: List[Property] = field(default_factory=list)

    @xml_handle_element("assignors")
    def handle_assignors(self, node):
        self.assignor = node.return_from(Assignor)

    @xml_handle_element("properties")
    def handle_properties(self, node):
        self.properties.extend(node.iter_from(Property))


with ZipFile("asb19550103-20211231-01.zip") as zip_file:
    with zip_file.open("asb19550103-20211231-01.xml") as xml_file:
        for assignment_entry in Parser(xml_file).iter_from(AssignmentEntry):
            pprint(assignment_entry)
            # do whatever you want with assignment_entry here

Running the above code will output all AssignmentEntry instances:

AssignmentEntry(assignor=Assignor(city='DALLAS', state='TEXAS'),
                properties=[Property(serial=71231446, registration=218184),
                            Property(serial=71538408, registration=506247),
                            Property(serial=71510081, registration=509215)])
AssignmentEntry(assignor=Assignor(city='JERSEY CITY', state='NEW JERSEY'),
                properties=[Property(serial=71230951, registration=217985),
                            Property(serial=71224781, registration=212380),
                            Property(serial=71255202, registration=243916),
                            Property(serial=71486386, registration=420259),
                            Property(serial=71515236, registration=434974),
                            Property(serial=71620823, registration=572309)])
AssignmentEntry(assignor=Assignor(city='SANTA BARBARA', state='CALIFORNIA'),
                properties=[Property(serial=71564699, registration=542581),
                            Property(serial=71564406, registration=578399)])
... etc. ...

I chose to use dataclasses to hold the data, but feel free to use another data representation.
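Since the question mentions CSV, here is a minimal sketch of what the "do whatever you want with assignment_entry" step could look like, assuming that one CSV row per property (with the assignor's city and state repeated) is the layout you are after. The helper names `entry_to_rows` and `write_csv` are made up for this illustration, and the `SimpleNamespace` object is only a stand-in with the same attributes as an `AssignmentEntry`, so that the snippet runs on its own.

```python
import csv
import io
from types import SimpleNamespace

FIELDS = ["city", "state", "serial", "registration"]


def entry_to_rows(entry):
    """Yield one flat dict per property of an assignment entry (hypothetical helper)."""
    assignor = entry.assignor
    city = assignor.city if assignor else "N/A"
    state = assignor.state if assignor else "N/A"
    for prop in entry.properties:
        yield {
            "city": city,
            "state": state,
            "serial": prop.serial,
            "registration": prop.registration,
        }


def write_csv(entries, out_file):
    """Write entries to out_file one row at a time and return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for entry in entries:
        for row in entry_to_rows(entry):
            writer.writerow(row)
            count += 1
    return count


# Stand-in for one parsed entry; real entries would come from the parser loop.
sample = SimpleNamespace(
    assignor=SimpleNamespace(city="DALLAS", state="TEXAS"),
    properties=[
        SimpleNamespace(serial=71231446, registration=218184),
        SimpleNamespace(serial=71538408, registration=506247),
    ],
)
buffer = io.StringIO()
rows_written = write_csv([sample], buffer)
print(buffer.getvalue())
```

With the real file, you would pass `Parser(xml_file).iter_from(AssignmentEntry)` as `entries` and a file opened with `open("out.csv", "w", newline="")` as `out_file`. Because rows are written as they are produced, the whole data set never has to sit in memory at once.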

For more information on the usage of the bigxml library, please refer to its documentation.
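The question also asks about loading the data into a database. As a rough sketch of that idea, the snippet below does not use bigxml at all: it swaps in the standard library's `xml.etree.ElementTree.iterparse` to stream the file and `sqlite3` to store the rows. The function name `load_into_sqlite`, the table name `properties` and its columns are assumptions made for this example, and only the first assignor of each entry is kept.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def to_int(text):
    """Convert element text to int, or None when the element is missing or empty."""
    return int(text) if text else None


def load_into_sqlite(xml_file, conn):
    """Stream assignment entries from xml_file into a SQLite table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties "
        "(city TEXT, state TEXT, serial INTEGER, registration INTEGER)"
    )
    count = 0
    # "end" events fire once an element and all of its children have been read.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != "assignment-entry":
            continue
        assignor = elem.find("assignors/assignor")  # first assignor only
        city = assignor.findtext("city") if assignor is not None else None
        state = assignor.findtext("state") if assignor is not None else None
        rows = [
            (
                city,
                state,
                to_int(prop.findtext("serial-no")),
                to_int(prop.findtext("registration-no")),
            )
            for prop in elem.iter("property")
        ]
        conn.executemany("INSERT INTO properties VALUES (?, ?, ?, ?)", rows)
        count += len(rows)
        # Drop the children of the finished entry; the emptied element itself
        # stays attached to its parent, so memory still grows, but slowly.
        elem.clear()
    conn.commit()
    return count


# Tiny in-memory sample with the same structure as the real file.
sample = b"""<trademark-assignments><assignment-information><assignment-entry>
<assignors><assignor><city>DALLAS</city><state>TEXAS</state></assignor></assignors>
<properties>
<property><serial-no>71231446</serial-no><registration-no>218184</registration-no></property>
<property><serial-no>71538408</serial-no><registration-no>506247</registration-no></property>
</properties>
</assignment-entry></assignment-information></trademark-assignments>"""

conn = sqlite3.connect(":memory:")
inserted = load_into_sqlite(io.BytesIO(sample), conn)
found = conn.execute(
    "SELECT city, registration FROM properties WHERE serial = ?", (71538408,)
).fetchall()
print(inserted, found)
```

For the real data, you would connect to a file on disk instead of `":memory:"` and pass the `xml_file` object opened from the zip archive. Adding an index on the columns you search most often would likely speed up lookups.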

rogdham