I am attempting to parse the USPTO bulk files hosted by Google. While doing so I came across the accompanying DTD files. From what I've read, a DTD defines the schema of the documents, so a parser can check whether a given XML file is valid against it. What I don't understand is how these files actually help me parse. I've seen several blog posts (1, 2) and this paper in which people use them, but I can't work out how to use them, or why I would.
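For what it's worth, my current understanding is that a DTD doesn't drive the parsing itself; it only lets you validate that a file conforms to the expected structure before you extract data from it. A minimal sketch of that with lxml (the element names here are a made-up fragment mimicking the USPTO structure, not the real DTD):

```python
from io import StringIO
from lxml import etree

# Hypothetical mini-DTD covering just the publication-reference piece.
dtd = etree.DTD(StringIO(
    "<!ELEMENT publication-reference (country, date, doc-number)>"
    "<!ELEMENT country (#PCDATA)>"
    "<!ELEMENT date (#PCDATA)>"
    "<!ELEMENT doc-number (#PCDATA)>"
))

root = etree.XML(
    "<publication-reference>"
    "<country>US</country>"
    "<date>20120101</date>"
    "<doc-number>1234567</doc-number>"
    "</publication-reference>"
)

# validate() returns True/False; dtd.error_log explains any failure.
print(dtd.validate(root))
```

So validation tells me the file is safe to extract from, but I'd still have to write the extraction code myself. Is that all the DTD is for, or am I missing something?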
My current approach to the parsing is just using Beautiful Soup to find the tags, but if there is a better or more efficient way I'd like to use it.
Here's a small example chunk of my current approach:
    from bs4 import BeautifulSoup

    def getRefInfo(ref):
        # Pull the country, date, and document number out of a reference element
        data = {}
        data["Country"] = ref.find("country").text
        data["Date"] = ref.find("date").text
        data["ID"] = ref.find("doc-number").text
        return data

    soup = BeautifulSoup(xml, 'lxml')
    bibData = soup.find("us-bibliographic-data-grant")
    ref = bibData.find("publication-reference")
    if ref is not None:
        print(getRefInfo(ref))