
I am attempting to parse the USPTO bulk files hosted by Google. In doing so, I ran across the DTD files. After some research, I understand that these files define the schema and can be used to check whether the XML is valid against it. What I don't understand is how they actually help me parse the files. I've seen several blog posts (1, 2) and this paper on how people use them, but I don't understand how to use them or why.
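
For reference, a DTD is essentially a grammar for the document: it declares which children each element may contain. A simplified fragment (not the actual USPTO DTD) might look like this:

<!ELEMENT publication-reference (document-id)>
<!ELEMENT document-id (country, doc-number, kind?, date?)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT doc-number (#PCDATA)>
<!ELEMENT date (#PCDATA)>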

My current approach to parsing is just using Beautiful Soup to find the tags, but if there is a better or more efficient way I'd like to use it.

Here's a small example chunk of my current approach:

from bs4 import BeautifulSoup

def getRefInfo(ref):
    """Extract the country, date, and document number from a reference element."""
    data = {}
    data["Country"] = ref.find("country").text
    data["Date"] = ref.find("date").text
    data["ID"] = ref.find("doc-number").text
    return data


soup = BeautifulSoup(xml, 'lxml')
bibData = soup.find("us-bibliographic-data-grant")

ref = bibData.find("publication-reference")
if ref is not None:
    print(getRefInfo(ref))

1 Answer


You use DTDs to verify that your input is good before sending it down the workflow pipeline. Since XML can arrive in fragments, validation is a mechanism to guarantee you never process a partial record (unless you really want to).
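
As a minimal sketch, lxml can validate a parsed tree against a DTD before any downstream processing. The file names (us-patent-grant.dtd, grant.xml) and the process() function are placeholders, not the real USPTO names:

from lxml import etree

# Load the DTD that ships alongside the bulk files (path is hypothetical).
with open("us-patent-grant.dtd", "rb") as f:
    dtd = etree.DTD(f)

tree = etree.parse("grant.xml")  # hypothetical single-grant document

if dtd.validate(tree):
    process(tree)            # hypothetical: hand off to the rest of the pipeline
else:
    print(dtd.error_log)     # partial or malformed record: reject it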

The difference really comes into play when you deal with pull parsers vs. DOM parsers.
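
To illustrate the pull-parser side, lxml's iterparse streams events so you never hold the whole document in memory; this sketch assumes the element names from the question and a hypothetical file name:

from lxml import etree

# Pull parsing: handle each publication-reference as it streams past,
# instead of building the entire DOM first.
for _, elem in etree.iterparse("grant.xml", tag="publication-reference"):
    doc_id = elem.find("document-id")
    print(doc_id.findtext("country"), doc_id.findtext("doc-number"))
    elem.clear()  # release the subtree; a DOM parser would keep it all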

DTDs can also be used to generate 'smart objects', where the XML you read is transformed into a tree of objects that have behaviors. This is an extremely advanced technique that is poorly supported by most Python modules, but it does exist (and is, in this author's opinion, an elegant solution for XML manipulation).
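
One concrete example of this style in Python is lxml.objectify, which maps elements onto attribute access. A minimal sketch, assuming the document structure from the question and a hypothetical file name:

from lxml import objectify

root = objectify.parse("grant.xml").getroot()

# Hyphenated tag names aren't valid Python identifiers, so use getattr.
bib = getattr(root, "us-bibliographic-data-grant")
ref = getattr(bib, "publication-reference")
doc_id = getattr(ref, "document-id")
print(doc_id.country, doc_id.date)  # child elements behave like attributes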
