0

Could someone recommend the best data structure for the FinalResults described below:

I'm extracting various pieces of information from XML documents. Roughly, here's what I do: First use find_all to locate the text elements that contain a keyword. Then for each result:

  1. get the parent tag for the text element
  2. get an attribute of that parent, and
  3. search the contents of the text element for additional text using regex.

This last search yields a result with up to 6 match groups.

This whole operation could end up returning something like this:

FinalResult 1: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]

FinalResult 2: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]

There is no maximum number of FinalResults that I might get. But on average I expect fewer than 10 from each XML doc. I plan to use each FinalResult for other processing but won't be changing or adding anything in the FinalResults. For example I might say: go back to the <parent> with attribute XYZ and get other data, then go get a file by the name of match.group(2) from elsewhere.

I'll probably be accessing each FinalResult only a few times. If it matters, some of the match.groups could be "None"

Here's an example. Assume this is FinalResult[0]: ['paragraph', '39871234', '42', '103', 'b', '1', None, None]

Paragraph would be the parent tag of the tag containing the keywords I found. 39871234 would be the id attribute of the paragraph tag 42 indicates a volume number 103 is a section in that volume b and 1 are subdivisions of that section

I would use 42/103/b/1 to extract info from another xml file. Paragraph and the id would be used in case I need to tell one keyword search result from another because the file will have multiple text elements. (Ex. Paragraph id=39871234 text [string containing keyword] )

My question is should I store all the FinalResults as a dictionary, a list, a tuple, or something else?

OutThere
  • 467
  • 2
  • 8
  • 19
  • XML is naturally a tree. And python has [`ElementTree` class](https://docs.python.org/2/library/xml.etree.elementtree.html) – OneCricketeer Feb 07 '16 at 19:37
  • 2
    Additionally, [don't use regex on XML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – OneCricketeer Feb 07 '16 at 19:41
  • Please shed a bit more light on the semantics of the information you want to store in your data structure. – Genti Saliu Feb 07 '16 at 19:44
  • 2
    Cricket, your first comment doesn't address the question. And regarding your second, I'm not using regex on XML. I'm using it on text extracted from a node. Regardless, neither of your comments are valid reasons to vote down the question. – OutThere Feb 08 '16 at 00:56
  • Genti, am I correct that you're asking for a sample of the information I'll be storing? if so, I've added one to the question. – OutThere Feb 08 '16 at 01:00
  • _"go back to the with attribute XYZ and get other data"_ if you use an `lxml` document, its xpath implementation would make that easy. – tdelaney Feb 08 '16 at 06:06

1 Answers1

1

A real data structure recommendation question would have some actual requirements for what the data structure should do, or help you achieve. In the absence of any such information in your question, I guess the simple and straightforward answer you are looking for is this:

In any modern object-oriented language, the standard way to represent a collection of related attributes is to create a simple class with getter and maybe setter methods (except if the objects are immutable after creation, in which the only way to set an attribute is when you first instantiate its containing object).

Your example suggests a class with attribute(), parent_attribute(), and matches() methods, where the first two would apparently return simple strings, and the last, a list of strings. Your main program would probably have one or more lists of these objects, or perhaps a dictionary keyed on a feature you want to use for accessing previous objects (the identifying attribute?)

class Match (object):
    def __init__ (self, attrib, parent_attr, matches):
        self.attrib, self.parent_attr, self.matches = attrib, parent_attr, matches

    def attribute (self):
        return self.attrib

    def parent_attribute (self):
        return self.parent_attr

    def matches (self):
        return self.matches

The benefit over a list should be immediately obvious: instead of match[0] your code says match.attribute() which immediately conveys what is going on.

The benefit over a dict is less obvious, but is frequent enough in practice that you want to be prepared for it: When you want to refactor the code, changing the class implementation is a much simpler task than changing every place your code uses one of these instances.

So for example, if for some odd reason you realize you wanted to use a list after using the class for a while, you would only have to change the initialization code and the getters, rather than every piece of code which manipulates these instances; and the details of the list implementation behind the scenes would be completely transparent to any code using this class.

There are many additional benefits of modular design; find a good intro to OOP if you want more details.

If the performance of this design is not satisfactory, a new question with some actual requirements (speed, memory, etc) might be the way to go.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • I'm new at this. What kind of actual requirements are you thinking of? I have no idea what kind of speed and memory would be appropriate for my task. I assume "fast" and "as little as possible". But I know that's not much of an answer – OutThere Feb 08 '16 at 01:07
  • It's like "what car should I buy for a family of six". If you could say that one or more of low gas consumption, hybrid engine, wheelchair accessibility, four wheel drive, speed, or a number of other factors were more important than some others, that would reduce the number of viable options, but without any such constraints, it's pretty much "any one you like", including minibus models. At least the "family" requirement excludes long-haul trucks and motorcycles. – tripleee Feb 08 '16 at 05:48
  • Anyway, the key here is really "if this is good enough, don't complicate matters". Actual requirements would surface if you find that your program is too slow or uses too much memory to be practical, or if your code is convoluted because the data model came out wrong for your needs. – tripleee Feb 08 '16 at 06:01