Could someone recommend the best data structure for the FinalResults described below?
I'm extracting various pieces of information from XML documents. Roughly, here's what I do: first, use find_all to locate the text elements that contain a keyword. Then, for each result:
- get the parent tag for the text element,
- get an attribute of that parent, and
- search the contents of the text element for additional text using regex.
This last search yields a result with up to 6 match groups.
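To make the steps concrete, here is a minimal sketch of the process. It is not my actual code: it uses the standard library's xml.etree.ElementTree instead of find_all so that it runs without extra installs, and the sample XML, the regex, and the function name collect_results are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names and the id attribute are assumptions.
SAMPLE_XML = """
<doc>
  <paragraph id="39871234"><text>keyword: see vol 42 sec 103 b 1</text></paragraph>
  <paragraph id="555"><text>nothing relevant here</text></paragraph>
</doc>
"""

# Hypothetical pattern with 6 groups; the last four are optional and may be None.
PATTERN = re.compile(
    r"vol (\d+) sec (\d+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?"
)


def collect_results(xml_text, keyword):
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for elem in root.iter("text"):
        content = elem.text or ""
        if keyword not in content:
            continue
        match = PATTERN.search(content)
        if match is None:
            continue
        parent = parent_of[elem]
        # One row per hit: parent tag, parent attribute, then the 6 match groups.
        results.append([parent.tag, parent.get("id"), *match.groups()])
    return results


final_results = collect_results(SAMPLE_XML, "keyword")
```

Each entry in final_results is one row of the form described in this question: parent tag, parent attribute, then the six match groups.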
This whole operation could end up returning something like this:
FinalResult 1: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]
FinalResult 2: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]
There is no maximum number of FinalResults that I might get, but on average I expect fewer than 10 from each XML doc. I plan to use each FinalResult for other processing but won't be changing or adding anything in the FinalResults. For example, I might say: go back to the <parent> with attribute XYZ and get other data, then go get a file by the name of match.group(2) from elsewhere.
I'll probably be accessing each FinalResult only a few times. If it matters, some of the match groups could be None.
Here's an example. Assume this is FinalResult[0]: ['paragraph', '39871234', '42', '103', 'b', '1', None, None]
- paragraph is the parent tag of the tag containing the keywords I found.
- 39871234 is the id attribute of the paragraph tag.
- 42 is a volume number.
- 103 is a section in that volume.
- b and 1 are subdivisions of that section.
I would use 42/103/b/1 to extract info from another xml file.
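As a rough illustration of that lookup key, assuming the result is held as a plain list like the example above, the path could be built by joining the match groups that are not None. The helper name lookup_path is hypothetical.

```python
# Example row copied from above: parent tag, id attribute, then 6 match groups.
final_result = ['paragraph', '39871234', '42', '103', 'b', '1', None, None]


def lookup_path(result):
    # Skip the first two fields (parent tag and attribute) and drop unmatched groups.
    return "/".join(group for group in result[2:] if group is not None)


path = lookup_path(final_result)
```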
The paragraph tag and the id would be used in case I need to tell one keyword search result from another, because the file will have multiple text elements (e.g. paragraph id=39871234 containing a text element with [string containing keyword]).
My question is: should I store all the FinalResults as a dictionary, a list, a tuple, or something else?