2

Disclaimer:

  1. I might be asking this question "the wrong way" - I'm still new to programming, so I might be lacking the correct terminology and understanding to ask the question properly. Sorry about that if I am.
  2. I know you can't parse [X]HTML with regex. I'm not parsing it. I'm just matching a string to get the count of that string's line.
  3. I've tried lxml and iter(), as well as iterparse(), and I've tried it this way and this way. It didn't work for me. I'm doing it this way.
  4. I know the code here isn't efficient (specifically compiling the regex expression each time the function is called) - I've re-written it like this to provide the "minimum needed code" to ask the question, plus it makes it a little easier to understand what I'm trying to achieve I think.

With that all out the way; this code works and does what I want it to do, which is basically to take a huge set of xml records and turning it into a dictionary in the form of { record_id : {element_tag : element_val, element_tag : element_val, ...} }.

Since the record_id is nested within the record itself, I first use the record_splitter to identify the count of <RECORD> and </RECORD> (the beg and end). I pass beg and end into dic_factory which finds these positions, does something with the elements nested within that, then sets 'beg' to end +1, and passes it back into the record_splitter function.

My question is whether there is a better way to implement this loop between functions, preferably under if __name__ == '__main__', so that I can move more of my constants (e.g. the regex statements) under that as well.

Code:
    # stored as a text file, but here for clarity

    list_of_lines = """
    <RECORD>
      <TITLE>MISS</TITLE>
      <NAME>ELIZABETH</NAME>
      <SURNAME>II</SURNAME>
      <ADDRESS1>1 BUCKINGHAM PALACE</ADDRESS1>
      <ADDRESS2>LONDON</ADDRESS2>
      <ADDRESS3>GREATER LONDON</ADDRESS3>
      <POST_CODE>W1 11A</POST_CODE>
      <CASE_NUM>Q1QQ1234</CASE_NUM>
      <ID>32145698</ID>
      <LAST_UPDATE_DATE>2016-12-12</LAST_UPDATE_DATE>
     </RECORD>
     <RECORD>
      <TITLE>MR</TITLE>
      <NAME>PRINCE</NAME>
      <SURNAME>PHILLIP</SURNAME>
      <ADDRESS1>1 BUCKINGHAM PALACE</ADDRESS1>
      <ADDRESS2>LONDON</ADDRESS2>
      <ADDRESS3>GREATER LONDON</ADDRESS3>
      <POST_CODE>W1 11A</POST_CODE>
      <CASE_NUM>K5KK4321</CASE_NUM>
      <ID>56987412</ID>
      <LAST_UPDATE_DATE>2017-01-16</LAST_UPDATE_DATE>
     </RECORD>
     <RECORD>
    """

class recordManager:

    def __init__(self):
        self.r_location = "list_of_lines.txt"

    def record_splitter(self, beg):

        re_beg_spl = re.compile(".*<RECORD>")
        re_end_spl = re.compile(".*(<\\/RECORD>)")

        end = None

        for count, line in enumerate( open(self.r_location) ):
            if count > beg:
                if re_end_spl.match(line):
                    end = count

                    if not re_end_spl.match(line):
                        if re_beg_spl.match(line):
                            beg = count
                    else:
                        break

                    recordManager.dic_factory(self, beg, end)


    def dic_factory(self, beg, end):

        re_casenum = re.compile(".*<CASE_NUM>(.*)<\\/CASE_NUM>")
        re_tag_val = re.compile(".*<(\\w*)>(.*)<.*")

        id_ = None
        tags = []
        vals = []

        for count, line in enumerate( open(self.r_location) ):

            if beg < count < end:
                if re_casenum.match(line):
                    m = re_casenum.match(line)
                    id_ = m.group(1)

                if re_tag_val.match(line):
                    m = re_tag_val.match(line)
                    tags.append( m.group(1) )
                    vals.append( m.group(2) )

        beg = end +1
        print {id_ : dict(zip(tags, vals)) }
        # {32145698 : {'POST_CODE': 'W1 11A', 'SURNAME': 'II', 'NAME': 'ELIZABETH', 'TITLE': 'MISS', 'ADDRESS1': '1 BUCKINGHAM PALACE', 'ADDRESS2': 'LONDON', 'ADDRESS3': 'GREATER LONDON', 'RECORD_TYPE': '1', 'CASE_NUM': 'Q1QQ1234', 'LAST_UPDATE_DATE': '2016-12-12', 'ID': '32145698'}}

        self.record_splitter(beg)


if __name__ == '__main__':
    inst_fol = record_manager(file)
    recordManager.record_splitter(inst_folder, 0)

My problem is that I don't know how to loop 'beg' back into record_splitter outside of the functions / from main:

if __name__ == '__main__':
    inst_fol = record_manager(file)
    beg, end = recordManager.record_splitter(inst_folder, 0)

Is this "looping from within functions" and if not, what's the better approach?

Aran-Fey
  • 39,665
  • 11
  • 104
  • 149
ron_g
  • 1,474
  • 2
  • 21
  • 39
  • To clarify: are you asking how to get the variable `beg` back into the the function `record_splitter` *after* the initial call in `__main__` ? – Moshe Bildner May 23 '18 at 14:39
  • That is correct - I want it to do this until it fails (i.e. until it has processed all of the lines within `list_of_lines.txt`) – ron_g May 23 '18 at 14:41
  • 1
    With your end goal of putting xml into dict, you might check out: https://pypi.org/project/xmltodict/ , https://stackoverflow.com/questions/2148119/how-to-convert-an-xml-string-to-a-dictionary-in-python , http://docs.python-guide.org/en/latest/scenarios/xml/ – Ywapom May 23 '18 at 14:56
  • Don't parse xml like a string! Learn how to use lxml or xml.etree to parse it and then build your dict. [See here for some ideas](https://stackoverflow.com/questions/7684333/converting-xml-to-dictionary-using-elementtree/10076823) – Igl3 May 23 '18 at 14:57
  • @Ywapom that looks interesting, I will give it a shot. As long as it doesn't try and load the whole file into memory it might work. Thanks. @Igle - I've given lxml a shot (see point 3 of the disclaimer). It loads the whole file into memory. Didn't work for me. Tried `iterparse()` with `clear()`. Also didn't work for me. – ron_g May 23 '18 at 15:15

2 Answers2

1

tl;dr: replace beg with self.beg

My recommendation would be to make beg a field on your class. Remember, each method you are using has access to self, which is part of the advantage to organizing this work into a class in the first place:

so __init__ becomes:

    def __init__(self):
    self.r_location = "list_of_lines.txt"
    self.beg = 0

Now everywhere where you use beg you refer to it as self.beg.

For example:

        for count, line in enumerate( open(self.r_location) ):
        if count > self.beg:
            if re_end_spl.match(line):
                end = count

                if not re_end_spl.match(line):
                    if re_beg_spl.match(line):
                        self.beg = count
                else:
                    break

                recordManager.dic_factory(self, self.beg, end)

Note that there is further opportunity for consolidation: dic_factory takes beg as an argument, but it too has access to the object self, and could simply read beg off of that field.

Moshe Bildner
  • 461
  • 5
  • 9
  • Let me try this and come back to you. I am sure I've tried something of the like earlier, but there was an issue with changing the value of [self.beg](https://stackoverflow.com/questions/14671218/python-class-methods-changing-self) – ron_g May 23 '18 at 15:04
1

I'm going to suggest you use nested loops on the same iterator instead of separate functions. (details in comments)..

class recordManager:

    def __init__(self):
        self.r_location = "list_of_lines.txt"
        self.records = [] #perhaps we want to store all the records we find?

    def find_records(self):

        re_beg_spl = re.compile(".*<RECORD>")
        re_end_spl = re.compile(".*(<\\/RECORD>)")
        re_casenum = re.compile(".*<CASE_NUM>(.*)<\\/CASE_NUM>") #move these up here since we're rolling the two functions into one
        re_tag_val = re.compile(".*<(\\w*)>(.*)<.*")

        with open(self.r_location) as f: #use the "with open() as f:" idiom to ensure file gets closed when you're done with it
            for line in f: #iterating over an open file defaults to line by line and we won't need line number with the nested loop approach
                if re_beg_spl.match(line): #start a new record
                    tags = {} #container for element tags and values
                    case = '' #case number for this record
                    for line in f: #re-use iterator for inner loop so lines are consumed (outer loop will pick up where inner loop leaves off)
                        match_case = re_casenum.match(line)
                        match_elem = re_tag_val.match(line)
                        match_end = re_end_spl.match(line)
                        if match_case:
                            case = match_case.group(1) #are match groups 1 or 0 indexed? I didn't look it up..
                        elif match_elem:
                            tags[match_elem.group(1)] = match_elem.group(2)
                        elif match_end: #end of a record -- save info and break out of inner loop
                            caseinfo = {case: tags}
                            print(caseinfo)
                            self.records.append(caseinfo)
                            break

Using this approach limits your need for effectively global variables (using an attribute of an object to pass state) which enforces more structure on your program (and keeps it simpler). This also eliminates re-looping over lines you've already examined, and re-reading files multiple times (which may be your speed limiting factor if you have a mechanical hard drive). I have not included checks for any possible break in the structure (missing end tag or id tag for example) but they could be added in a fairly straightforward manner.

Aaron
  • 10,133
  • 1
  • 24
  • 40
  • Damn that's fast, elegant solution, and I've learnt something new (nesting loops). Thank you very much. Would you mind explaining this line please - `tags[match_elem.group(1)] = match_elem.group(2)`? – ron_g May 24 '18 at 08:24
  • 1
    @rong [Here's](http://www.pythonforbeginners.com/dictionary/how-to-use-dictionaries-in-python) a quick primer on python dictionaries. `tags` starts out as an empty dictionary each time a new record is started: `tags = {}`. I then use the first group of your `tag_val` regex as the key and the second group as the value for a new dictionary entry. Basically I'm incrementally adding values to the output dictionary rather than collecting the keys and values separately then creating the dict at the end with `dict(zip(tags, vals))` as you did in your first example. – Aaron May 24 '18 at 13:59