
I have created a class that keeps track of all of its instances, and I need to parallelize the instantiation, but I cannot solve the problem of sharing the class itself as a class object between processes. Is this possible in Python 2.7 using multiprocessing?

import csv
import logging
import os
from collections import defaultdict
from xml.etree import ElementTree

# argparse setup (args) and get_name_defined_files() are omitted here

OUTPUT_HEADINGS = []
class MyContainer(object):
    """
    """
    instances = []
    children = []

    @classmethod
    def export_to_csv(cls):
        with open(args.output, "w") as output_file:
            f_csv = csv.DictWriter(output_file, fieldnames=OUTPUT_HEADINGS)
            f_csv.writeheader()
            for instance in cls.instances:
                f_csv.writerow(instance.to_dict())

    def __new__(cls, dat_file):
        try:
            tree = ElementTree.parse(dat_file)
            cls.children = tree.findall("parent_element/child_element")
        except ElementTree.ParseError as err:
            logging.exception(err)

        if not cls.children:
            msg = ("{}: No \"parent_element/child_element\""
                   " element found".format(os.path.basename(dat_file)))
            logging.warning(msg)
            cls.children = []
            return False
        else:
            instance = super(MyContainer, cls).__new__(cls)
            # __new__ returns a bool rather than the instance, so Python
            # will not call __init__ automatically; call it explicitly
            instance.__init__(dat_file)
            cls.instances.append(instance)
            cls.children = []
            return True

    def __init__(self, dat_file):
        self._name = os.path.basename(dat_file)
        self.attr_value_sum = defaultdict(list)

        var1 = MyContainer.children[0].find("var1")
        var2 = MyContainer.children[0].get("var2")
        cat_name = "{}.{}".format(var1, var2)

        if cat_name not in OUTPUT_HEADINGS:
            OUTPUT_HEADINGS.append(cat_name)
        # processing and summarizing of xml data

    def to_dict(self):
        return output_dict  # built during the processing step omitted above

def main():
    i = 0
    try:
        for f in FILE_LIST:
            i += 1
            print "{}/{}: {} in progress...".format(i, len(FILE_LIST), f)
            print "{}".format("...DONE" if MyContainer(f) else "...SKIPPED")
    except Exception as err:
        logging.exception(err)
    finally:
        MyContainer.export_to_csv()

if __name__ == '__main__':
    FILE_LIST = []
    for d in args.dirs:
        FILE_LIST.extend(get_name_defined_files(dir_path=d,
                                                pattern=args.filename,
                                                recursive=args.recursive))
    main()

I tried to use multiprocessing.managers.BaseManager to create a proxy for the MyContainer class, but that way I can only create an instance object. What I actually want to parallelize is the MyContainer(dat_file) call itself.
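One common way around this, rather than proxying the class, is to keep MyContainer out of the workers entirely: have a module-level function do the parsing and return a plain, picklable dict, and let the parent process own all shared state. A minimal sketch of that shape (parse_file and the file names are placeholders, not the question's real helpers):

```python
import multiprocessing

def parse_file(dat_file):
    # Stand-in for the real work: parse the XML and summarize it into a
    # plain dict -- dicts pickle cleanly between processes, while class
    # instances tied to shared class-level state do not carry it across.
    return {"name": dat_file}

def run(file_list):
    # Workers only parse; the parent keeps all shared state
    # (rows, OUTPUT_HEADINGS, the instance list).
    pool = multiprocessing.Pool()  # defaults to cpu_count() workers
    try:
        results = pool.map(parse_file, file_list)
    finally:
        pool.close()
        pool.join()
    return [r for r in results if r is not None]

if __name__ == "__main__":
    # The __main__ guard is required on Windows, where child processes
    # re-import this module instead of forking
    rows = run(["a.xml", "b.xml"])
    print(len(rows))
```

With this shape, export_to_csv stays in the parent and simply iterates over the returned dicts instead of over cls.instances.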

Petr Krampl
  • Why did you choose `multiprocessing` over `threading`? – wwii May 25 '18 at 14:56
  • @wwii: Because I need to process a really big amount of data in a multicore environment on Windows. – Petr Krampl May 25 '18 at 14:58
  • You think (or know) that your *process* is processor bound rather than i/o bound? Are you sure the bottleneck is processing the data and not getting the data from the file system? – wwii May 25 '18 at 15:26
  • Are you actually doing anything else with the instances that are created or are you using `MyContainer` to *compose* the tasks of parsing, processing/summarizing, and writing to csv? – wwii May 25 '18 at 15:48
  • @wwii: The main purpose of MyContainer is to prepare records for a CSV file, summarizing values and occurrences of defined elements and possibly their attributes. The other purpose is validation of the XML files. So first comes creation/initialization of an object representing each file, and then export to CSV. I know I can use procedures instead of objects, or a combination of both; I tried to make it more object-oriented ;). This is just general code without details. – Petr Krampl May 26 '18 at 19:29
  • @wwii: Actually, the bottlenecks are both, but the processing XML file really takes some time. They can be MBs large. Filling the FILE_LIST takes minutes (about two) for 90k files. – Petr Krampl May 26 '18 at 19:30
  • The way you have designed `MyContainer` as a self-contained machine I don't think will work in separate processes - the instances won't have a reference to the base, so the two class attributes are only *in scope* within each process. You probably need to deconstruct `MyContainer`: put the processing and i/o tasks in functions; use threads to retrieve the raw data, and multiprocessing to process the data. – wwii May 26 '18 at 20:03
  • I have never done anything like this, just read a lot about it. It sounds complicated, with lots of data being passed around; you would want to include an identifier, maybe the filename, in the payload passed to the processes/threads and in the results returned. To experiment, I made something similar to `MyContainer`, and it behaves well with threads because all the threads are in the same process and can share state - so you could use it with threads, but threads won't help with the processor-bound tasks. – wwii May 26 '18 at 20:10
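wwii's point about class attributes being per-process can be demonstrated with a small sketch (Registry and worker are illustrative names, not part of the question's code):

```python
import multiprocessing

class Registry(object):
    # Mimics MyContainer's class-level state
    instances = []

def worker(name):
    # Each worker process mutates its *own* copy of Registry.instances
    Registry.instances.append(name)
    return name

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    try:
        pool.map(worker, ["a", "b", "c"])
    finally:
        pool.close()
        pool.join()
    # The parent's copy was never touched by the children:
    print(Registry.instances)  # prints []
```

This is why appending to cls.instances inside worker processes never shows up in the parent, regardless of whether the children are forked or spawned.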

0 Answers