
I have a general question about deeply nested statements. When nesting gets "complicated" (more than 3 or 4 levels deep), what is a better approach, especially when iterating AND using if-statements?

I have a lot of files, some in sub-directories and others in the root directory. There are a number of directories from which I want to extract datasets and append them to a target dataset (the master).

for special_directory in directorylist:
    for dataset in special_directory:
        if dataset in list_of_wanted:
            some_code
            if it_already_exists:
                for feature_class in dataset:
                    if feature_class in list_of_wanted:

and then I really get into the meat of the code processing. Frankly, I can't think of a way to avoid these nested conditional and looping statements. Is there something I am missing? Should I be using "while" instead of "for"?
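For context, one common way to keep this kind of structure readable is to invert the conditions into guard clauses (continue early) and move the inner work into a helper function. A rough sketch against the placeholder names above (process_feature_class is a hypothetical helper, and some_code stands in for the elided steps):

def process_feature_class(feature_class):
    # the real processing ("the meat of the code") would live here
    ...

for special_directory in directorylist:
    for dataset in special_directory:
        if dataset not in list_of_wanted:
            continue                      # guard clause: skip early instead of nesting deeper
        # some_code
        if not it_already_exists:
            continue
        for feature_class in dataset:
            if feature_class in list_of_wanted:
                process_feature_class(feature_class)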

My actual (specific) code works. It just doesn't move very quickly. It is iterating over 27 databases to append the contents of each to a new target database. My Python script has been running for 36 hours and is only through 4 of the 27. Tips?

I posted this on the GIS Stack Exchange, but my question is really too general to belong there: question and more specific code

Any tips? What are best practices in this regard? This is already a subset of the code: it looks for datasets, and for feature classes within them, within geodatabases from a list generated by another script. A third script looks for feature classes stored directly in geodatabases (i.e. not within datasets).

ds_wanted = ["Hydrography"]
fc_wanted = ["NHDArea","NHDFlowline","NHDLine","NHDWaterbody"]

for item in gdblist:
env.workspace = item
for dsC in arcpy.ListDatasets():
    if dsC in ds_wanted:
        secondFD = os.path.join(gdb,dsC)
        if arcpy.Exists(secondFD):
            print (secondFD + " exists, not copying".format(dsC))
            for fcC in arcpy.ListFeatureClasses(feature_dataset=dsC):
               if fcC in fc_wanted:
                   secondFC2 = os.path.join(gdb,dsC, fcC)
                   if arcpy.Exists(secondFC2):
                       targetd2 = os.path.join(gdb,dsC,fcC)
                   # Create FieldMappings object and load the target dataset
                   #
                       print("Now begin field mapping!")
                       print("from {} to {}").format(item, gdb)
                       print("The target is " + targetd2)
                       fieldmappings = arcpy.FieldMappings()
                       fieldmappings.addTable(targetd2)

                       # Loop through each field in the input dataset
                       #

                       inputfields = [field.name for field in arcpy.ListFields(fcC) if not field.required]
                       for inputfield in inputfields:
                       # Iterate through each FieldMap in the FieldMappings
                           for i in range(fieldmappings.fieldCount):
                               fieldmap = fieldmappings.getFieldMap(i)
                    # If the field name from the target dataset matches to a validated input field name
                               if fieldmap.getInputFieldName(0) == inputfield.replace(" ", "_"):
                        # Add the input field to the FieldMap and replace the old FieldMap with the new
                                   fieldmap.addInputField(fcC, inputfield)
                                   fieldmappings.replaceFieldMap(i, fieldmap)
                                   break
                   # Perform the Append
                   #
                       print("Appending stuff...")
                       arcpy.management.Append(fcC, targetd2, "NO_TEST", fieldmappings)
                   else:
                       arcpy.Copy_management(fcC, secondFC2)
                       print("Copied " +fcC+ "into " +gdb)
               else:
                   pass

        else:
            arcpy.Copy_management(dsC,secondFD) # Copies feature class from first gdb to second gdb
            print "Copied "+ dsC +" into " + gdb
    else:
        pass
        print "{} does not need to be copied to DGDB".format(dsC)

print("Done with datasets and the feature classes within them.")

It seems to really get caught on arcpy.management.Append. I have some fair experience with this function, and even allowing that this is a larger-than-typical table schema (more records, more fields), a single append taking 12+ hours seems excessive. To build on my original question: could this be because it is so deeply nested? Or is this not the case, and the data simply requires that much time to process?
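One way to check this directly is to time the Append call by itself; if a single call accounts for the hours on its own, the nesting is not the culprit. A minimal sketch, using the same variables as in the code above:

import time

start = time.time()
arcpy.management.Append(fcC, targetd2, "NO_TEST", fieldmappings)
elapsed = time.time() - start
print("Append of {} took {:.1f} minutes".format(fcC, elapsed / 60.0))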

Kevin
  • why don't you break it up into functions? – Chris_Rands Feb 21 '17 at 23:04
  • I am sorry you were pointed to here, perhaps you should've been [redirected to Code Review](http://meta.stackoverflow.com/questions/253975/be-careful-when-recommending-code-review-to-askers) in the first place, since you had your full and working code in the original post. – iled Feb 21 '17 at 23:08
  • Doesn't look worth the bother to me, I'd be looking to break it up for say processing in parallel, given how big the task is. – Tony Hopkinson Feb 21 '17 at 23:17
  • Is the question about performance or about the if-statements? If it's about performance then you should profile your function, find the bottlenecks, find out why they are bottlenecks (wrong data structure, unnecessary calls?). If it's about the nesting of `if`-statements: Show what code is inside the `if`-statements, otherwise we can't know if it can be reduced, how it could be reduced, .... As the question stands any answer would need to guess and that makes it practically unanswerable. – MSeifert Feb 21 '17 at 23:23
  • What is the aggregate size of the data? If your dataset is > ~1-100GB I wouldn't be at all surprised if it's taking days to run. And that varies largely by the computational requirements on the data. If you aren't bottlenecked by disk I/O, it may be time to write a c extension or a c program to do the number crunching you want. Or use an actual DB tool to do the merge. – TemporalWolf Feb 21 '17 at 23:54
  • I added the code, above. It uses a lot of arcpy (GIS) tools, but their function is pretty obvious from their names if you're not immediately familiar – Kevin Feb 22 '17 at 00:02
  • @Chris_Rands I originally had it broken up into functions, but I thought it would be harder to run each of those if-statements sequentially for different functions than to embed them in a large function. This is, however, the crux of my question – Kevin Feb 22 '17 at 00:05
  • This will be a negligible speed increase, but you're doing a lot of membership tests over lists which is O(n). If you don't care about the order of the elements of those lists (and it doesn't seem like you do), you could turn them into sets which have O(1) membership tests. – Adam Smith Feb 22 '17 at 00:06
  • Also is `if fieldmap.getInputFieldName(0) == inputfield.replace(" ", "_")` just testing to see if there's a space in `getInputFieldName(0)`? If so, you can use `if " " in fieldmap.getInputFieldName(0)` which will be slightly faster (still O(n), but it skips the `inputfield.replace` transform) – Adam Smith Feb 22 '17 at 00:08
  • As @TonyHopkinson mentions, any improvements in the areas above will be negligible over proper profiling and parallelisation, though. Python is a slow language. If you're doing something many times, or doing something with some really heavy lifting, you're often better off spending more time writing in a language better-suited for that heavy lifting. – Adam Smith Feb 22 '17 at 00:16
  • @Adam Smith Great, simple idea for breaking up the lists. That's something I can approach. – Kevin Feb 22 '17 at 16:07
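The sets Adam Smith mentions would simply replace the two lookup lists (same names as in the posted code); membership tests with `in` against a set are O(1) instead of O(n), though as noted the gain here is negligible:

ds_wanted = {"Hydrography"}
fc_wanted = {"NHDArea", "NHDFlowline", "NHDLine", "NHDWaterbody"}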

1 Answer


Some good comments in response to your question. I have limited experience with multiprocessing, but getting all of your computer's cores working will often speed things up. If you have a four-core processor that is only running at around 25% during script execution, then you can potentially benefit. You just need to be careful how you apply it, in case one thing always needs to happen before another. If you are working with file geodatabases rather than enterprise gdb's, then your bottleneck may be the disk; if the gdb is remote, network speed may be the issue. Either way, multiprocessing won't help with those. Resource Monitor on Windows will give you a general idea of how much processor/disk/RAM/network is being utilized.

I recently ran a similar script using rpy2 and data from/to PostGIS. It still took ~30 hours to run, but that is much better than 100. I haven't used multiprocessing in Arc yet (I mostly work in open source), but I know people who have.

A very simple implementation of multiprocessing:

from multiprocessing import Pool

def multi_run_wrapper(args):
    """Helper function to unpack argument lists during multiprocessing.
    Modified from: http://stackoverflow.com/a/21130146/4062147"""
    return gdb_append(*args)  # the * unpacks the argument list

def gdb_append(gdb_id):
    ...

# script starts here #

gdblist = [......]

if __name__ == '__main__':
    p = Pool()
    p.map(multi_run_wrapper, gdblist)

    print("Script Complete")

Normally you would join the results of the pool, but since you are using this to execute tasks rather than to collect return values, I'm not sure that is necessary. Somebody else may be able to chime in on best practice.
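For reference, a minimal pattern for waiting on the workers and collecting their results looks like the sketch below. The gdb_append body and the paths are placeholders, not working ArcPy code; the point is only the close()/join() bookkeeping and the list returned by map():

from multiprocessing import Pool

def gdb_append(gdb_path):
    # hypothetical worker: process one geodatabase and report a status back
    return "finished {}".format(gdb_path)

if __name__ == '__main__':
    gdblist = ["a.gdb", "b.gdb", "c.gdb"]  # placeholder paths
    p = Pool()
    results = p.map(gdb_append, gdblist)   # blocks until every worker has finished
    p.close()                              # no more work will be submitted
    p.join()                               # wait for the worker processes to exit
    print(results)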

Nate Wanner
  • Thanks for the hint @Nate Wanner about how to begin to multi-process. It's your perspective then, that the code itself isn't necessarily "inefficient" for the task, but that revising it to fit within a multi-wrapper or breaking up the task into multiple runs is the answer – Kevin Feb 22 '17 at 18:58
  • @Kevin My intent was just to give you a starting point on the multiprocessing. Your script sounds slow to me, but then I'm not really sure what size datasets you have or how they are stored and accessed. I suspect you may be using a tool that copies features one at a time when you could possibly use a tool to copy gdb tables en mass, but that is just hypothesis on my part and I haven't taken the time to brush off ArcPy. Paul Zandbergen's book on Python scripting in Arc was good if you don't have it, yet. – Nate Wanner Feb 22 '17 at 20:43
  • @Kevin A couple untested thoughts: Presuming that the databases you are merging are the same schema, but for different areas, can you avoid the fieldmapping? It seems like that could potentially slow down the Append function. You could also have a big performance hit if the spatial reference changes. You may be better merging into one spatial reference during this script and then changing feature classes en mass, rather than reprojecting within Append. – Nate Wanner Feb 22 '17 at 23:57