0

I have a very large csv file which looks like this:

Column1;Column2
01;BE
02;ED
12;FD
14;DS
03;ED
04;DF

Now I want to read this csv and depending on certain criteria I would like to export it to different multiple csv files.

My code is as follows:

import csv
output_path=r'C:\myfolder\large_file.csv'

with open(os.path.join(os.path.dirname(output_path),"first_subset_total.csv"), "w", encoding="utf-8", newline='') as \
out_01, open(os.path.join(os.path.dirname(output_path),"excluded_first.csv"), "w", encoding="utf-8", newline='') as \
out_02, open(os.path.join(os.path.dirname(output_path),"pure_subset.csv"), "w", encoding="utf-8", newline='') as \
out_03_a, open(os.path.join(os.path.dirname(output_path),"final_subset.csv"), "w", encoding="utf-8", newline='') as \
out_04_b:
    
    cw01 = csv.writer(out_01, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw02 = csv.writer(out_02, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw03_a = csv.writer(out_03_a, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw04_b = csv.writer(out_04_b, delimiter=";", quoting=csv.QUOTE_MINIMAL)

    with open(output_path, encoding="utf-8") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        header = next(cr) 
        cw01.writerow(header)
        cw02.writerow(header)
        cw03_a.writerow(header)
        cw04_b.writerow(header)

        for line in cr:
            if (line[0][:2] =="01" and ...): cw01.writerow(line)  
            if (line[0][:2] =="02"): cw02.writerow(line)  
            if (line[0][:2] =="03" and ...): cw03_a.writerow(line)  
            if (line[0][:2] =="04" and ...): cw04_b.writerow(line)

Now my problem is first that I have many if statements and more than 04 files. Also some have subset notations like 04_a and 04_b. So now I do it for 04 files, there are way more than 20. Same number of if statements. So many, that I get an SyntaxError: too many statically nested blocks error, because there are more than 20 nested conditions. My current solution is to put the next conditions into a loop again. Not a good solution. This is inefficient. However, I also doubt my coding readiblity and the way I do it in general. So how can I have all this in a more efficient manner?

PSt
  • 97
  • 11

1 Answers1

0

The problem?

So I am not sure I understand your problem. I would assume that originally you went with some kind of if-else nesting that yielded the syntax error and that the solution you present is your fix but is not as efficient as it could be since the conditions in each if are actually mutually exclusive. Meaning that is the first one is true all the rest is false, yet you still check all of them.

Simple solution

If I understood the problem correctly, then the solution is simple, replace your if's by elif. elif is the contraction of else and if (duh...) and allows you to avoid big nested structures est follow:

# ...
for line in cr:
  if (line[0][:2] =="01" and ...): cw01.writerow(line)  
  elif (line[0][:2] =="02"): cw02.writerow(line)  
  elif (line[0][:2] =="03" and ...): cw03_a.writerow(line)  
  elif (line[0][:2] =="04" and ...): cw04_b.writerow(line)

It is true that this is still harder to read, but align your code nicely and this is already pretty acceptable. Although I will admit this leads to a lot of spaghetti code.

More complex solution (rework your code structure)

The way I see it you have actually only 2 parameters that you have to hardcode: your output file names and the related conditions. There is no way to avoid it. If we take a minimalist approach, these should be the only "pieces of spaghetti" in your code. All the other redundant lines of code could be avoided.

So I would start by defining those as a some kind of iterable object at the start of my file and then iterating over this list all through your code, avoiding repeating the same code line 20 times.

I don't think it's relevant for me to rewrite your code but here are a few pointers that will give you some good tools to do it well:

  • Your iterable could be one of the following: nested list, dict, numpy array, data class. I recommand numpy array which is a good compromise between ease of use and flexibility.
  • You can use python lambdas as a way to store your condition in your list.
  • You can use context manager (with) with a variable number of files in only two lines using contexlib.ExitStack as shown in this answer.
  • You can use break to exit the loop once you have found the condition that was true and switch faster to the next line.

So here is the idea:

  1. Write your array with conditions and as lambdas and file names (2-D) array
  2. Use context manager to open all of your output files in 2 lines
  3. Iterate over your open files to get a list of CSV writers
  4. Use a context manager to open your input file (as you already do))
  5. Iterate over your CSV writers to write the header
  6. For each line iterate over your conditions and write the line to a file with the relevant CSV writer (it's easy, it should be the one with the same index as your true condition)
  7. (Optional) For a little bit of extra speed break the condition iteration after having found the one that is true.
LNiederha
  • 911
  • 4
  • 18
  • Regarding your simple solution: it doesnt matter if I use elif instead, as I run into the same problem again that I have more than 20 elif statements. – PSt Jan 11 '23 at 12:48
  • Hum... does it still triggers the error even if they are not nested within eachother? Well if that's the case I'll edit my answer when I have time – LNiederha Jan 11 '23 at 15:56
  • Yes, I tried it with the elif statements (I have many, more than 20) and it does not work. – PSt Jan 11 '23 at 16:01