0

I am checking the position of semicolons in text files. I have length-delimited text files with thousands of rows that look like this:

AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;

I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:

import glob

path = r'C:\path\*.txt'

for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3]!=";" 
                or line[4:5]!=";" 
                or line[10:11]!=";"
               # really a lot of continuing entries like these
                or line[14:15]!=";"
                ):
                print("\nSemicolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)

The code works. No error is thrown and it detects the data row.

My problem now is that I have many semicolons to check, so the condition contains a great many entries like

or line[xx:xx]!=";"

I think this is inefficient in two respects:

  1. Visually, so many lines of code are unpleasant to read. I think it could be shortened.
  2. Logically, so many separate or checks seem inefficient. A better formulation would probably decrease the runtime.

I am looking for an efficient solution that:

  1. Improves readability
  2. Most importantly, reduces the runtime (I think the current version is inefficient, with all the or statements)

I only want to check whether there are semicolons where I expect them, that is, where I need them. I do not care about any additional semicolons in the data fields.

PSt
  • Why aren't you parsing the file as a CSV? What is your actual goal? To parse the file, or just check the structure? Why not use a regex on each line like `[A-Z]{2};\d;\d{5};` etc. – Tomerikoo Jan 02 '23 at 09:48
  • No. The file cannot be parsed as a csv, as it is NOT a comma or any other "sign" separated file. It is a length-delimited txt file. My question is specifically about checking the position of semicolon at the expected position. The goal is to check if there are semicolons, where I would expect them. Of course, there could be many more at different locations! But these are the positions where I have to make sure that there are semicolons. – PSt Jan 02 '23 at 09:57
  • It actually _is_ a delimited file (under the umbrella of CSV, which supports basically any kind of delimiter, not just commas)... you're simply delimiting each column with a semicolon instead of a comma, and each column has a fixed length. – TylerH Jan 11 '23 at 17:34

2 Answers

3

Just going off of what you've written:

filename = ...

with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14)  # Indices in each line where a semicolon is expected.
for line_num, line in enumerate(lines, 1):  # Start at 1 so reported line numbers match the file.
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")

If the line doesn't have at least 15 characters, this will raise an exception. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
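
The code in the question avoids this exception because it uses slices (line[2:3]) rather than indices (line[2]): slicing past the end of a string returns an empty string instead of raising. A minimal sketch of the difference, using a hypothetical short line and a hypothetical helper name has_missing_semicolon:

```python
short_line = "FE53234;543;\n"  # hypothetical line, 13 characters including the newline

# Slicing past the end yields an empty string, so the comparison is simply True.
slice_result = short_line[14:15]
print(repr(slice_result))
print(slice_result != ";")

# Indexing past the end raises IndexError instead.
try:
    short_line[14]
    raised = False
except IndexError:
    raised = True
print(raised)

# A slice-based variant of the any() check, which therefore tolerates short lines.
delimiter_indices = (2, 4, 10, 14)

def has_missing_semicolon(line):
    return any(line[i:i + 1] != ";" for i in delimiter_indices)
```

With slices, a short or blank line is reported as having a missing semicolon rather than stopping the program.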


EDIT: Assuming you have an input file that looks like:

AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;

(note the blank line at the end), my solution works fine: I see neither exceptions nor any "Semicolon expected on line #..." output.

If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If any line in your file (not counting the last line) is fewer than 15 characters long, this will raise an exception.

You could simply say that every line must meet two criteria to be considered valid:

  1. The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
  2. All characters at delimiter indices in the current line must be semicolons.

Code:

for line_num, line in enumerate(lines, 1):  # Start at 1 so reported line numbers match the file.
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)

    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")

EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:

is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")

If the length of the line is not correct, the second half of the expression will not be evaluated due to short-circuit evaluation, which should prevent the IndexError.
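
The short-circuit behaviour described above can be checked in isolation. This is only a sketch: it wraps the same expression in a function (the function wrapper and the sample lines are assumptions for illustration) and runs it on a valid line, a blank line, a short line, and a line with an extra semicolon:

```python
delimiter_indices = (2, 4, 10, 14)  # positions taken from the example above
min_line_length = max(delimiter_indices) + 1

def is_valid_line(line):
    # The length test runs first; `and` skips the indexing when it fails,
    # so a short or blank line never reaches line[index].
    return len(line) >= min_line_length and all(
        line[index] == ";" for index in delimiter_indices
    )

sample_lines = ["AB;2;43234;343;\n", "\n", "FE53234;543;\n", "FE;5;53;34;543;\n"]
results = [is_valid_line(line) for line in sample_lines]
print(results)  # [True, False, False, True]
```

No IndexError is raised for the blank or short line, because all(...) is never evaluated for them.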


EDIT: Since you have so many files with so many lines and so many semicolons per line, you could compute max(delimiter_indices) once before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also iterate over the file object directly (which yields the next line on each iteration) instead of loading the entire file into memory via lines = file.readlines(). It isn't strictly required, and it's not as cute as using all or any, but I also turned the has_correct_semicolons expression into an explicit loop over the delimiter indices. That way the error message can be more specific, pointing to the offending index of the offending line. There is also a separate error message for lines that are too short.

import glob
import os

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

for path in glob.glob(r"C:\path\*.txt"):
    # glob.glob returns plain strings, so take the file name via os.path.
    filename = os.path.basename(path)
    print(filename.center(32, "-"))
    with open(path) as file:
        # Iterate over the file object directly; start counting at 1.
        for line_num, line in enumerate(file, 1):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue

            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break

            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")
Paul M.
  • Thanks for your solution, it works. However, there is a problem: for some reason your solution gives an additional output compared to my original code when the last line in a file is empty. In that case your code prints a message, but mine does not. I don't know why; I have updated my question. I don't want a check/print for this last empty line. – PSt Jan 02 '23 at 10:31
  • @PSt [Easiest way to ignore blank lines when reading a file in Python](https://stackoverflow.com/q/4842057/6045800) – Tomerikoo Jan 02 '23 at 10:35
  • No, I cannot ignore blank lines in between, as this is exactly what I have to detect by checking the semicolons. I just need to ignore the last line. So I need the same output as my code gives; I just wanted an efficient implementation of all the or combinations. – PSt Jan 02 '23 at 10:37
  • @PSt I don't get an extra print from the original nor from this one. It seems that `readlines` filters the last line if it's empty – Tomerikoo Jan 02 '23 at 10:46
  • @Paul M. Could you check your solution again please, as I get a wrong output. If I run your code I get a print statement for each line that it expects a semicolon. There is somehow a bug in it. Do you get the same output compared to my code? – PSt Jan 02 '23 at 10:47
  • Your solution does not lead to the same output as mine, as I get an error IndexError: string index out of range in case there is an invalid line. However, I need the program to continue running, I just want a print statement, not an error. So this happens if there is a line which does not have the referenced index. – PSt Jan 02 '23 at 11:00
  • @PSt I've edited my post, but I don't see this issue with your example input file. Does your actual input file contain blank lines in the middle, or lines that are shorter than 15 characters? – Paul M. Jan 02 '23 at 11:53
  • Thanks for the update. I have edited and updated my question. I have now an example where my original code works, however yours leads to an indexerror. Could you maybe please check again? Thanks a lot for your ongoing help! – PSt Jan 02 '23 at 12:49
  • @PSt Sure - I've edited my post one more time. – Paul M. Jan 02 '23 at 12:55
  • Ok, thanks for the update. It works now; however, this solution is not more efficient than my original code, as it has a slightly longer runtime. – PSt Jan 02 '23 at 13:14
  • @PSt how many files are you processing, and how many lines per file? How many semicolons per line? – Paul M. Jan 02 '23 at 18:24
  • Approx. 345 files and each file is between a few mb and almost 1 GB. I expect approx. 80 semicolons. So these are the semicolon positions I want to check. – PSt Jan 03 '23 at 08:43
  • @PSt That certainly puts things in perspective - I've edited my post one more time. – Paul M. Jan 03 '23 at 10:07
-1

If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:

import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)

Regex demo here.

If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"

Demo for the dot regex.
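
With many semicolon positions, writing the dot regex by hand becomes error-prone. As a rough sketch, the pattern could be generated from a tuple of 0-based positions; the helper name build_position_pattern is hypothetical, and the positions (2, 4, 10, 14) are taken from the question:

```python
import re

def build_position_pattern(delimiter_indices):
    """Build a dot regex requiring ';' at each given 0-based index."""
    parts = []
    previous = -1
    for index in sorted(delimiter_indices):
        gap = index - previous - 1  # characters between consecutive semicolons
        parts.append(f".{{{gap}}};")
        previous = index
    return re.compile("".join(parts))

pattern = build_position_pattern((2, 4, 10, 14))
print(pattern.pattern)  # .{2};.{1};.{5};.{3};

sample_lines = ["AB;2;43234;343;\n", "FE53234;543;\n", "\n", "FE;5;53;34;543;\n"]
matches = [pattern.match(line) is not None for line in sample_lines]
print(matches)  # [True, False, False, True]
```

Because re.match anchors at the start of the line and . does not match a newline by default, short and blank lines simply fail to match instead of raising an error.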

Tomerikoo
  • It is neither necessary nor efficient for me to check the structure of the lines with r"[A-Z]{2};\d;\d{5};\d{3}". It is sufficient to use r".{2};.;.{5};.{3};", as I do not care about the content, only about the position. Which is more efficient, your regex solution or Paul M.'s solution? Regarding your regex solution: do the numbers in {} specify the expected content length? What does the dot do? Why is there a dot without a {} and a number in it? – PSt Jan 02 '23 at 10:16
  • A `.` in regex is a "match-all" symbol, so if you don't care about the actual content, you can use it. The `{}` with a number is the repetition count, which I inferred from your example. I added links to demos to help you understand the regex mechanics @PSt – Tomerikoo Jan 02 '23 at 10:22
  • Thanks for the help. But what is the most efficient way regarding reduction of the runtime? Use a regex approach or Paul's solution? – PSt Jan 02 '23 at 10:22
  • @PSt You can check for yourself [Is there any simple way to benchmark Python script?](https://stackoverflow.com/q/1593019/6045800) – Tomerikoo Jan 02 '23 at 10:24
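
One possible way to follow that suggestion is the standard-library timeit module. The sketch below is an assumption-laden illustration, not a measurement: the synthetic lines, the function names, and the repetition count are all made up, and results on real files of several hundred megabytes may differ considerably.

```python
import re
import timeit

delimiter_indices = (2, 4, 10, 14)
min_line_length = max(delimiter_indices) + 1
pattern = re.compile(r".{2};.;.{5};.{3};")

# Synthetic data for illustration only; real files will behave differently.
lines = ["AB;2;43234;343;\n", "FE53234;543;\n", "FE;5;53;34;543;\n"] * 10_000

def count_invalid_indexing(lines):
    # Length check first, then index-based semicolon checks.
    return sum(
        1 for line in lines
        if len(line) < min_line_length
        or any(line[i] != ";" for i in delimiter_indices)
    )

def count_invalid_regex(lines):
    # Precompiled dot regex, anchored at the start of each line.
    match = pattern.match
    return sum(1 for line in lines if match(line) is None)

for func in (count_invalid_indexing, count_invalid_regex):
    seconds = timeit.timeit(lambda: func(lines), number=5)
    print(f"{func.__name__}: {seconds:.3f} s for 5 runs")
```

Both functions should report the same number of invalid lines, which is worth confirming before comparing their timings.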