Split file and read content without saving files in python

Question

Currently I'm reading all files in a folder and, based on the logs, counting errors and successes per product. This worked until yesterday. Each log file used to hold information about a single product, but due to a technical glitch we started getting 2 products in a single file. We have fixed the issue going forward, but we still need the data below for analytics.

FileA

2022-03-28T11:53:50 Program Start
2022-03-28T11:53:50 PRODUCT "Screw"
2022-03-28T11:53:51 Code Executing
2022-03-28T11:53:51 ERROR
2022-03-28T11:53:52 Checking other stuffs like Order,Location....

FileB

2022-03-28T11:54:00 Program Start
2022-03-28T11:54:00 PRODUCT "Nut"
2022-03-28T11:54:01 Code Executing
2022-03-28T11:54:01 SUCCESS
2022-03-28T11:54:01 Checking other stuffs like Order,Location....

FileC

2022-03-28T11:55:01 Program Start
2022-03-28T11:55:01 PRODUCT "Washer"
2022-03-28T11:55:02 Code Executing
2022-03-28T11:55:02 ERROR
2022-03-28T11:55:02 Checking other stuffs like Order,Location....
2022-03-28T11:56:01 Program Start
2022-03-28T11:56:01 PRODUCT "Bolt"
2022-03-28T11:56:01 Code Executing
2022-03-28T11:56:01 SUCCESS
2022-03-28T11:56:02 Checking other stuffs like Order,Location....

Expected output -> Total : 4 Success : 2 Failed : 2

Also, I'm extracting product information and other details and writing them to separate files, so it is not only a count of successes and errors.

Success  Failed  OrderNo
Nut      Screw   1098 
Bolt     Washer  ...

Code I have developed to get the output:

import os
import re

path = r"\\myserver\logs"  # raw string, so the backslashes are kept literally
count = err = pss = 0      # counters must exist before they are incremented

files = os.listdir(path)
for file in files:
    if file.endswith(".log"):
        f = '/'.join([path,file])
        with open(f, encoding='utf8') as f:
            count += 1
            content = f.read()

            if 'ERROR' in content:
                err +=1
            else:
                pss+=1

print("Total-->",count)
print("Success-->",pss)
print("Failed-->",err)

Current Output -> Total : 3 Success : 1 Failed : 2

I have tried splitting and reading the file by following this post, but with no success. "Program Start" is the keyword for splitting. I have only read access to this log path, so I cannot save anything there. Is there a way I can achieve this on the fly? I have only limited knowledge of Python. I'd appreciate your guidance.

One problem is that you are incrementing `count` every time you open a file instead of each time you encounter a `Program Start`. — eandklahn, Mar 30 '22 at 10:41

jeroenflvr · Accepted Answer · 2022-03-30T10:18:51.873

If you're only interested in success and failure, and not for which product, then you could just count the occurrences of 'ERROR' and 'SUCCESS'.

So, with your fileC:

Program Start
PRODUCT "Washer"
Code Executing
ERROR
Checking other stuffs....
Program Start
PRODUCT "Bolt"
Code Executing
SUCCESS
Checking other stuffs....

you could do this, or just increment your counters directly with the results:

>>> f = open('fileC.log', "r")    
>>> content = f.read()
>>> content.count('ERROR')
1
>>> content.count('SUCCESS') 
1
>>>

If you want to match each result to its product, use a multiline regex.
If you have a lot of those files, maybe have a look at pandas or Spark.
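
As a rough sketch of the multiline-regex idea, assuming every run logs its PRODUCT line before its ERROR or SUCCESS line. The name PAIR and the optional timestamp prefix are my own assumptions; the sample text is fileC from the question.

```python
import re

# fileC from the question, with its timestamps.
content = (
    '2022-03-28T11:55:01 Program Start\n'
    '2022-03-28T11:55:01 PRODUCT "Washer"\n'
    '2022-03-28T11:55:02 Code Executing\n'
    '2022-03-28T11:55:02 ERROR\n'
    '2022-03-28T11:55:02 Checking other stuffs like Order,Location....\n'
    '2022-03-28T11:56:01 Program Start\n'
    '2022-03-28T11:56:01 PRODUCT "Bolt"\n'
    '2022-03-28T11:56:01 Code Executing\n'
    '2022-03-28T11:56:01 SUCCESS\n'
    '2022-03-28T11:56:02 Checking other stuffs like Order,Location....\n'
)

# Hypothetical pattern: capture the quoted product name, then lazily skip
# ahead to the first line that consists of an optional timestamp followed
# by ERROR or SUCCESS.
PAIR = re.compile(
    r'PRODUCT "([^"]+)"[\s\S]*?'
    r'^(?:\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} )?(ERROR|SUCCESS)$',
    re.MULTILINE,
)

pairs = PAIR.findall(content)  # list of (product, status) tuples
print(pairs)
```

If a run can end without either status line, the lazy match would pair that product with the next run's status, so splitting on 'Program Start' first is the safer route in that case.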

Edit, after the OP's remark about needing the other data:

Then you can use this regex to split on 'Program Start', and take it from there by splitting on newlines and indexing out whatever you want:

>>> content
'Program Start\nPRODUCT "Washer"\nCode Executing\nERROR\nChecking other stuffs....\nProgram Start\nPRODUCT "Bolt"\nCode Executing\nSUCCESS\nChecking other stuffs....\n'
>>> import re
>>> match = re.findall(r'Program Start(.*?)((?:(?!^Program Start)[\s\S])*)', content, re.MULTILINE)
>>> match
[('', '\nPRODUCT "Washer"\nCode Executing\nERROR\nChecking other stuffs....\n'), ('', '\nPRODUCT "Bolt"\nCode Executing\nSUCCESS\nChecking other stuffs....\n')]
>>>

or use str.split() to split on 'Program Start', and do the same:

>>> programs = content.split('Program Start')
>>> programs
['', '\nPRODUCT "Washer"\nCode Executing\nERROR\nChecking other stuffs....\n', '\nPRODUCT "Bolt"\nCode Executing\nSUCCESS\nChecking other stuffs....\n']
>>> for p in programs:
...     print(f'####\n{p}\n####\n')
...
####

####

####

PRODUCT "Washer"
Code Executing
ERROR
Checking other stuffs....

####

####

PRODUCT "Bolt"
Code Executing
SUCCESS
Checking other stuffs....

####

>>>

Edit, after the updated question added timestamps: just expand the regex to include the timestamp pattern:

>>> content
'2022-03-28T11:55:01 Program Start\n2022-03-28T11:55:01 PRODUCT "Washer"\n2022-03-28T11:55:02 Code Executing\n2022-03-28T11:55:02 ERROR\n2022-03-28T11:55:02 Checking other stuffs like Order,Location....\n2022-03-28T11:56:01 Program Start\n2022-03-28T11:56:01 PRODUCT "Bolt"\n2022-03-28T11:56:01 Code Executing\n2022-03-28T11:56:01 SUCCESS\n2022-03-28T11:56:02 Checking other stuffs like Order,Location....\n'
>>> match = re.findall(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\s+Program Start.*?)((?:(?!^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} Program\sStart)[\s\S])*)', content, re.MULTILINE)
>>> match
[('2022-03-28T11:55:01 Program Start', '\n2022-03-28T11:55:01 PRODUCT "Washer"\n2022-03-28T11:55:02 Code Executing\n2022-03-28T11:55:02 ERROR\n2022-03-28T11:55:02 Checking other stuffs like Order,Location....\n'), ('2022-03-28T11:56:01 Program Start', '\n2022-03-28T11:56:01 PRODUCT "Bolt"\n2022-03-28T11:56:01 Code Executing\n2022-03-28T11:56:01 SUCCESS\n2022-03-28T11:56:02 Checking other stuffs like Order,Location....\n')]
>>>

You can then split each block on newlines and use a regex to capture the timestamp and log event from every line.
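
A minimal sketch of that last step, assuming every line has the `timestamp event` layout of the sample. The helper name parse_runs and the dict keys are made up for illustration, and it walks the lines directly rather than reusing the findall result above.

```python
import re

# One log line: an ISO-like timestamp, whitespace, then the event text.
LINE = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(.*)$')

def parse_runs(content):
    """Return one dict per 'Program Start' block found in content."""
    runs = []
    for raw in content.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # skip blank or unexpected lines
        timestamp, event = m.group(1), m.group(2).strip()
        if event == 'Program Start':
            # A new run begins; later lines are attached to it.
            runs.append({'start': timestamp, 'product': None,
                         'status': None, 'events': []})
        elif runs:
            run = runs[-1]
            run['events'].append((timestamp, event))
            if event.startswith('PRODUCT'):
                # 'PRODUCT "Washer"' -> 'Washer'
                run['product'] = event.partition(' ')[2].strip('"')
            elif event in ('ERROR', 'SUCCESS'):
                run['status'] = event
    return runs

content = (
    '2022-03-28T11:55:01 Program Start\n'
    '2022-03-28T11:55:01 PRODUCT "Washer"\n'
    '2022-03-28T11:55:02 Code Executing\n'
    '2022-03-28T11:55:02 ERROR\n'
    '2022-03-28T11:55:02 Checking other stuffs like Order,Location....\n'
    '2022-03-28T11:56:01 Program Start\n'
    '2022-03-28T11:56:01 PRODUCT "Bolt"\n'
    '2022-03-28T11:56:01 Code Executing\n'
    '2022-03-28T11:56:01 SUCCESS\n'
    '2022-03-28T11:56:02 Checking other stuffs like Order,Location....\n'
)

runs = parse_runs(content)
for run in runs:
    print(run['start'], run['product'], run['status'])
```

Each run keeps its full list of (timestamp, event) pairs, so details such as the order or location lines can be pulled from run['events'] later.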

Thanks, but I need the products and other information as well. I need to split the file and read it to process the other information — Omega, Mar 30 '22 at 03:50
Thanks again, but there is a problem: it is not splitting properly because I have a timestamp on each line. I have updated the sample data; sorry for not including this earlier. Could you please help me overcome this? — Omega, Mar 30 '22 at 09:31

ytung-dev · Answer 2 · 2022-03-31T01:39:03.347

For simplicity, I suggest merging all the files into one file and then parsing that file.

import os

path = r'\\myserver\logs'      # raw string, so the backslashes are kept literally
output = './master-data.txt'   # written to the working directory, not to the log share

def merge():
    # Concatenate every .log file into one local file, keeping line endings as they are
    with open(output, 'w', newline='') as out:
        for f in os.listdir(path):
            if f.endswith(".log"):
                with open(os.path.join(path,f), 'r', newline='') as file:
                    for line in file:
                        out.write(line)

def parse():
    count   = 0
    pss = 0
    err = 0
    with open(output, 'r', newline='') as f:
        ## assumes Windows (\r\n) line endings; [1:] drops the text before the first run
        content = f.read().split('Program Start\r\n')[1:]
        for c in content:
            data = c.split('\r\n')
            ## data is the list of lines for one run (PRODUCT, Code Executing, ERROR/SUCCESS, Checking ...)
            for d in data:
                if 'ERROR' in d:
                    err  += 1
                if 'SUCCESS' in d:
                    pss += 1
            count += 1
            ## print(data)
    print("Total  -->",count)
    print("Success-->",pss)
    print("Failed -->",err)

merge()
parse()

Also, in your code, it is better to use os.path.join than '/'.join:

f = os.path.join(path, file)   # preferred
f = '/'.join([path,file])      # instead of this

Thanks, but it is not splitting properly and I'm not getting the expected output. Is it because I have timestamps? — Omega, Mar 30 '22 at 09:33
Answer updated; give it a try. I simply added a for loop that checks whether 'SUCCESS' or 'ERROR' is in the string for each line in data — ytung-dev, Mar 31 '22 at 01:39

eandklahn · Answer 3 · 2022-03-30T10:57:38.023

You mention splitting, but from what I read you are mainly after how many times a program was started and what its status was. You can get this by counting the number of occurrences of each, in the following way:

import os

# Open each .log file and append its lines to one list holding the content of all files
path = os.getcwd()
logs = [f for f in os.listdir(path) if f.endswith('.log')]
full_log = []
for f in logs:
    with open(os.path.join(path, f)) as fid:
        d = fid.readlines()
        full_log += d

# Join everything into one string and use str.count to get the occurrences
full_text = ' '.join(full_log)
count = full_text.count('Program Start')
success = full_text.count('SUCCESS')
error = full_text.count('ERROR')

print('Total --> ', count)
print('Success --> ', success)
print('Failed --> ', error)

You do mention that it is "not only count of success and errors"; however, you don't say what you expect in that case, so I did not include anything regarding that in my answer (yet... let me know).
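
For what it's worth, here is one way the per-product part might look without writing any file. This is only a sketch: the function names (split_runs, summarize, read_logs) are invented here, and it assumes each run contains a `PRODUCT "name"` line and at most one of ERROR or SUCCESS.

```python
import os
import re
from collections import defaultdict

RUN_START = 'Program Start'

def split_runs(text):
    """Split one log's text into per-run chunks, dropping what precedes the first run."""
    return text.split(RUN_START)[1:]

def summarize(texts):
    """Count runs and group product names by status across many log texts."""
    products = defaultdict(list)
    total = 0
    for text in texts:
        for chunk in split_runs(text):
            total += 1
            m = re.search(r'PRODUCT "([^"]+)"', chunk)
            name = m.group(1) if m else None
            if re.search(r'\bERROR\b', chunk):
                products['Failed'].append(name)
            elif re.search(r'\bSUCCESS\b', chunk):
                products['Success'].append(name)
    return total, dict(products)

def read_logs(folder):
    """Yield the text of every .log file in folder; nothing is written to disk."""
    for name in sorted(os.listdir(folder)):
        if name.endswith('.log'):
            with open(os.path.join(folder, name), encoding='utf8') as fh:
                yield fh.read()

# Small in-memory demo, so the sketch runs without access to the log share.
demo = ['2022-03-28T11:53:50 Program Start\n'
        '2022-03-28T11:53:50 PRODUCT "Screw"\n'
        '2022-03-28T11:53:51 ERROR\n']
print(summarize(demo))
```

With the real folder this would be called as `summarize(read_logs(path))`. A run that logs neither status is still counted in the total but lands in neither list.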
