Python script for searching variable strings between two constant strings

Question

import re

infile = open('document.txt','r')
outfile= open('output.txt','w')
copy = False
for line in infile:

    if line.strip() == "--operation():":
        bucket = []
        copy = True

    elif line.strip() == "StartOperation":
        for strings in bucket:
            outfile.write( strings + ',')
        for strings in bucket:
            outfile.write('\n')
        copy = False

    elif copy:
        bucket.append(line.strip())

CSV format is like this:

id,          name,                poid,         error
5896, AutoAuthOSUserSubmit,     900105270,      0x4002

My log file has several sections starting with ==== START ==== and ending with ==== END ====. I want to extract the string between --operation(): and StartOperation, for example AutoAuthOSUserSubmit. I also want to extract the poid value from the line containing poid: 900105270, poidLen: 9. Finally, I want to extract the return value, e.g. 0x4002, if "Roll back all updates" is found after it.

I am not even able to extract the original text when the start and end markers are not on the same line. How do I go about doing that?

This is a sample LOG extract with two paragraphs:

-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] **==== START ====**
open file /ecas/public/onsite-be/config/timer.conf failed
INFO 08/24/16 02:07:56  salt1be-d1-ap(**5896**/0)  main.c(780*****):--operation(): AutoAuthOSUserSubmit. StartOperation*****
INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(784):--Client Information: Request from host 'malt-d1-wb' process id 12382.
DEBUG 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  TOci.cc(571):FetchServiceObjects: ServiceCert.sql
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  vsserviceagent.cpp(517):Generate Certificate 2: c1cd00d5c3de082360a08730fef9cd1d
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  junk.c(1373):GenerateWebPin : poid: **900105270**, poidLen: 9
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  junk.c(1408):GenerateWebPin : pinStr 
DEBUG 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  uaadapter_vasco_totp.c(275):UAVascoTOTPImpl.close() -- Releasing Adapter Context
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  vsenterprise.cpp(288):VSEnterprise::Engage returns 0x4002 - Unknown error code **(0x4002)**
ERROR 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  vsautoauth.cpp(696):OSAAEndUserEnroll: error occurred. **Roll back** all updates!
INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  uaotptokenstoreqmimpl.cpp(199):Close token store
INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(990):-- EndOperation
-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] **==== END   ====**
    OPERATION = AutoAuthOSUserSubmit, rc = 0x0 (0)
    SYSINFO Elapse = 0.687, Heap = 1334K, Stack = 64K

score 1 · Answer 1 · edited May 23 '17 at 12:14

It looks like you are simply trying to find strings within the LOG document and parse its lines using keywords. You can go line by line, which is what you are doing currently, or you could go through the document once (assuming the LOG document never gets huge) and append each line to a single string that you then search as a whole.

Check this out for finding substrings: http://www.tutorialspoint.com/python/string_index.htm. It covers finding where one string is located within another, which will help you determine a start index and an end index. Once you have those, you can slice out the information you want.
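As a rough sketch of that index-based idea, the snippet below slices out the text between two markers. The helper name `between` is made up for illustration, and it uses `str.find` (which returns -1 when nothing is found) rather than `str.index` (which raises `ValueError`); the sample line is the one from the question with the emphasis asterisks removed.

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the start marker itself
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()


# Sample line from the question, assuming the asterisks were only emphasis.
line = ("INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  "
        "main.c(780):--operation(): AutoAuthOSUserSubmit. StartOperation")
name = between(line, "--operation():", ". StartOperation")
print(name)  # AutoAuthOSUserSubmit
```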

Check this out for your CSV problem: http://www.tutorialspoint.com/python/string_split.htm. It covers splitting a string around a specific character, i.e. "," for your CSV files.
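For the CSV side, here is a minimal sketch, assuming Python 3 and that the four fields have already been extracted. It writes the row with the standard `csv` module into an in-memory buffer (a stand-in for output.txt) and then splits a line back apart with `str.split`; the values are the ones from the question.

```python
import csv
import io

# Values taken from the question's example row.
rows = [("5896", "AutoAuthOSUserSubmit", "900105270", "0x4002")]

buffer = io.StringIO()                 # stands in for an open output file
writer = csv.writer(buffer)
writer.writerow(["id", "name", "poid", "error"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Going the other way: split a simple CSV line around the commas.
fields = csv_text.splitlines()[1].split(",")
print(fields)
```

Plain `split(",")` is fine for values like these; the `csv` module is the safer choice once a field can itself contain a comma or a quote.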

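The Stack Overflow question "Does Python have a string contains substring method?" will be more useful than your current approach of comparing line.strip() to the keyword: that comparison only succeeds when the keyword is the entire line, whereas your keywords appear in the middle of longer log lines.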
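A small illustration of the difference, using the operation line from the question with the emphasis asterisks removed (an assumption about what the real log contains):

```python
line = ("INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  "
        "main.c(780):--operation(): AutoAuthOSUserSubmit. StartOperation\n")

# Equality only succeeds when the keyword is the whole line.
exact = line.strip() == "--operation():"

# The `in` operator tests whether the keyword occurs anywhere in the line.
contains = "--operation():" in line

print(exact, contains)  # False True
```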

Hopefully this will point you in the right direction!

score 1 · Accepted Answer · answered Aug 27 '16 at 01:44

This looks like a job for Regular Expressions! Several in fact. Thankfully, they are not very complicated to use in this case.

There are 2 main observations that would make me choose regexes over something else:

Need to extract one bit of variable text from between two known constant values
Need to follow this same pattern several times for different strings

You can try something like this:

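import re

def capture(text, pattern_string, flags=0):
    """Return the first capturing group of the first match, or '' if none."""
    pattern = re.compile(pattern_string, flags)
    match = pattern.search(text)
    if match:
        output = match.group(1)
        print('{}\n'.format(output))  # works on both Python 2.7 and 3
        return output
    return ''

if __name__ == '__main__':
    # read_my_file() is deliberately left undefined: supply your own function
    # that returns the whole log as a single string.
    file = read_my_file()

    # Raw strings keep the backslashes intact for the regex engine.
    # Non-greedy (.+?) stops at the first END marker rather than the last one.
    log_pattern = r"\*\*==== START ====\*\*(.+?)\*\*==== END   ====\*\*"
    log_text = capture(file, log_pattern, flags=re.MULTILINE|re.DOTALL)

    # \. matches the literal period after the operation name.
    op_pattern = r"--operation\(\): (.+)\. StartOperation\*\*\*\*\*"
    op_name = capture(log_text, op_pattern)

    poid_pattern = r"poid: \*\*([\d]+)\*\*, poidLen: "
    poid = capture(log_text, poid_pattern)

    retcode_pattern = r"Unknown error code \*\*\((.+?)\)\*\*.+?\*\*Roll back\*\* all updates!"
    retcode = capture(log_text, retcode_pattern, flags=re.MULTILINE|re.DOTALL)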

This approach essentially divides up the problem into several largely independent steps. I'm using capturing groups in each regex - the parens like (.+) and ([\d]+) - in between long strings of constant characters. The dotall flag makes . match line breaks too, so a pattern can span several lines of the log and treat the breaks just like any other part of the string. The multiline flag only changes how ^ and $ behave, so it is harmless here but not strictly required.
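To see the same steps run end to end, here is a self-contained sketch on a shortened copy of the sample log. It rests on one assumption that differs from the patterns above: the asterisks in the posted sample are treated as emphasis markup that is not present in the real log, so they are left out of both the text and the patterns. The helper name `first_group` is made up for this example.

```python
import re

# Shortened sample from the question, with the emphasis asterisks removed.
sample = """\
-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] ==== START ====
INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(780):--operation(): AutoAuthOSUserSubmit. StartOperation
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  junk.c(1373):GenerateWebPin : poid: 900105270, poidLen: 9
DEBUG 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  vsenterprise.cpp(288):VSEnterprise::Engage returns 0x4002 - Unknown error code (0x4002)
ERROR 08/22/16 23:15:53  pepper1be-d1-ap(2680/0)  vsautoauth.cpp(696):OSAAEndUserEnroll: error occurred. Roll back all updates!
-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] ==== END   ====
"""


def first_group(pattern, text, flags=0):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else ""


# Step 1: isolate one section (DOTALL lets . cross line breaks).
section = first_group(r"==== START ====(.+?)==== END\s+====", sample, re.DOTALL)

# Step 2: pull each field out of that section.
name = first_group(r"--operation\(\): (.+?)\. StartOperation", section)
poid = first_group(r"poid: (\d+), poidLen: ", section)
# The lookahead keeps the code only if the roll-back message follows it.
error = first_group(
    r"Unknown error code \((0x[0-9A-Fa-f]+)\)(?=.+?Roll back all updates!)",
    section,
    re.DOTALL,
)

print(",".join([name, poid, error]))  # AutoAuthOSUserSubmit,900105270,0x4002
```

If the real log does contain the asterisks, the escaped `\*\*` pieces from the patterns above would need to go back in.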

I'm also making a big assumption here and that is your logs are not huge files, maybe a few hundred megabytes tops. Note the call to read_my_file() - rather than try to solve this problem a line at a time, I chose to read the entire file and work in memory. If the files get really big though, or you're building an app that will get a lot of traffic, this may be a bad idea.
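If reading the whole file is not an option, a line-at-a-time variant is possible for fields that sit on a single line, such as the operation name. The sketch below is one way to do it, not the approach shown above; `operation_names` is a hypothetical helper, and the demo list stands in for an open file (again assuming the asterisks are not in the real log).

```python
import re

# Compiled once, reused for every line.
OP_RE = re.compile(r"--operation\(\): (.+?)\. StartOperation")


def operation_names(lines):
    """Yield each operation name found, looking at one line at a time."""
    for line in lines:
        match = OP_RE.search(line)
        if match:
            yield match.group(1)


# Any iterable of lines works; an open file object yields lines lazily,
# so only one line is held in memory at a time.
demo = [
    "INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(780):--operation(): AutoAuthOSUserSubmit. StartOperation\n",
    "INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(784):--Client Information: Request from host 'malt-d1-wb' process id 12382.\n",
]
names = list(operation_names(demo))
print(names)  # ['AutoAuthOSUserSubmit']
```

With a real log you would pass the open file object itself instead of `demo`, so the file is never loaded in full.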

Hope this helps!

Thanks for the lead. I have one big log file, nearly 2 GB. Let's say I just want to extract the 'name' field only, e.g. AutoAuthOSUserSubmit. The code is failing with errors. Can you give tested code for the name field only? I will try to work out the other fields. — Aryabhatta, Aug 27 '16 at 15:05
@Tim one reason it might be failing is because I didn't define `read_my_file()` - this was intentional, and is best left to you since I don't know where your logs are coming from. Also, I am using python 2.7 - if you're using 3 it's possible my code doesn't work. What's the error? Which line? 2GB might be a little too big to read into memory anyway, and it might be good to chop the file up into smaller files first as a preprocessing step. — killthrush, Aug 27 '16 at 17:45
Also, your log may not be as 'constant' as you thought. A lot of garbage can hide in 2GB worth of text. — killthrush, Aug 27 '16 at 17:51
The more I think about it, you really should try to split up that big file into smaller files first for two reasons - 1) a smaller file can be read into memory easily and 2) if something goes wrong with the parsing you'll be able to pinpoint exactly which operation caused it. For splitting, you can match on `\*\*==== START ====\*\*` then write all lines to a new file until you match `\*\*==== END   ====\*\*`. After you have smaller files, the `--operation\(\): (.+)\. StartOperation\*\*\*\*\*` pattern can capture the name. The rest of the fields are handled similarly, as the example shows — killthrush, Aug 27 '16 at 17:56
$ python search4.py File "search4.py", line 23 with("mylogfile.log") as f: ^ SyntaxError: invalid syntax — Aryabhatta, Aug 27 '16 at 18:09
@Tim - see the comment attached to your example. Your error appears to be related to older versions of python such as 2.4 combined with `with` syntax. Regarding the rest, well it's up to you. You have a working example of a solution and a recommendation for a way forward. — killthrush, Aug 27 '16 at 18:29
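
The splitting step suggested in the comments can be sketched roughly as follows. Instead of writing each section to its own file, this version yields one section at a time from a generator, which keeps memory use small in the same way; that substitution, the function name `sections`, and the plain substring tests for the markers (with the asterisks assumed absent from the real log) are all choices made for this illustration.

```python
def sections(lines):
    """Yield each START..END section as a list of lines, one at a time."""
    current = None                      # None means "not inside a section"
    for line in lines:
        if "==== START ====" in line:
            current = []                # begin collecting a new section
        elif "==== END" in line and current is not None:
            yield current               # section complete, hand it over
            current = None
        elif current is not None:
            current.append(line.rstrip("\n"))


# Stand-in for an open log file; lines taken from the question's sample.
demo = [
    "-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] ==== START ====\n",
    "open file /ecas/public/onsite-be/config/timer.conf failed\n",
    "INFO 08/24/16 02:07:56  salt1be-d1-ap(5896/0)  main.c(990):-- EndOperation\n",
    "-- 08/24 02:07:56 [mds.ecas(5896) ECAS_CP1] ==== END   ====\n",
    "    OPERATION = AutoAuthOSUserSubmit, rc = 0x0 (0)\n",
]
found = list(sections(demo))
for number, body in enumerate(found, start=1):
    print(number, len(body))            # 1 2
```

Each yielded section can be joined with "\n" and searched with the field patterns from the accepted answer, and a section that fails to parse can be printed on its own to see what went wrong.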
