Firstly, I'm very new to Python and programming in general so please bear with me if this is a stupidly obvious question.

I have an unknown number (possibly 10+) of log files mixed with other files in a directory, and I need to merge them into a single file with the lines sorted by the time stamp at the beginning of each line. The log files are .txt, and there are other non-log .txt files in the same directory, so I'm going to have the user of this script pass every log file as an argument.

Now before you mark this as a duplicate, I looked through 4 pages of search results on here and none of the questions have an answer I can use.

So far, I have the following sort-of working Python code:

log_file_name = 'logfile.txt'

import sys
import fileinput
from Tkinter import Tk
from tkFileDialog import askopenfilenames

logfile = open(log_file_name, 'w+')
logfile.truncate()
logfile.seek(0)

# get list of file names
print "Opening File Dialog"
Tk().withdraw()
files = askopenfilenames(title='Select all logs you would like to compile.')

for index in range(len(files)):
    print "Loop ", index
    print "--- Debug message: Reading a file... ---"
    logdata = open((files[index])).readlines()
    print "--- Debug message: Finished reading. Writing a file... ---"
    # turns logdata into a string and writes it to logfile
    logfile.write(''.join(logdata))
    logfile.write("\n")

print ""
print "Exited for loop."
logfile.close()

The above code puts the contents of all the files you select into a single text file, but it doesn't sort them.

I was thinking of using regex to search for numbers inside of brackets and then sort each line based on that...?

Here are some sample log file contents.

[xx.xxxxxx] [Text] Text : Text: xxx
[xx.xxxxxx] [Text] Text : Text: xxx
[xx.xxxxxx] [Text] Some text.
There could be multiple lines of text here
These lines could include [brackets.] :(

[xx.xxxxxx] [Text] Text : Text: xxx

The [xx.xxxxxx] is the time stamp in seconds since system startup.
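
A minimal sketch of the regex idea above, assuming every record starts with a bracketed number such as `[12.345678]` at the very beginning of the line. The names `TIMESTAMP_RE` and `timestamp_of` are made up for illustration. Converting the stamp to a float matters because a plain string sort puts `[100.25]` before `[92.5]`.

```python
import re

# Assumed format: a record starts with "[<seconds>]", e.g. "[12.345678] ...".
# The \s* allows padding inside the bracket; drop it if your logs have none.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\]")

def timestamp_of(line):
    """Return the leading time stamp as a float, or None for lines without one."""
    match = TIMESTAMP_RE.match(line)
    return float(match.group(1)) if match else None

sample = [
    "[100.250000] [Text] later message",
    "[92.500000] [Text] earlier message",
    "These lines could include [brackets.] :(",
]

# One entry per line; None marks a continuation line.
stamps = [timestamp_of(line) for line in sample]

# Sorting the stamped lines by their numeric value, not by their text.
by_time = sorted(sample[:2], key=timestamp_of)
```

Brackets that appear later in a line are ignored because the pattern is anchored to the start of the line.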

  • What is the layout of the log files (where is the time stamp), and how big will the resulting file be? If the resulting file fits easily in memory, you can use a simple sort. If not, you have to divide the records, sort the first group (the earliest time stamps, for example), write it to the file, sort the next group, write it, and so on. –  Aug 10 '15 at 19:14
  • @CurlyJoe I edited my question to add some sample log text. –  Aug 10 '15 at 19:19
  • @CurlyJoe It would be completely fine to load all the logs into memory. –  Aug 10 '15 at 19:30

2 Answers

Since the time stamp is at the beginning of each record, you can just sort on it. If that takes too long, you might want to sort each log file on input and then merge the sorted files into the final list.

import pprint

file_1="""[92.5] Text Text : Text: xxx
[91.5] Text Text : Text: xxx"""

file_2="""[91.7] [Text] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also. 
[90.5] [Text] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also."""

## Write data to some test log files
with open("./log_1.txt", "w") as fp_out:
    fp_out.write(file_1)
with open("./log_2.txt", "w") as fp_out:
    fp_out.write(file_2)

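def input_recs(f_name):
    ## close the file when done instead of leaving the handle open
    with open(f_name, "r") as fp_in:
        recs=fp_in.readlines()
    ## assume you want to omit text only lines; the slice rec[1:2]
    ## avoids an IndexError on blank lines
    return_list=[rec.strip() for rec in recs
                 if rec.startswith("[") and rec[1:2].isdigit()]
    return return_list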

sorted_list=[]
for f_name in ["log_1.txt", "log_2.txt"]:
    recs=input_recs(f_name)
    sorted_list.extend(recs)

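## sort on the numeric time stamp; a plain sort() compares strings,
## which would put "[100.5]" before "[92.5]"
sorted_list.sort(key=lambda rec: float(rec[1:rec.index("]")]))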
pprint.pprint(sorted_list)
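
If sorting everything at once is too slow, the "sort each log file on input and merge" suggestion could look roughly like the sketch below. It uses `heapq.merge` from the standard library, which lazily merges inputs that are already sorted. The helper `keyed_records` and the in-memory sample lists are made up for illustration, and, like the code above, it skips lines that have no time stamp.

```python
import heapq
import re

# Assumed record format: "[<seconds>] ..." at the start of the line.
STAMP_RE = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\]")

def keyed_records(lines):
    """Return sorted (time_stamp, record) pairs, skipping lines without a stamp."""
    pairs = []
    for line in lines:
        match = STAMP_RE.match(line)
        if match:
            pairs.append((float(match.group(1)), line.rstrip("\n")))
    pairs.sort()
    return pairs

# Stand-ins for the lines read from two log files.
log_1 = ["[92.5] Text Text : Text: xxx\n", "[91.5] Text Text : Text: xxx\n"]
log_2 = ["[91.7] [Text] Some text.\n", "[90.5] [Text] Some text.\n"]

# heapq.merge expects each input to be sorted already and yields the
# combined sequence in order without sorting it again.
merged = [rec for _, rec in heapq.merge(keyed_records(log_1), keyed_records(log_2))]
```

Pairing each record with its float time stamp keeps the comparison numeric and avoids relying on the `key` argument of `heapq.merge`, which older Python versions do not have.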
  • I want to keep all the contents of the logs intact, and the script needs to be able to handle an unlimited number of arguments (which would be the log file names) but otherwise this looks good. I'll try it tomorrow. :) –  Aug 10 '15 at 22:21
  • Take a look at http://stackoverflow.com/questions/3579568/choosing-a-file-in-python-with-simple-dialog to use Tkinter to select files (many other examples on the web). –  Aug 10 '15 at 23:09
  • Wow that would be so much better than having users type in the filenames, thanks! –  Aug 11 '15 at 14:27
  • Could you please take a look at my updated code, @CurlyJoe? –  Aug 12 '15 at 21:53
  • I couldn't get your code to work. I'm very very new to this so I apologize if it's something stupid. I can get you the error message tomorrow if you need it. –  Aug 12 '15 at 23:00
  • What does "I couldn't get your code to work" mean? Print files[index] on each pass through the loop. You may have to add the directory path. Note also that you do not sort anywhere, nor do you add the result of each readlines() to a list in memory that can be sorted. –  Aug 12 '15 at 23:56
  • I don't want to omit anything, I need the log data completely intact. I can't figure out how to change this code so that it does that; all I can get it to do is print all the lines in alphabetical order which mixes all the messages together. I need it to recognize each message and then sort the messages, not each line. –  Aug 13 '15 at 14:45
  • I'm also having trouble understanding what this code is doing exactly. What is 'recs' short for...? –  Aug 13 '15 at 14:59
  • recs is short for records. –  Aug 13 '15 at 18:14
  • Well, your answer doesn't work for me completely but you've helped me a lot in solving this so I marked it as correct. To anyone finding this on Google with a similar problem: Use regex to find the messages in your log, and make each message an element inside of a list. Then sort the list with sort(). –  Aug 13 '15 at 18:29
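
The approach described in that last comment could be sketched as follows. This is only an illustration under assumptions: a message is taken to start at any line beginning with `[<number>]` and to run until the next such line, and `RECORD_START` and `split_messages` are hypothetical names. Any text before the first time stamp is dropped, and a continuation line that itself begins with a bracketed number would be mistaken for a new message.

```python
import re

# Assumed: a message starts at a line beginning with "[<seconds>]".
RECORD_START = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\]", re.MULTILINE)

def split_messages(text):
    """Split log text into (time_stamp, message) pairs, keeping continuation lines."""
    starts = [(m.start(), float(m.group(1))) for m in RECORD_START.finditer(text)]
    messages = []
    for i, (pos, stamp) in enumerate(starts):
        # A message ends where the next stamped line begins, or at end of text.
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        messages.append((stamp, text[pos:end]))
    return messages

text = (
    "[100.5] [Text] late\n"
    "[92.5] [Text] Some text.\n"
    "There could be multiple lines of text here\n"
    "These lines could include [brackets.] :(\n"
    "\n"
    "[91.5] [Text] early\n"
)

messages = split_messages(text)
# Sort whole messages by their numeric time stamp, then glue them back together.
messages.sort(key=lambda pair: pair[0])
merged = "".join(message for _, message in messages)
```

To merge several files, the text of each file can be passed through `split_messages` and the resulting lists combined before the sort. Because Python's sort is stable, messages with equal time stamps keep their original relative order.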

When you don't get good answers, it means that you aren't asking good questions. What does "recognize each message and then sort the messages, not each line" mean? For the purpose of illustrating how to do this in general, I will assume that you want the lines that do not have time stamps to be kept with the previous time-stamped line. You have to get the data into a form that can be sorted by record. There are two ways to do this: a dictionary or a list of lists. The following uses a list and simply appends each non-time-stamp line to the previous time-stamp record, so every record starts with a time stamp and the list can be sorted. By now you should understand the general principle involved.

file_1="""[92.5] [Text1] Text : Text: xxx
[91.5] [Text2] Text : Text: xxx
[92.5] [Text2.5] Some text.
[90.5] [Text3] Some text"""

file_2="""[91.7] [Text4] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also. 
[90.5] [Text5] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also."""

## Write data to some test log files
with open("./log_1.txt", "w") as fp_out:
    fp_out.write(file_1)
with open("./log_2.txt", "w") as fp_out:
    fp_out.write(file_2)

def input_recs(f_name):
    return_list=[]
    append_rec=""
    with open(f_name, "r") as fp_in:
        for rec in fp_in:
            ## the slice rec[1:2] avoids an IndexError on blank lines
            if rec.startswith("[") and rec[1:2].isdigit():
                ## next time stamp so add append_rec to return_list and
                ## create a new append_rec that contains this record
                if len(append_rec): 
                    return_list.append(append_rec)
                append_rec=rec
            else:
                append_rec += rec  ## not a time stamp

    ## add last rec
    if len(append_rec): 
        return_list.append(append_rec)

    return return_list

sorted_list=[]
for f_name in ["log_1.txt", "log_2.txt"]:
    recs_list=input_recs(f_name)
    sorted_list.extend(recs_list)

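## sort on the numeric time stamp (assumes every record starts with
## "[<number>]"); a plain sort() compares strings, so "[100.5]" would
## come before "[92.5]"
sorted_list.sort(key=lambda rec: float(rec[1:rec.index("]")]))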
import pprint
pprint.pprint(sorted_list)  ## newlines are retained
  • What I meant by recognizing messages rather than lines is that each message may contain several line breaks, so I couldn't just sort the lines. –  May 25 '18 at 17:28