
So I have a lot of log txt files that look somewhat like this:

2021-04-01T12:54:38.156Z START RequestId: 123 Version: $LATEST

2021-04-01T12:54:42.356Z END RequestId: 123

2021-04-01T12:54:42.356Z REPORT RequestId: 123  Duration: 4194.14 ms    Billed Duration: 4195 ms    Memory Size: 2048 MB    Max Memory Used: 608 MB 

I need to create a pandas dataframe from this data with the following features, where each row would represent one log:

DateTime, Keyword (start/end), RequestId, Duration, BilledDuration, MemorySize, MaxMemoryUsed

The problem is that each file has a different length and there are different types of logs, so not every line looks the same, but there are patterns. I've never used RegEx, but I think that's what I have to use. So is there a way to transform these strings into a dataset?

(my goal is to perform memory usage anomaly detection)

  • what is the expected dataframe? What columns/info you need? – Sreeram TP Sep 16 '21 at 13:02
  • DateTime, RequestId, Duration, BilledDuration, MemorySize, MaxMemoryUsed – Kami Sep 16 '21 at 13:08

2 Answers


There is a similar question here: Log file to Pandas Dataframe

You can use read_csv with the separator \s*\[.
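
A minimal sketch of that approach, assuming a placeholder file name of logs.txt. Note that a regex separator needs the python parser engine, and that \s*\[ expects bracket-delimited fields, which the sample lines above don't actually have, so the separator would need adjusting for this format:

import pandas as pd

# sketch only: 'logs.txt' is a placeholder file name; a regex separator
# requires engine='python', and \s*\[ assumes bracket-delimited fields
df = pd.read_csv('logs.txt', sep=r'\s*\[', engine='python', header=None)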


So apparently I'm still bad at asking the right questions on this website, but thankfully a bit better at finding solutions by myself, so if somebody else has the same problem, this is what I did:

import re
import gzip

import pandas as pd

# dataframe that will collect one row per REPORT line
df = pd.DataFrame(columns=['timestamp', 'request_id', 'billed_duration',
                           'max_memory_used', 'init_duration'])
counter = 0

for file in file_list:  # file_list holds the paths to the gzipped log files
    # open and read the compressed file
    with gzip.open(file, 'rb') as f:
        file_content = f.read().decode('utf-8')

    # split the file into lines and keep only the REPORT lines,
    # which carry the duration and memory metrics
    for line in file_content.splitlines():
        if re.search('REPORT', line):
            tokens = line.split()

            timestamp = tokens[0]
            request_id = tokens[3]        # 'id' would shadow the builtin
            billed_duration = tokens[9]   # 'Billed Duration: <n> ms'
            max_memory_used = tokens[18]  # 'Max Memory Used: <n> MB'
            # 'Init Duration' only appears on cold-start lines
            init_duration = tokens[22] if len(tokens) > 22 else None

            # pack it into the dataframe
            df.loc[counter] = [timestamp, request_id, billed_duration,
                               max_memory_used, init_duration]
            counter += 1
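
For anyone who wants to go the RegEx route the question originally hinted at, here is a sketch that avoids hard-coded token positions (indices like tokens[22] break on lines without an Init Duration field). The pattern is an assumption based on the REPORT line shown in the question:

import re

# assumed pattern for the REPORT layout shown above; named groups keep
# the extraction readable and tolerant of variable whitespace
report_re = re.compile(
    r'(?P<timestamp>\S+)\s+REPORT\s+RequestId:\s+(?P<request_id>\S+)\s+'
    r'Duration:\s+(?P<duration>[\d.]+)\s+ms\s+'
    r'Billed Duration:\s+(?P<billed_duration>\d+)\s+ms\s+'
    r'Memory Size:\s+(?P<memory_size>\d+)\s+MB\s+'
    r'Max Memory Used:\s+(?P<max_memory_used>\d+)\s+MB'
)

line = ('2021-04-01T12:54:42.356Z REPORT RequestId: 123  '
        'Duration: 4194.14 ms    Billed Duration: 4195 ms    '
        'Memory Size: 2048 MB    Max Memory Used: 608 MB')

match = report_re.search(line)
if match:
    print(match.groupdict())

Either way, the parsed values are strings, so for the anomaly detection the numeric columns still need a pass through pd.to_numeric.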