
I have a log file with more than 10,000 lines, some of which have this structure:

"2023-07-19 13:38:45,220 INFO    Type:type: ('rate',), {'value': 123, 'unit': 'Count/Second', 'id': 'ABC123', 'name': 'London'}\n"

I would like to extract this info and put it in a pandas.DataFrame.

This is my initial code:

import ast
import pandas as pd
import re

infile = "./log_file.log"
with open(infile) as f:
    lines = f.readlines()

df = pd.DataFrame(columns=["time", "type", "value", "unit", "id", "name"])

for line in lines:
    if "type" in line:
        value = ast.literal_eval(re.search('{(.*)}', line).group(0))
        value["time"] = line.split("INFO")[0][:-1]
        value["type"] = re.search(r"\((.*)\)", line).group(1)[:-1]
        df = df.append(value, ignore_index=True)

so as to get a dataframe like this:

                      time  type  value          unit      id    name
0  2023-07-19 13:38:45,220  rate    123  Count/Second  ABC123  London

but the for loop takes ages to go through the whole file.

Any suggestions on how to optimise it?

wjandrea
cicciodevoto

2 Answers


df.append is slow; so slow, in fact, that it was deprecated and then removed in newer versions of Pandas. (See also: Create a Pandas Dataframe by appending one row at a time, especially cs95's answer.)

So, put the rows into a list, then turn that into a dataframe as the last step.
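A minimal sketch of that pattern, reusing the parsing from your own code (the inline sample line stands in for reading the file; how quotes appear in the actual file is an assumption, so adjust the stripping if needed):

```python
import ast
import re

import pandas as pd

# Sample input in the format from the question; in practice these
# would come from iterating over the open file.
lines = [
    "2023-07-19 13:38:45,220 INFO    Type:type: ('rate',), "
    "{'value': 123, 'unit': 'Count/Second', 'id': 'ABC123', 'name': 'London'}\n",
]

rows = []
for line in lines:
    if "type" not in line:
        continue
    # Parse the {...} dict literal and the ('rate',) tuple out of the line
    data = ast.literal_eval(re.search(r"\{.*\}", line).group(0))
    data["time"] = line.split("INFO")[0].strip()
    data["type"] = ast.literal_eval(re.search(r"\(.*?\)", line).group(0))[0]
    rows.append(data)

# Build the DataFrame once, at the end, instead of appending row by row
df = pd.DataFrame(rows)
```

This avoids the quadratic cost of repeated `df.append` calls, since `pd.DataFrame(rows)` allocates the frame a single time.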


Or you could try having Pandas do the heavy lifting for you:

  1. If all the lines are in the same format, write a regex for it, put the lines in a Series, and use .str.extract() to expand the groups into columns.
  2. .apply(ast.literal_eval)
  3. pd.json_normalize

I also wouldn't bother saving any lines you're not going to use. That could save some space/time.

import pandas as pd
import ast

infile = "./log_file.log"

regex = r'^(?P<time>.*) INFO    Type:type: (?P<type>.*), (?P<dict>\{.*\})$'

with open(infile) as f:
    lines = [line for line in f if "type" in line]

df = pd.Series(lines).str.extract(regex)
df['type'] = df['type'].apply(ast.literal_eval).str[0]
# .pop() removes the 'dict' column and returns it, so the join is clean
df = df.join(df.pop('dict').apply(ast.literal_eval).pipe(pd.json_normalize))
                      time  type  value          unit      id    name
0  2023-07-19 13:38:45,220  rate    123  Count/Second  ABC123  London
wjandrea

I suggest precompiling the regexes. Then load the data into a list and construct the DataFrame as a final step. E.g.:

import re
import ast
import pandas as pd

infile = "your_file.log"

pat1 = re.compile(r'\{.*\}')
pat2 = re.compile(r'\(.*\)')

all_data = []
with open(infile, 'r') as f:
    for line in f:
        if 'type' not in line:
            continue

        v = ast.literal_eval(pat1.search(line).group(0))
        t = ast.literal_eval(pat2.search(line).group(0))[0]
        all_data.append({'time': line.split(' INFO ')[0].strip('"'), 'type': t, **v})

df = pd.DataFrame(all_data)
print(df)
Andrej Kesely