
I'm parsing date/time/measurement info out of some text files that look similar to this:

[Sun Jul 15 09:05:56.724 2018] *000129.32347
[Sun Jul 15 09:05:57.722 2018] *000129.32352
[Sun Jul 15 09:05:58.721 2018] *000129.32342
[Sun Jul 15 09:05:59.719 2018] *000129.32338
[Sun Jul 15 09:06:00.733 2018] *000129.32338
[Sun Jul 15 09:06:01.732 2018] *000129.32352

The results go into an output file like this:

07-15-2018 09:05:56.724, 29.32347
07-15-2018 09:05:57.722, 29.32352
07-15-2018 09:05:58.721, 29.32342
07-15-2018 09:05:59.719, 29.32338
07-15-2018 09:06:00.733, 29.32338
07-15-2018 09:06:01.732, 29.32352

The code that I'm using looks like this:

import os
import datetime

with open('dq_barorun_20180715_calibtest.log', 'r') as fh, open('output.txt', 'w') as fh2:
    for line in fh:
        line = line.split()
        monthalpha = line[1]
        month = datetime.datetime.strptime(monthalpha, '%b').strftime('%m')
        day = line[2]
        time = line[3]
        yearbracket = line[4]
        year = yearbracket[0:4]
        pressfull = line[5]
        press = pressfull[5:13]
        timestamp = month+"-"+day+"-"+year+" "+time
        fh2.write(timestamp + ", " + press + "\n")

This code works fine and accomplishes what I need, but I'm trying to learn more efficient methods of parsing files in Python. It takes about 30 seconds to process a 100MB file, and I have several files that are 1-2GB in size. Is there a faster way to parse through this file?

Jason Dunn

6 Answers


You can declare a months dict to avoid using the datetime module, which should be a bit faster:

months = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

You can also use unpacking to make your code much simpler:

for line in fh:
    _, month, day, time, year, last = line.split()
    # split() drops the trailing newline, so it has to be added back
    res = months[month] + "-" + day + "-" + year[:4] + " " + time + ", " + last[5:] + "\n"
    fh2.write(res)

P.S. timeit shows that it's around 10 times faster.
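
A minimal sketch of how such a comparison could be reproduced with timeit; the sample data below (100,000 in-memory copies of one log line) is a hypothetical stand-in for a real file:

import timeit

months = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

# Hypothetical sample: 100,000 copies of one log line
lines = ["[Sun Jul 15 09:05:56.724 2018] *000129.32347\n"] * 100000

def parse_dict():
    out = []
    for line in lines:
        _, month, day, time, year, last = line.split()
        out.append(months[month] + "-" + day + "-" + year[:4] + " " + time + ", " + last[5:] + "\n")
    return out

print(timeit.timeit(parse_dict, number=10))  # seconds for 10 passes over the sample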

Olvin Roght

You could use pandas DataFrames for quick loading and manipulation of large data:

import pandas as pd
import datetime

df = pd.read_csv('dq_barorun_20180715_calibtest.log', header=None, sep=' ')
df[0] = df.apply(lambda x: x[0][1:], axis=1)        # strip the leading '[' from the day name
df[1] = df.apply(lambda x: datetime.datetime.strptime(x[1], '%b').strftime('%m'), axis=1)  # month name -> number
df[4] = df.apply(lambda x: x[4][:-1], axis=1)       # strip the trailing ']' from the year
df[5] = df.apply(lambda x: ' ' + x[5][5:], axis=1)  # drop the '*0001' prefix from the measurement
df['timestamp'] = df.apply(lambda x: x[1] + "-" + str(x[2]) + "-" + x[4] + " " + x[3], axis=1)
df.to_csv('output.txt', columns=['timestamp', 5], header=False, index=False)
    You shouldn't use `apply`. It is very slow. I don't think it will be faster than OP. Check [this link](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code/54432584#54432584) – Inyoung Kim 김인영 Nov 24 '20 at 07:30
  • I used it with no visible time delay many times before with large datasets (over 5GB of data). – cristelaru vitian Nov 24 '20 at 07:40
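
For comparison, here is a sketch of the same transformation using vectorized .str operations instead of apply; it assumes the same sep=' ' column layout as above and avoids the per-row Python function calls, though it has not been benchmarked here:

import pandas as pd

df = pd.read_csv('dq_barorun_20180715_calibtest.log', header=None, sep=' ')

# Whole-column string operations instead of row-wise apply
month = pd.to_datetime(df[1], format='%b').dt.strftime('%m')   # month name -> number
timestamp = month + '-' + df[2].astype(str) + '-' + df[4].str[:-1] + ' ' + df[3]
value = ' ' + df[5].str[5:]                                    # drop the '*0001' prefix

pd.DataFrame({'timestamp': timestamp, 'val': value}).to_csv('output.txt', header=False, index=False)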

As you have fixed positions and a simple conversion from month name to number, this should work:

#! /usr/bin/env python3

m = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
     'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

with open('dq_barorun_20180715_calibtest.log', 'r') as fh, open('output.txt', 'w') as fh2:
    for line in fh:
        # Fixed column offsets in lines like "[Sun Jul 15 09:05:56.724 2018] *000129.32347"
        day = line[9:11]
        month = m[line[5:8]]
        year = line[25:29]
        time = line[12:24]
        val = line[36:44]
        print('{}-{}-{} {}, {}'.format(month, day, year, time, val), file=fh2)
Diego Torres Milano

Here is another approach using pandas read_csv. If you have many big files, you can also use dask, which supports the same interface.

import pandas as pd
import datetime

df = pd.read_csv('D:\\test.txt', sep=r'\*0001', header=None, engine='python')  # header=None keeps the first line as data

df.columns = ['dates','val']
df.dates = pd.to_datetime(df.dates.str[1:-2])
df.to_csv("output.csv",header=None,index=None)

You can then use different methods to convert the dates to your desired format.
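
For example, to match the format in the question, a line like the following (placed before the to_csv call) should work; note that %f emits six digits, so the last three are trimmed to keep milliseconds:

df.dates = df.dates.dt.strftime('%m-%d-%Y %H:%M:%S.%f').str[:-3]  # e.g. 07-15-2018 09:05:56.724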

wheezay

I have a completely different idea, though it doesn't speed up the parsing itself.
If this isn't what you want, since you asked for faster parsing, please leave a comment and I'll delete this answer.

You could split your large input file into smaller segments. For example, you could get the number of lines and divide it by a suitable number. Say there are 40,000 lines; then you can divide the work into 4 segments of 10,000 lines each. Make a note of a starting index and how many lines each segment should handle. (Note that the last segment could end up smaller than 10,000 lines.)

Then pass the input file to several workers, each of which reads only its assigned part, from start index to offset, and parses it. Each worker writes its parsed part into a shared folder under an indexed filename, and after every worker has finished you can merge all files in the shared folder into one output.txt.
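
Here is a sketch of that idea using multiprocessing rather than threads (CPython threads would not speed up this CPU-bound parsing because of the GIL); the byte-offset chunking, worker count, and part-file names are illustrative assumptions:

import multiprocessing as mp
import os

months = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

def parse_chunk(args):
    path, start, end, index = args
    with open(path, 'rb') as fh, open('part_%04d.txt' % index, 'w') as out:
        if start:
            fh.seek(start - 1)
            fh.readline()              # consume up to the first full line in this chunk
        while fh.tell() < end:         # lines starting before `end` belong to this chunk
            raw = fh.readline()
            if not raw:
                break
            _, month, day, time, year, last = raw.decode('ascii').split()
            out.write(months[month] + "-" + day + "-" + year[:4] + " " + time + ", " + last[5:] + "\n")

if __name__ == '__main__':
    path = 'dq_barorun_20180715_calibtest.log'
    n = 4                              # number of segments / worker processes
    size = os.path.getsize(path)
    bounds = [size * i // n for i in range(n + 1)]
    jobs = [(path, bounds[i], bounds[i + 1], i) for i in range(n)]
    with mp.Pool(n) as pool:
        pool.map(parse_chunk, jobs)
    with open('output.txt', 'w') as merged:   # merge the indexed part files in order
        for i in range(n):
            with open('part_%04d.txt' % i) as part:
                merged.write(part.read())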



Shinigami

To avoid loading a very large file into memory all at once, this code uses a Python generator (yield) together with regular expressions. For actual performance, you can compare it against the code you wrote.

The following code has been run locally!

import datetime
import re

# File path to be cleaned
CONTENT_PATH = './test03.txt'
# File path of cleaning results
RESULT_PATH = './test03.log'


def read(file):
    with open(file) as obj:
        while True:
            line = obj.readline()
            if line:
                yield line
            else:
                return


def tsplit(line):
    _m, _d, _y = line[5:8], line[9:11], line[25:29]
    _m = datetime.datetime.strptime(_m, '%b').strftime('%m')
    rt = "%s-%s-%s" % (_m, _d, _y)
    return rt


hour_min_sen = re.compile(r'(\d{2}:\d{2}:\d{2}\.\d{3})')  # dot escaped, three millisecond digits
end = re.compile(r'(\d{2}\.\d{5})')                       # trailing measurement, e.g. 29.32347


with open(RESULT_PATH, 'a+') as obj:
    for line in read(CONTENT_PATH):
        line = line.strip()
        """
        [Sun Jul 15 09:06:01.732 2018] *000129.32352
        """
        group = hour_min_sen.findall(line)
        end_group = end.findall(line)
        """
        07-15-2018 09:05:56.724, 29.32347
        """
        obj.write("%s %s, %s\n" % (tsplit(line), group[0], end_group[0]))
shiwei.du
  • Python's open doesn't read the file, it just opens a file handle with the provided access mode. If you're iterating over a file using a `for` loop, Python doesn't read the entire file into memory; it lets you consume the file line by line. So your generator function does exactly what Python does by default. – Olvin Roght Nov 24 '20 at 08:35
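
A minimal sketch of the simplification this comment suggests: the generator wrapper can be reduced to the file handle itself.

def read(file):
    # A file handle is already a lazy line iterator, so this is equivalent
    with open(file) as obj:
        yield from obj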