I have a file that stores data in the format below:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]

Data like this is stored in a text file. From it I am creating several files, one per roll number, each named after the ROLL value and holding the lines for that roll number. I am using regex in Python for this. The file is so large that I cannot load it into a list with readlines() (it gives a memory error), so I have to read it line by line. Here is the code I have written:

     import re
     from datetime import datetime
     from collections import defaultdict

     # Maps each roll number to {'TIME': [timestamp strings]}
     timer_for_roll_numbers = {}

     with open('Marksinfo.txt', 'r') as f:
         for line in f:
             ind = re.match(r'(.*)TIME\[(.*?)\](.*)\[ROLL:(.*?)\]', line, re.M | re.I)
             if ind is None:
                 continue  # skip lines that do not match the expected format
             # group(2) is the timestamp, group(4) is the roll number
             timer_for_roll_numbers.setdefault(int(ind.group(4)), defaultdict(list))['TIME'].append(ind.group(2))
             # Append the line to the file for this roll number
             with open('ROLL_{}.txt'.format(ind.group(4)), 'a') as p:
                 p.write(line)

The code above creates the files as I want, but I also want the lines in each file sorted by the timestamp in the data, and I have no idea how to do that. The code reads lines sequentially from the input file and writes them to the new file without checking whether they are in timestamp order.

The output I currently get, in the file ROLL_201987623.txt:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]

The desired output is:

TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]

Likewise, the lines for every roll number should be sorted in its respective file. Please suggest how to do this.

In my code I have also captured the timestamp, and I can convert it with Python's datetime library. For example, to get every timestamp for a particular roll number (say 201987623), I use:

time_for_particular_roll=timer_for_roll_numbers[201987623]['TIME']
dt = [datetime.strptime(s, '%m.%d_%H:%M:%S.%f') for s in time_for_particular_roll]

Each entry in dt holds the following fields, which I can access easily:

(4,26,12,30,30,853664)

What I cannot work out is how to write the lines for a particular roll number into its new file in sorted order.

Ankur

1 Answer

I would use sorting and itertools.groupby: once the lines are sorted by ROLL and timestamp, groupby can group them by ROLL. Here is the script I would use as a first approach:

import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

I would define three callables for filtering, sorting and grouping lines:

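def func1(arg) -> bool:
    # Filter: keep only lines that match the expected format.
    return regex.match(arg) is not None


def func2(arg) -> str:
    # Sort key: the timestamp string (first capture group).
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""


def func3(arg) -> int:
    # Sort and grouping key: the ROLL number (second capture group).
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0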

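Then loop over your input file.

First reject non-compliant lines. Sort the remaining lines by ROLL, then by timestamp within each ROLL (the two sorted calls below do this because Python's sort is stable), and finally group the lines by ROLL.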

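with open(your_input_file) as fr:  # your_input_file: path to the input file
    collection = filter(func1, fr)               # drop non-matching lines
    collection = sorted(collection, key=func2)   # secondary key: timestamp
    collection = sorted(collection, key=func3)   # primary key: ROLL (stable sort)
    for key, group in groupby(collection, key=func3):
        # One output file per ROLL, lines already in timestamp order
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)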

For your example, that snippet produces four files, each with its lines sorted by ascending timestamp.
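As a quick check of that claim, here is a minimal in-memory sketch that runs the same filter, sort and group steps on the five sample lines from the question. The helper names roll_and_stamp and group_sorted are hypothetical, and a single sort on a (ROLL, timestamp) tuple is swapped in for the two stable sorts; it is only an illustration, not a replacement for the script above.

```python
import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

# The five sample lines from the question, copied as-is.
SAMPLE = [
    "TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]\n",
    "TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]\n",
    "TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]\n",
    "TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]\n",
    "TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]\n",
]


def roll_and_stamp(line):
    """Hypothetical helper: return (ROLL as int, timestamp string) for a matching line."""
    match = regex.match(line)
    return int(match.group(2)), match.group(1)


def group_sorted(lines):
    """Return {roll: [lines in ascending timestamp order]} for the matching lines."""
    matched = [line for line in lines if regex.match(line)]
    # One sort on a composite key gives the same order as the two stable sorts.
    matched.sort(key=roll_and_stamp)
    return {
        roll: list(group)
        for roll, group in groupby(matched, key=lambda line: roll_and_stamp(line)[0])
    }


result = group_sorted(SAMPLE)
for roll, lines in result.items():
    print(roll, len(lines))
```

The dictionary has one entry per ROLL, which corresponds to one output file per ROLL in the script above.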

Of course, this relies on the timestamp format staying as it is: the timestamps are compared as strings, so putting the day first, for example, would break the ordering.
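If the format might change, or you prefer not to depend on string ordering, one option is to parse the timestamp into a datetime and sort on that. The sketch below is only a suggestion under some assumptions: parse_stamp is a hypothetical helper, the separator before the microseconds may be either "." or ":" (both appear in the sample data), and since the data carries no year, a placeholder year of 2000 is prepended purely so that parsing works.

```python
from datetime import datetime


def parse_stamp(stamp: str) -> datetime:
    """Parse 'MM.DD_HH:MM:SS.ffffff' (or ':ffffff') into a datetime.

    Assumption: the data has no year, so a placeholder year (2000) is used.
    """
    date_part, _, time_part = stamp.partition("_")
    # Normalise the separator before the microseconds to ':'.
    time_part = time_part.replace(".", ":")
    return datetime.strptime(
        f"2000.{date_part}_{time_part}", "%Y.%m.%d_%H:%M:%S:%f"
    )


stamps = ["04.26_12:30:30:853664", "03.27_12:29:30.553669"]
ordered = sorted(stamps, key=parse_stamp)
print(ordered)
```

To use it with the script above, func2 could return parse_stamp(match.group(1)) instead of the raw string; the fallback for non-matching lines would then need to be a datetime as well (for example datetime.min) so that the keys stay comparable.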

cestMoiBaliBalo
  • Thank you very much for answering my question. Actually, I want to sort by timestamp only. – Ankur Jul 02 '20 at 16:19
  • And why are you filtering, sir? – Ankur Jul 02 '20 at 16:31
  • If I already have the data stored, how should I do it, sir? – Ankur Jul 02 '20 at 16:31
  • And please help me: when I use this code to parse a 1.7 GB file, it gives a memory error. – Ankur Jul 02 '20 at 17:06
  • collection = sorted(collection, key=func2) MemoryError – Ankur Jul 02 '20 at 17:07
  • Please take a look at the following topic [link](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) for reading your file without any exception. To group by ROLL, the data must first be sorted by ROLL. Filtering is not mandatory, but it removes data that doesn't match the regex, so you get a clean data collection. – cestMoiBaliBalo Jul 03 '20 at 11:46
  • Take a look at this [page](https://docs.python.org/3.8/library/itertools.html#itertools.groupby) too if you are curious. – cestMoiBaliBalo Jul 03 '20 at 11:58
  • [Link](https://stackoverflow.com/questions/62702224/python-memory-error-when-reading-large-files-need-ideas-to-apply-mutiprocessin) This does the work, but it sorts chunk by chunk. I want to sort the file as a whole; what should I do now? – Ankur Jul 03 '20 at 12:11
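
For the MemoryError raised in the comments above, one possible workaround is a two-pass approach: first stream the big file line by line and append each line to its per-ROLL file (as the question's code already does), then sort each per-ROLL file on its own. This is a sketch, not a tested solution for a 1.7 GB input. It assumes that each per-ROLL file fits in memory, that the output directory starts empty (lines are appended), and that the timestamp strings sort correctly as text. The function names split_by_roll, sort_roll_file and split_and_sort are hypothetical.

```python
import re
from pathlib import Path

LINE_RE = re.compile(r"TIME\[([^]]+)\]ID\[ROLL:([^]]+)\]")


def split_by_roll(input_path, out_dir):
    """Pass 1: stream the input and append each matching line to its ROLL file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rolls = set()
    with open(input_path) as src:
        for line in src:  # only one line is held in memory at a time
            match = LINE_RE.search(line)
            if match is None:
                continue  # skip lines that do not match the expected format
            roll = match.group(2)
            rolls.add(roll)
            with open(out_dir / f"ROLL_{roll}.txt", "a") as dst:
                dst.write(line if line.endswith("\n") else line + "\n")
    return rolls


def sort_roll_file(path):
    """Pass 2: sort a single ROLL file in place by its timestamp string."""
    with open(path) as fh:
        lines = fh.readlines()  # assumption: one ROLL's lines fit in memory
    lines.sort(key=lambda line: LINE_RE.search(line).group(1))
    with open(path, "w") as fh:
        fh.writelines(lines)


def split_and_sort(input_path, out_dir):
    """Run both passes and return the set of ROLL values seen."""
    rolls = split_by_roll(input_path, out_dir)
    for roll in sorted(rolls):
        sort_roll_file(Path(out_dir) / f"ROLL_{roll}.txt")
    return rolls
```

A call such as split_and_sort("Marksinfo.txt", "out") would leave one sorted ROLL_<number>.txt per roll number in the out directory. Reopening the output file for every line is slow but keeps the number of open files at one; caching file handles would be faster, at the risk of hitting the operating system's open-file limit when there are many roll numbers.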