I'm very new to Python and pandas and I would like to parse a very big text file (200-500 MB) which has the information displayed as Records, with each record containing data fields as rows/lines rather than columns, and such Records having different number of lines (as they provide different type of data). An example below:
....
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:47:33.360000
attribute4: 731001-1 (0x0b277911)
attribute5: 50000 (0x000000c350)
attribute6: 3934 (0x0000000f5e)
attribute7: 857 (0x0000000359)
Record Type = Record_Name
attribute1 = 3099 (0x01227)
attribute2 = attribute_name3099 (type: type_name2)
timestamp: 21:49:07.359000
attribute4 = 731001-3 (0x0b277911)
attribute8: 14 (0x0000000e)
attribute9: 17 (0x00000011)
attribute10: 43 (0x0000002b)
attribute11: 40 (0x00000028)
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:51:17.142000
attribute4: 942101-2 (0x0b345872)
attribute5: 2490368 (0x00260000)
attribute6: 24 (0x00000018)
attribute7: 25 (0x00000019)
Record Type = Record_Name
attribute1 = 3102 (0x01230)
attribute2 = attribute_name3102 (type: type_name1)
timestamp: 21:52:42.359000
attribute4: 731001-2 (0x0b277911)
attribute12: 0 (0x0000000000)
attribute13: 80 (0x0000000050)
....
Thus, for post-processing, I would like to extract the desired data into a couple of pandas dataframes, one to summarize the records, and the other to filter for a particular type of record.
DATAFRAME 1 SUMMARY TABLE OF RECORDS: create a dataframe indexed by timestamp showing only a couple of attributes of each record, all keeping string format BUT without the hex values within ():
attribute1 attribute2 type attribute4
timestamp
...
21:47:33.360000 3090 attribute_name3090 type_name1 731001-1
21:49:07.359000 3099 attribute_name3099 type_name2 731001-3
21:51:17.142000 3090 attribute_name3090 type_name1 942101-2
21:52:42.359000 3102 attribute_name3102 type_name1 731001-2
....
DATAFRAME 2 FILTERED RECORDS: capture only records associated with attribute1 = 3090 (of type = typename1) and create the following dataframe indexed by timestamp, with the attributes5-7 without the the hex values within () and converted from string to integers:
attribute4 attribute5 attribute6 attribute7
timestamp
...
21:47:33.360000 731001-1 5000 3934 857
21:51:17.142000 942101-2 2490368 24 25
....
I tried opening the file and read lines but it takes a large amount of time. I read about "generators" but couldn't figure out how to ue them to simplify the code. Your suggestions would be greatly appreciated. Thanks in advance.
Gus