
I'm very new to Python and pandas, and I would like to parse a very big text file (200-500 MB). The information is organized as records, with each record containing its data fields as rows/lines rather than columns, and different records having different numbers of lines (as they provide different types of data). An example below:

....
Record Type  = Record_Name
  attribute1 = 3090 (0x01218)
  attribute2     = attribute_name3090 (type: type_name1)
  timestamp: 21:47:33.360000
  attribute4: 731001-1 (0x0b277911)
  attribute5: 50000 (0x000000c350)
  attribute6: 3934 (0x0000000f5e)
  attribute7: 857 (0x0000000359)
Record Type  = Record_Name  
  attribute1 = 3099 (0x01227)
  attribute2     = attribute_name3099 (type: type_name2)
  timestamp: 21:49:07.359000
  attribute4     = 731001-3 (0x0b277911)
  attribute8: 14 (0x0000000e)
  attribute9: 17 (0x00000011)
  attribute10: 43 (0x0000002b)
  attribute11: 40 (0x00000028)
Record Type  = Record_Name
  attribute1 = 3090 (0x01218)
  attribute2     = attribute_name3090 (type: type_name1)
  timestamp: 21:51:17.142000
  attribute4: 942101-2 (0x0b345872)
  attribute5: 2490368 (0x00260000)
  attribute6: 24 (0x00000018)
  attribute7: 25 (0x00000019)
Record Type  = Record_Name
  attribute1 = 3102 (0x01230)
  attribute2     = attribute_name3102 (type: type_name1)
  timestamp: 21:52:42.359000
  attribute4: 731001-2 (0x0b277911)
  attribute12: 0 (0x0000000000)
  attribute13: 80 (0x0000000050)
....

Thus, for post-processing, I would like to extract the desired data into a couple of pandas dataframes, one to summarize the records, and the other to filter for a particular type of record.

DATAFRAME 1, SUMMARY TABLE OF RECORDS: create a dataframe indexed by timestamp showing only a couple of attributes of each record, all kept as strings BUT without the hex values within parentheses (see the sketch after the table):

                  attribute1  attribute2          type         attribute4
timestamp
...
21:47:33.360000   3090        attribute_name3090  type_name1   731001-1
21:49:07.359000   3099        attribute_name3099  type_name2   731001-3
21:51:17.142000   3090        attribute_name3090  type_name1   942101-2
21:52:42.359000   3102        attribute_name3102  type_name1   731001-2
....
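
For reference, assuming the parsing step (the part I am stuck on) produced one dict per record, building this summary frame would look something like the sketch below; the records list and its contents are just made-up placeholders:

import pandas as pd

# Hypothetical result of the parsing step: one dict per record, with the
# hex values in parentheses already stripped off.
records = [
    {'timestamp': '21:47:33.360000', 'attribute1': '3090',
     'attribute2': 'attribute_name3090', 'type': 'type_name1',
     'attribute4': '731001-1'},
    {'timestamp': '21:49:07.359000', 'attribute1': '3099',
     'attribute2': 'attribute_name3099', 'type': 'type_name2',
     'attribute4': '731001-3'},
]

df1 = (pd.DataFrame(records)
         .set_index('timestamp')[['attribute1', 'attribute2', 'type', 'attribute4']])
print(df1)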

DATAFRAME 2, FILTERED RECORDS: capture only the records with attribute1 = 3090 (of type = type_name1) and create the following dataframe indexed by timestamp, with attributes 5-7 stripped of the hex values within parentheses and converted from string to integers (see the sketch after the table):

                  attribute4   attribute5   attribute6   attribute7   
timestamp
...
21:47:33.360000   731001-1     50000        3934         857
21:51:17.142000   942101-2     2490368      24           25
....
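
And for the integer conversion of attributes 5-7 I have something like this in mind (again a sketch with made-up data, assuming the filtered records are already available as strings):

import pandas as pd

# Hypothetical filtered records (attribute1 == 3090), all values still strings.
df2 = pd.DataFrame([
    {'timestamp': '21:47:33.360000', 'attribute4': '731001-1',
     'attribute5': '50000', 'attribute6': '3934', 'attribute7': '857'},
    {'timestamp': '21:51:17.142000', 'attribute4': '942101-2',
     'attribute5': '2490368', 'attribute6': '24', 'attribute7': '25'},
]).set_index('timestamp')

# attribute4 stays a string; attributes 5-7 become integers.
int_cols = ['attribute5', 'attribute6', 'attribute7']
df2[int_cols] = df2[int_cols].astype(int)
print(df2.dtypes)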

I tried opening the file and reading it line by line, but it takes a large amount of time. I read about "generators" but couldn't figure out how to use them to simplify the code. Your suggestions would be greatly appreciated. Thanks in advance.

Gus

  • Check this answer: http://stackoverflow.com/questions/13651117/pandas-filter-lines-on-load-in-read-csv – Ted Petrou Dec 16 '16 at 18:25
  • Hello Ted, thanks for the feedback. The original file is .txt (text) rather than csv, which makes it more challenging for me. I've loaded csv files in pandas and worked with them... in this particular case, I would like (1) to read the .txt, (2) parse and put the data into TWO pd.dataframes, and then (3) save the outputs into CSV. My problems are with (2). Regards, – Gus Whu Dec 16 '16 at 18:35
  • Please, take a look at this [topic](http://stackoverflow.com/questions/19632075/how-to-read-file-with-space-separated-values). – Jakub Jankowski Dec 16 '16 at 19:16

1 Answer


This code shows how to parse the input lines into two dictionaries in a simple way. It assumes that the input is very regular. The main action comes just after reading each line: the line is split on blanks, trailing colons are stripped from the resulting tokens (so 'timestamp:' becomes 'timestamp' while the time value keeps its internal colons), and any equal sign is dropped, all to simplify the subsequent processing based on the first item in each line.

If you run the code you'll find that it creates two text files, each of which consists of one dictionary per line.

with open('DF1.txt', 'w') as DF1, open('DF2.txt', 'w') as DF2:

    DF1_record = {}
    DF2_record = {}
    with open('bigText.txt') as bigText:
        for inputLine in bigText:
            # Split on whitespace and strip trailing colons from the tokens,
            # so 'timestamp:' becomes 'timestamp' while the time value keeps
            # its internal colons.
            inputLine = [_.rstrip(':') for _ in inputLine.strip().split()]
            if not inputLine:            # skip blank lines
                continue
            if '=' in inputLine:
                inputLine.remove('=')
            kind = inputLine[0]
            if kind == 'Record':
                # A new record starts: write out the previous one(s), if any,
                # and reset both accumulators.
                if DF1_record:
                    DF1.write(str(DF1_record) + '\n')
                if DF2_record:
                    DF2.write(str(DF2_record) + '\n')
                DF1_record = {}
                DF2_record = {}
                continue
            if kind in ['attribute1', 'attribute2', 'attribute4', 'timestamp']:
                DF1_record[kind] = inputLine[1]
                if kind == 'attribute2':
                    # '(type: type_name1)' -> keep the type name, drop the ')'
                    DF1_record['type'] = inputLine[3][:-1]
            if kind in ['attribute1', 'attribute4', 'attribute5',
                        'attribute6', 'attribute7', 'timestamp']:
                if DF1_record.get('attribute1') == '3090':
                    DF2_record[kind] = inputLine[1]

        # Write out the final record, which is not followed by another
        # 'Record' line.
        if DF1_record:
            DF1.write(str(DF1_record) + '\n')
        if DF2_record:
            DF2.write(str(DF2_record) + '\n')
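
Each line of DF1.txt and DF2.txt is a Python dict literal, so one possible way to turn them into the two requested dataframes (a sketch, not part of the parsing code above) is to read them back with ast.literal_eval:

import ast
import pandas as pd

def load_dicts(path):
    # Read a file that contains one Python dict literal per line.
    with open(path) as f:
        return [ast.literal_eval(line) for line in f if line.strip()]

df1 = pd.DataFrame(load_dicts('DF1.txt')).set_index('timestamp')

df2 = pd.DataFrame(load_dicts('DF2.txt')).set_index('timestamp')
for col in ['attribute5', 'attribute6', 'attribute7']:
    df2[col] = df2[col].astype(int)   # the question wants these as integers

Note also that iterating over the file object (for inputLine in bigText) already reads the file lazily, one line at a time, so an explicit generator is not needed to keep memory use low.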
Bill Bell