
I know how to extract data from a .txt file if it has a certain format (columns with certain spacing) using numpy.loadtxt, but I'm currently facing a problem that is a bit more complicated. Let's say I have data of the following format:

*** model xy ***    
    date: 11.14.18                         gate time: 190 sec
    enviroment Ug=    483 counts        time: 09:19:55
    enviroment Ug=    777 counts        time: 09:21:55
    enviroment Ug=    854 counts        time: 09:53:55
                          .
                          .
                          .

The relevant information for me is the counts and the gate time. I know I can use open("some txt file", "r") to read in a txt file, but I don't know how to remove the useless information from each line.

Sito
  • Possible duplicate of [How to efficiently parse fixed width files?](https://stackoverflow.com/questions/4914008/how-to-efficiently-parse-fixed-width-files) – cha0site Nov 14 '18 at 14:23
  • Is gate time only in one line? or all of the times are gate times too? – Ahmad Khan Nov 14 '18 at 14:29
  • @MuhammadAhmad gate time is only in the first line, the other times are the moments when the measurement was finished and for me irrelevant. – Sito Nov 14 '18 at 14:30

3 Answers


You need to read the txt file line by line; you can use readlines() for that purpose. For each line starting from the second row you can split the string:

"enviroment Ug=    483 counts        time: 09:19:55".split()

This results in:

['enviroment', 'Ug=', '483', 'counts', 'time:', '09:19:55']

You can access elements [2] and [-1] to get the information you need.


Try using pandas for this:

Assuming your file is a fixed-width file with the 1st record as the header, you can do the following:

In [1961]: df = pd.read_fwf('t.txt')

In [1962]: df
Out[1962]: 
   date: 11.14.18  Unnamed: 1 Unnamed: 2  gate time: 190  sec
0  enviroment Ug=         483     counts  time: 09:19:55  NaN
1  enviroment Ug=         777     counts  time: 09:21:55  NaN
2  enviroment Ug=         854     counts  time: 09:53:55  NaN

In [1963]: df.columns
Out[1963]: 
Index([u'date: 11.14.18', u'Unnamed: 1', u'Unnamed: 2', u'gate time: 190',
       u'sec'],
      dtype='object')

# The above gives you the column names.
# You can see in `df` that the counts values and gate_time values lie in individual columns.

So, just extract those columns from the dataframe (df):

In [1967]: df[['Unnamed: 1', 'gate time: 190']]
Out[1967]: 
   Unnamed: 1  gate time: 190
0         483  time: 09:19:55
1         777  time: 09:21:55
2         854  time: 09:53:55

Now, you can write the above to a csv file:

In [1968]: df.to_csv('/home/mayankp/Desktop/tt.csv', header=False, index=False, columns=['Unnamed: 1', 'gate time: 190'])

This approach basically saves you from using for loops and complex regex.

Mayank Porwal

You can simply read all of the text from the file at once, and find the required data with a regex:

import re
with open("some txt file", "r") as fin:
    all_text = fin.read()

    # Find the gate time
    gate_time_r = re.compile(r'gate\s+time:\s+(\d+)', re.IGNORECASE)
    gate_time = int(gate_time_r.search(all_text).groups()[0])

    # Find the counts
    counts_r = re.compile(r'enviroment\s+ug=\s+(\d+)', re.IGNORECASE)
    counts_list = list(map(int, counts_r.findall(all_text)))

Gate time regex: gate\s+time:\s+(\d+) matches a pattern where a number comes after the string gate time:, and captures that number in a group. You can run this regex with gate_time_r.search(all_text); it will find a match, and you can pick its first group.

Counts regex: enviroment\s+ug=\s+(\d+). It matches a pattern where a number comes after enviroment ug= (note the file's spelling), and captures that number in a group.

As there is more than one match in the all_text string for this pattern, you can use findall to get all of the matches.

findall returns a list of the captured groups, so it will be the list of the actual counts as strings. Simply cast them to int if you want numbers, as the map(int, ...) call above does.

Ahmad Khan