
I am very new to Python and know a little about data structures and retrieving data, so I was wondering if someone could point me in the right direction. I have a text file with a structure like this (with 17000 or so data points). It's not XML (although it looks similar); the data looks exactly like this:

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

I would like to extract all the x values and y values from this text (using Python) and write a .csv file with two separate columns: one for the x values and one for the y values.

How would I begin to construct this code? Would I have an easier time with regex? Or should I try to somehow convert this into proper XML to make it easier to drop into Excel?

Any pointers are appreciated!!

  • Is the text EXACTLY like this? It looks similar to XML but without proper closing tags. Is this actually XML? – dawg Jun 18 '20 at 15:20
  • I have copied it EXACTLY the way it looks like, yes. The data is embedded in XML-like markup constructs, but isn't exactly XML. – jayjaytapir Jun 18 '20 at 15:31

4 Answers


Here's something that is not really efficient, as it reads the whole file into memory and iterates over the lines twice, but it's hopefully easy to understand — and 17000 points handled this way should be doable on any modern machine:

f = open('myfile.txt','r') #Open the file
lines = f.readlines() #Read the lines into a list in memory
x_vals = [line.replace('< xval >','').strip() for line in lines if line.find('xval') != -1]
y_vals = [line.replace('< yval >','').strip() for line in lines if line.find('yval') != -1]
csv = open('myfile.csv','w') #Open the output CSV file for writing
csv.writelines(["%s,%s\n" % x for x in zip(x_vals,y_vals)]) #Unpack each tuple into a CSV line (the line ending may need tweaking for other OSs)
csv.close()

Lines 3 and 4 iterate over all the lines searching for < xval > and < yval > using find(), which returns -1 if the substring isn't found; the tag text is then replaced with an empty string, and strip() removes the remaining whitespace.

Line 6 builds a list of tuples from the lists created on lines 3 and 4 by joining them with the zip() function, and unpacks each tuple with the % operator into a comma-separated line. You may need a different line ending (e.g. \r\n if generating the file on Linux but opening it on Windows), and perhaps a different format specifier, such as %f for floating-point values.
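As a side note (my addition, not part of the original answer): the standard-library csv module takes care of delimiters and line endings for you, so the final write step could also be sketched as:

```python
import csv

# x_vals and y_vals as built by the list comprehensions above
x_vals = ['150.000097', '150.000245']
y_vals = ['0.000000', '0.000000']

with open('myfile.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(zip(x_vals, y_vals))  # one "x,y" row per data point
```

Passing newline='' lets the csv module control line endings itself, which avoids the cross-OS issue mentioned above.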

Hope this helps!

Vinko Vrsalovic
  • Thanks for the response. I used this particular code, but the output is: 150.0000970.00000,150.000245 and so on. I only require the floats, with the x and y values each in a separate column. This helped a lot though, thank you. – jayjaytapir Jun 19 '20 at 18:19
  • On my test data it worked :-) Anyway, glad it helped! – Vinko Vrsalovic Jun 22 '20 at 11:38

We read the sampledata file line by line through a file object. Whenever a line contains xval or yval, we parse it to extract the value.

Extracting values:

First, split the line on '>' (s.split('>')), which gives:

['< xval ', ' 150.000097 < xval ', '']

Then split the second element on '<' (s.split('>')[1].split('<')) to get:

[' 150.000097 ', ' xval ']

Finally, take the first element of that and remove leading/trailing whitespace (s.split('>')[1].split('<')[0].strip()) to get:

150.000097

We finally write the values we extracted to a csv file.

Note: I have made sure to write the most readable version of the code. This can certainly be made a lot more compact if needed.
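For instance, here is a more compact (if less readable) sketch of the same idea using list comprehensions, shown on an inline sample string so it runs on its own:

```python
sample = """< begin >
    < datapt > 1 < datapt >
    < xval > 150.000097 < xval >
    < yval > 0.000000 < yval >
< end >"""

def get_val(s):
    # text between the first '>' and the next '<'
    return s.split('>')[1].split('<')[0].strip()

lines = sample.splitlines()
x_list = [get_val(line) for line in lines if 'xval' in line]
y_list = [get_val(line) for line in lines if 'yval' in line]
print(x_list, y_list)  # ['150.000097'] ['0.000000']
```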

Code:

sampledata.txt

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

file_object = open("./sampledata.txt", 'r')

def getValFromLine(s):
    return s.split('>')[1].split('<')[0].strip()

x_list = []
y_list = []
for line in file_object:
    if 'xval' in line:
        x_list.append(getValFromLine(line))

    if 'yval' in line:
        y_list.append(getValFromLine(line))

print(x_list)
print(y_list)

# writing to csv file
csv = open('myfile.csv', 'w')
for i in range(min(len(x_list), len(y_list))):
    csv.write(f'{x_list[i]},{y_list[i]}\n')
csv.close()

Result:

['150.000097', '150.000245']
['0.000000', '0.000000']

myfile.csv

150.000097,0.000000
150.000245,0.000000
curious_guy

The easiest way to parse this would be to grab all the x and y values and then zip them together.

Of course, there may be an issue where there is an x-value but no corresponding y-value. In that case, you will need to track when a group of values "begins" and "ends".

#!/usr/bin/env python

import re

SAMPLE_TEXT = '''
< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >
'''

num_tag_pattern = lambda key: re.compile(r'<\s*' + key + r'\s*>\s*(\d[.\d]+)\s*<\s*' + key + r'\s*>')

x_pat, y_pat = num_tag_pattern('xval'), num_tag_pattern('yval')

def parse_points(text):
    x_vals, y_vals = [], []

    for line in text.strip().split('\n'):
        parse_float(line.strip(), x_pat, x_vals)
        parse_float(line.strip(), y_pat, y_vals)

    return list(zip(x_vals, y_vals))

def parse_float(line, pattern, values):
    m = pattern.match(line.strip())
    if m is not None:
        values.append(float(m[1]))

if __name__ == '__main__':
    print(parse_points(SAMPLE_TEXT))  # [(150.000097, 0.0), (150.000245, 0.0)]

Update

Here is a more generic, safer version that parses the source data into a list of records (dicts) before converting them to points.

It also raises exceptions if any tags are missing.

main.py

#!/usr/bin/env python

import re

TAG_BEGIN = re.compile(r'<\s*begin\s*>', re.IGNORECASE)
TAG_END   = re.compile(r'<\s*end\s*>', re.IGNORECASE)
TAG_DATA  = re.compile(r'<\s*(\w+)\s*>\s*(.+)\s*<\s*\1\s*>')

def parse_to_json(fname, strict=True):
    records, curr_record = [], None
    with open(fname) as fp: 
        for line in fp:
            m = TAG_BEGIN.match(line.strip())
            if m is not None:
                if curr_record is not None:
                    raise Exception('previous object was not closed, missing <END> tag')
                curr_record = {}
                continue
            m = TAG_END.match(line.strip())
            if m is not None:
                if curr_record is None:
                    raise Exception('missing <BEGIN> tag')
                records.append(curr_record)
                curr_record = None
                continue
            m = TAG_DATA.match(line.strip())
            if m is not None:
                if curr_record is None:
                    raise Exception('previous object closed, missing <BEGIN> tag')
                curr_record[m[1]] = m[2]
    if strict:
        if curr_record is not None:
            raise Exception('reached EOF, missing <END> tag')
    else:
        if curr_record is not None:
            records.append(curr_record) # Just add what we got...
    return records

def map_to_point(record, x_field, y_field):
    return (
        float(record[x_field]) if x_field in record else 0.0,
        float(record[y_field]) if y_field in record else 0.0
    )

def parse_points(fname, x_field='x', y_field='y', strict=True):
    return list(map(lambda record: map_to_point(record, x_field, y_field), parse_to_json(fname, strict)))

if __name__ == '__main__':
    print(parse_points('data.txt', x_field='xval', y_field='yval', strict=True))

data.txt

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >
Mr. Polywhirl

Here is a fairly generic regex solution.

It relies on breaking the file into blocks (delimited by < begin > and < end >), and then parsing the key/value pairs in each block into a mapping.

Here is the total solution:

import re   

blk_reg = re.compile(r'^<[ \t]*begin[ \t]*>[^<]*([\s\S]*?)^<[ \t]*end[ \t]*>', flags=re.M)
data_reg = re.compile(r'<[ \t]*([^ \t>]*)[ \t]*>[ \t]*(\S+)[ \t]*<[ \t]*(?:\1)[ \t]*>')

with open(your_file) as f:
    for blk in blk_reg.finditer(f.read()):
        data = {m.group(1): m.group(2) for m in data_reg.finditer(blk.group(1))}
        print(data)

Prints:

{'datapt': '1', 'xunit': 'mass', 'xval': '150.000097', 'yunit': 'abs', 'yval': '0.000000'}
{'datapt': '2', 'xunit': 'mass', 'xval': '150.000245', 'yunit': 'abs', 'yval': '0.000000'}

To create a CSV file, just use the individual elements of the dictionary produced from each block of tags.
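For example (my sketch, not part of the original answer — field names taken from the sample data), csv.DictWriter can pick the xval/yval columns straight out of each block's dictionary:

```python
import csv

# dicts as produced by the block-parsing loop above
rows = [
    {'datapt': '1', 'xunit': 'mass', 'xval': '150.000097', 'yunit': 'abs', 'yval': '0.000000'},
    {'datapt': '2', 'xunit': 'mass', 'xval': '150.000245', 'yunit': 'abs', 'yval': '0.000000'},
]

with open('points.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['xval', 'yval'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)  # extra keys are ignored; only xval and yval are written
```

extrasaction='ignore' tells DictWriter to silently drop the datapt/xunit/yunit keys instead of raising a ValueError.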

dawg