
I am very new to Python and know a little about data structures and retrieving data, so I was wondering if someone could point me in the right direction. I have a text file with a structure like this (with 17000 or so data points). It's not XML (although it looks similar); the data looks exactly like this:

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

I would like to extract all the x values and y values from this text (using Python) and write a .csv file with two separate columns: one for the x values and one for the y values.

How would I begin to construct this code? Would I have an easier time with regex? Or should I try to somehow convert this into proper XML to make it easier to drop into Excel?

Any pointers are appreciated!!

  • Is the text EXACTLY like this? It looks similar to XML but without proper closing tags. Is this actually XML? – dawg Jun 18 '20 at 15:20
  • I have copied it EXACTLY the way it looks like, yes. The data is embedded in XML-like markup constructs, but isn't exactly XML. – jayjaytapir Jun 18 '20 at 15:31

4 Answers


Here's something that is not really efficient, as it reads the whole file into memory and iterates over the lines twice, but it's hopefully easy to understand — and 17000 points handled this way should be doable on any modern machine:

f = open('myfile.txt','r') #Open the file
lines = f.readlines() #Read the lines into a list in memory
x_vals = [line.replace('< xval >','').strip() for line in lines if line.find('xval') != -1]
y_vals = [line.replace('< yval >','').strip() for line in lines if line.find('yval') != -1]
csv = open('myfile.csv','w') #Open the output CSV file for writing
csv.writelines(["%s,%s\n" % x for x in zip(x_vals,y_vals)]) #Unpack each tuple into a CSV line (the line ending may need tweaking for other OSs)
csv.close()

Lines 3 and 4 iterate over all the lines searching for < xval > and < yval > using find(), which returns -1 if the substring isn't found; the tag text is then replaced with an empty string, and strip() removes the remaining whitespace.

Line 6 builds a list of tuples from the lists created on lines 3 and 4 by joining them with the zip() function, and unpacks each tuple with the % operator into a comma-separated line. You may need a different line ending (e.g. \r\n if generating the file on Linux but opening it on Windows), and perhaps a different format specifier, such as %f for floating-point values.
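As a side note (my addition, not part of the original answer): the standard-library csv module takes care of delimiters and line endings for you, so the final write step could also be sketched as:

```python
import csv

# x_vals and y_vals as built by the list comprehensions above
x_vals = ['150.000097', '150.000245']
y_vals = ['0.000000', '0.000000']

with open('myfile.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(zip(x_vals, y_vals))  # one "x,y" row per data point
```

Passing newline='' lets the csv module control line endings itself, which avoids the cross-OS issue mentioned above.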

Hope this helps!

Vinko Vrsalovic
  • Thanks for the response. I used this particular code, but the output is: 150.0000970.00000,150.000245 and so on. I only require the floats, with the x and y values each in a separate column. This helped a lot though, thank you. – jayjaytapir Jun 19 '20 at 18:19
  • On my test data it worked :-) Anyway, glad it helped! – Vinko Vrsalovic Jun 22 '20 at 11:38

We read the sampledata file line by line through a file object. Whenever a line contains xval or yval, we parse it to extract the value.

Extracting values:

First, split the line on '>' (s.split('>')), which gives:

['< xval ', ' 150.000097 < xval ', '']

Then split the second element on '<' (s.split('>')[1].split('<')) to get:

[' 150.000097 ', ' xval ']

Finally, take the first element of that and remove leading/trailing whitespace (s.split('>')[1].split('<')[0].strip()) to get:

150.000097

We finally write the values we extracted to a csv file.

Note: I have made sure to write the most readable version of the code. This can certainly be made a lot more compact if needed.
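For instance, here is a more compact (if less readable) sketch of the same idea using list comprehensions, shown on an inline sample string so it runs on its own:

```python
sample = """< begin >
    < datapt > 1 < datapt >
    < xval > 150.000097 < xval >
    < yval > 0.000000 < yval >
< end >"""

def get_val(s):
    # text between the first '>' and the next '<'
    return s.split('>')[1].split('<')[0].strip()

lines = sample.splitlines()
x_list = [get_val(line) for line in lines if 'xval' in line]
y_list = [get_val(line) for line in lines if 'yval' in line]
print(x_list, y_list)  # ['150.000097'] ['0.000000']
```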

Code:

sampledata.txt

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

file_object = open("./sampledata.txt", 'r')

def getValFromLine(s):
    return s.split('>')[1].split('<')[0].strip()

x_list = []
y_list = []
for line in file_object:
    if 'xval' in line:
        x_list.append(getValFromLine(line))

    if 'yval' in line:
        y_list.append(getValFromLine(line))

print(x_list)
print(y_list)

# writing to csv file
csv = open('myfile.csv', 'w')
for i in range(min(len(x_list), len(y_list))):
    csv.write(f'{x_list[i]},{y_list[i]}\n')
csv.close()

Result:

['150.000097', '150.000245']
['0.000000', '0.000000']

myfile.csv

150.000097,0.000000
150.000245,0.000000
curious_guy

The easiest way to parse this would be to grab all the x and y values and then zip them together.

Of course, there may be an issue where there is an x-value but no corresponding y-value. In that case, you will need to track when a group of values "begins" and "ends".

#!/usr/bin/env python

import re

SAMPLE_TEXT = '''
< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >
'''

num_tag_pattern = lambda key: re.compile(r'<\s*' + key + r'\s*>\s*(\d[.\d]+)\s*<\s*' + key + r'\s*>')

x_pat, y_pat = num_tag_pattern('xval'), num_tag_pattern('yval')

def parse_points(text):
    x_vals, y_vals = [], []

    for line in text.strip().split('\n'):
        parse_float(line.strip(), x_pat, x_vals)
        parse_float(line.strip(), y_pat, y_vals)

    return list(zip(x_vals, y_vals))

def parse_float(line, pattern, values):
    m = pattern.match(line.strip())
    if m is not None:
        values.append(float(m[1]))

if __name__ == '__main__':
    print(parse_points(SAMPLE_TEXT))  # [(150.000097, 0.0), (150.000245, 0.0)]

Update

Here is a more generic, safer version that parses the source data into a list of records (dicts) before converting them to points.

It also raises exceptions if any tags are missing.

main.py

#!/usr/bin/env python

import re

TAG_BEGIN = re.compile(r'<\s*begin\s*>', re.IGNORECASE)
TAG_END   = re.compile(r'<\s*end\s*>', re.IGNORECASE)
TAG_DATA  = re.compile(r'<\s*(\w+)\s*>\s*(.+)\s*<\s*\1\s*>')

def parse_to_json(fname, strict=True):
    records, curr_record = [], None
    with open(fname) as fp: 
        for line in fp:
            m = TAG_BEGIN.match(line.strip())
            if m is not None:
                if curr_record is not None:
                    raise Exception('previous object was not closed, missing <END> tag')
                curr_record = {}
                continue
            m = TAG_END.match(line.strip())
            if m is not None:
                if curr_record is None:
                    raise Exception('missing <BEGIN> tag')
                records.append(curr_record)
                curr_record = None
                continue
            m = TAG_DATA.match(line.strip())
            if m is not None:
                if curr_record is None:
                    raise Exception('previous object closed, missing <BEGIN> tag')
                curr_record[m[1]] = m[2]
    if strict:
        if curr_record is not None:
            raise Exception('reached EOF, missing <END> tag')
    else:
        if curr_record is not None:
            records.append(curr_record) # Just add what we got...
    return records

def map_to_point(record, x_field, y_field):
    return (
        float(record[x_field]) if x_field in record else 0.0,
        float(record[y_field]) if y_field in record else 0.0
    )

def parse_points(fname, x_field='x', y_field='y', strict=True):
    return list(map(lambda record: map_to_point(record, x_field, y_field), parse_to_json(fname, strict)))

if __name__ == '__main__':
    print(parse_points('data.txt', x_field='xval', y_field='yval', strict=True))

data.txt

< begin >
    < datapt > 1 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000097 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >

< begin >
    < datapt > 2 < datapt >
    < xunit > mass < xunit >
    < xval > 150.000245 < xval >
    < yunit > abs < yunit >
    < yval > 0.000000 < yval >
< end >
Mr. Polywhirl

Here is a fairly generic regex solution.

It relies on breaking the file into blocks (delimited by < begin > and < end >), and then parsing the key/value pairs in each block into a mapping.

Here is the total solution:

import re   

blk_reg = re.compile(r'^<[ \t]*begin[ \t]*>[^<]*([\s\S]*?)^<[ \t]*end[ \t]*>', flags=re.M)
data_reg = re.compile(r'<[ \t]*([^ \t>]*)[ \t]*>[ \t]*(\S+)[ \t]*<[ \t]*(?:\1)[ \t]*>')

with open(your_file) as f:
    for blk in blk_reg.finditer(f.read()):
        data = {m.group(1): m.group(2) for m in data_reg.finditer(blk.group(1))}
        print(data)

Prints:

{'datapt': '1', 'xunit': 'mass', 'xval': '150.000097', 'yunit': 'abs', 'yval': '0.000000'}
{'datapt': '2', 'xunit': 'mass', 'xval': '150.000245', 'yunit': 'abs', 'yval': '0.000000'}

To create a CSV file, just use the individual elements of the dictionary produced from each block of tags.
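For example (my sketch, not part of the original answer — field names taken from the sample data), csv.DictWriter can pick the xval/yval columns straight out of each block's dictionary:

```python
import csv

# dicts as produced by the block-parsing loop above
rows = [
    {'datapt': '1', 'xunit': 'mass', 'xval': '150.000097', 'yunit': 'abs', 'yval': '0.000000'},
    {'datapt': '2', 'xunit': 'mass', 'xval': '150.000245', 'yunit': 'abs', 'yval': '0.000000'},
]

with open('points.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['xval', 'yval'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)  # extra keys are ignored; only xval and yval are written
```

extrasaction='ignore' tells DictWriter to silently drop the datapt/xunit/yunit keys instead of raising a ValueError.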

dawg