Parse a text file with Python

Question

I have to parse a document in a text file that is in columns like this one:

  Sun       -    S    exst    sun      s    [STA|X|Away]
  Moon      -    M    exst    moon     s    [SAT|X|Not away]
  Mars      +    M    exst    mars     p    [PLAN|X|Away]
  Venus     +    V    exst    venus    p    [PLAN|X|Away]
  Uranus    -    U    exst    uranus   u    [UNK|X|Away], [SAT|X|Away], [BLA|X|Away]
  Mercury   +    M    exst    mercury  u    [UNK|X|Away], [PLAN|X|Away]

The program has to produce a new file that looks like this:

Sun        -     exst    ['STA']
Moon       -     exst    ['SAT']
Mars       +     exst    ['PLAN']
Venus      +     exst    ['PLAN']
Uranus     -     exst    ['UNK', 'SAT', 'BLA']
Mercury    +     exst    ['UNK', 'PLAN']

The exercise has the purpose of learning how to use regular expressions.

I have searched the web for information about how to parse documents, but I cannot find anything that explains it well or fits my case, especially given how my data is laid out (in columns). If you could show me what the code should look like, explain the syntax for parsing, or point me to links that explain it, I would be very glad.

Thanks!

Seems to be neither, but rather a fixed column width file format. — Jan Christoph Terasa, Feb 17 '19 at 20:18
[Try this introduction into regular expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions). — Jan Christoph Terasa, Feb 17 '19 at 22:41

score 2 · Answer 1 · answered Feb 17 '19 at 22:27

Using regular expressions seems a bit awkward considering the input has a fixed record layout; nevertheless, the solution below uses regular expressions to do the transformation. Note that it is a two-step process: as far as I know, Python's re module keeps only the last match of a repeated capture group, so a single pattern cannot return the first element of every list in the last field of the record.
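To see why two steps are used, here is a minimal sketch. The sample string is the last field of the Uranus line from the question; the variable names are only illustrative. In Python's re module, a capture group inside a repetition reports only its final match, so a second pass with re.findall is one way to collect every first element.

```python
import re

last_field = "[UNK|X|Away], [SAT|X|Away], [BLA|X|Away]"

# One pattern that repeats over all bracketed lists.
# The inner capture group is overwritten on each repetition.
repeated = re.match(r"(?:\[([^|\]]+)[^\]]*\](?:,\s*)?)+", last_field)
only_last = repeated.group(1)

# A second pass with findall returns the first element of every list.
firsts = re.findall(r"\[([^|]+)", last_field)

print(only_last)
print(firsts)
```

The single pattern matches the whole field but exposes only the capture from the last repetition, which is why the answer matches the record first and then runs findall on the last field.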

Use record_re to identify each field in the input line. Then use the firsts regular expression to get the first element of each list found in the last field of the input line.

import sys
import re


class FixedTransform(object):
    # Raw strings keep backslashes such as \s and \[ intact for the regex engine.
    fields = [
            "",
            r"(?P<CELESTIAL_BODY>[^\s]+)",
            r"(?P<SIGN>[-+])",
            r"(?P<LETTER>.)",
            r"(?P<EXST>exst)",
            r"(?P<LOWER>[^\s]+)",
            r"(?P<TYPE>[^\s])",
            r"(?P<LIST>\[.*\])"
    ]

    record_re = re.compile(r"\s+".join(fields))
    firsts = r"\[([^\|]+)"

    def __init__(self, filein, fileout=sys.stdout):
        self.filein = filein
        self.fileout = fileout

    def raw_records(self):
        with open(self.filein, "r") as fin:
            for line in fin:
                # strip only the newline, so a final line without one stays intact
                yield line.rstrip("\n")

    def parsed_records(self):
        for line in self.raw_records():
            groups = self.record_re.match(line)
            if groups is not None:
                fields = groups.groupdict()
                last_group = fields.get("LIST")
                firstels = re.findall(self.firsts, last_group)
                fields["LIST"] = firstels
                yield fields

    def transform(self):
        fields_out = [
                "CELESTIAL_BODY",
                "SIGN",
                "EXST",
                "LIST"
        ]
        for doc in self.parsed_records():
            xform = {f: doc.get(f) for f in fields_out}
            yield xform

    def format_out(self, doc):
        return "{CELESTIAL_BODY:11s}{SIGN:6s}{EXST:8s}{LIST}".format(**doc)


if __name__ == "__main__":
    ft = FixedTransform("infile.txt")
    for doc in ft.transform():
        print(ft.format_out(doc))

I broke the regular expression into individual components for ease of reading and testing. This kept the expression in a manageable format and made it easy to update. As the fields are separated by whitespace, I simply combined the individual regular expressions using Python's str.join method before compiling the expression.
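As a small illustration of that joining step, the sketch below rebuilds the pattern from its parts and matches one line from the question. The parts follow the ones in the answer, written as raw strings with \S as shorthand for [^\s]; the names parts, joined and record are just for this example.

```python
import re

parts = [
    "",  # empty first item, so the joined pattern begins with \s+
    r"(?P<CELESTIAL_BODY>\S+)",
    r"(?P<SIGN>[-+])",
    r"(?P<LETTER>.)",
    r"(?P<EXST>exst)",
    r"(?P<LOWER>\S+)",
    r"(?P<TYPE>\S)",
    r"(?P<LIST>\[.*\])",
]

# Every field is separated from the next by one or more whitespace characters.
joined = r"\s+".join(parts)
record_re = re.compile(joined)

line = "  Mars      +    M    exst    mars     p    [PLAN|X|Away]"
record = record_re.match(line).groupdict()
print(record["CELESTIAL_BODY"], record["SIGN"], record["LIST"])

# Because of the empty first item, a line without leading whitespace does not match.
unindented = record_re.match(line.lstrip())
print(unindented)
```

One consequence of the empty first item is worth noting: the pattern expects each line to start with whitespace, as the sample in the question does. If the real file has no leading spaces, dropping that empty item would be one way to adapt it.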

Executing the code against the input presented in your question yields:

Sun        -     exst    ['STA']
Moon       -     exst    ['SAT']
Mars       +     exst    ['PLAN']
Venus      +     exst    ['PLAN']
Uranus     -     exst    ['UNK', 'SAT', 'BLA']
Mercury    +     exst    ['UNK', 'PLAN']
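The alignment in this output comes from the width specifiers in format_out. A minimal sketch with a hand-written record (the dictionary below is an illustrative stand-in for what transform() yields) shows how each width pads its field:

```python
record = {
    "CELESTIAL_BODY": "Sun",
    "SIGN": "-",
    "EXST": "exst",
    "LIST": ["STA"],
}

# 11s, 6s and 8s left-align each string and pad it with spaces to that width.
# LIST has no format spec, so the list is rendered with its default str() form.
template = "{CELESTIAL_BODY:11s}{SIGN:6s}{EXST:8s}{LIST}"
formatted = template.format(**record)
print(formatted)
```

The list brackets and quotes in the last column are therefore just Python's default rendering of a list, not something the regular expression produces.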

Sameh Farouk · Answer 2 · 2019-02-18T12:34:04.523

You could use the pandas library; it provides easy-to-use data structures and data analysis tools for Python.

Installation:

On Python 2:

pip install pandas

On Python 3:

pip3 install pandas

The code below reads specific columns from your file into a pandas DataFrame, applies a regex to the last column, and then saves the data to a new file.

# importing pandas
import pandas as pd

# import re library
import re

# use the read_csv method to read your data file
# delimiter='\t' is used if your file is TSV (tab-separated values)
# or delim_whitespace=True if your file uses runs of whitespace as the separator
# or delimiter=r"[ ]{2,}" to split only on two or more spaces, since your last column has single spaces inside its values; the delimiter is a regex here.
# usecols=[0,1,3,6] to load only those columns
# optionally give names to your columns if there is no header in your file: names=['colA', 'colB']
df = pd.read_csv('yourfile.txt', delimiter=r"[ ]{2,}", usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD'])


# we make our regex pattern here. thanks to @Kristian
pattern = r"\[([^\|]+)"

# define a simple regex function that will be called for every value in your last column. or we could supply a lambda to the pandas apply method.


def regex_func(value):
    return re.findall(pattern, value)


# apply regex to last column values
df['colD'] = df['colD'].apply(regex_func)

# print the results
print(df)

# save your dataframe to new file
# index=False to save df without row names
# header=False to save df without columns names
# sep='\t' to make it tab separated values
df.to_csv('yournewfile.csv', sep='\t', index=False, header=False)

As you can see, with pandas you need only a few lines of code and no explicit loops, which keeps the code clean and easy to maintain.

Test-driving the code:

Here is the content of the output file, copied and pasted:

Sun -   exst    ['STA']
Moon    -   exst    ['SAT']
Mars    +   exst    ['PLAN']
Venus   +   exst    ['PLAN']
Uranus  -   exst    ['UNK', 'SAT', 'BLA']
Mercury +   exst    ['UNK', 'PLAN']

links:

official pandas docs:

http://pandas.pydata.org/pandas-docs/stable/

pandas Tutorials:

https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

https://www.tutorialspoint.com/python_pandas

Update:

I noticed your file is not tab-separated; it uses multiple spaces. At first I thought I could use delim_whitespace=True in the read_csv method:

df = pd.read_csv('yourfile.txt', delim_whitespace=True, usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD'])

This helps when you have more than one space as the delimiter.

But your last column has single spaces inside its values, so that would give unexpected results in the output. The proper way to parse the columns correctly, including the last one, is to pass a regex as the delimiter argument: delimiter=r"[ ]{2,}"

df = pd.read_csv('yourfile.txt', delimiter=r"[ ]{2,}", usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD'])
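The difference between the two separators can be checked without pandas. The sketch below uses str.split and re.split from the standard library only to mimic how each separator cuts a line; it is not pandas itself, and it assumes the line has no leading spaces.

```python
import re

line = "Uranus    -    U    exst    uranus   u    [UNK|X|Away], [SAT|X|Away], [BLA|X|Away]"

# Splitting on any run of whitespace also cuts the last column apart,
# because its values contain single spaces after the commas.
by_any_whitespace = line.split()

# Splitting only on two or more spaces keeps the last column in one piece.
by_two_or_more_spaces = re.split(r"[ ]{2,}", line)

print(len(by_any_whitespace), by_any_whitespace[6:])
print(len(by_two_or_more_spaces), by_two_or_more_spaces[6])
```

If the lines in the real file start with spaces, a two-or-more-spaces separator would yield an empty first field, so the column indexes in usecols would need to shift by one.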

Update 2:

I updated the code in my answer to show how easy it is to apply a regex to a column when using pandas.

A single line applies a function to every value in your last column:

df['colD'] = df['colD'].apply(regex_func)

I included a regex function in my code for readability, but it can also be a simple lambda call like this:

df['colD'] = df['colD'].apply(lambda value: re.findall(r"\[([^\|]+)", value))

Thank you, but I understood that I have to use regular expressions for selecting the information that I want to keep, not just save it by column but by their regular expressions — , Feb 17 '19 at 19:56
I think you will need both the re module and the pandas package. Using re alone will complicate the code and make it harder to get the result you want; it would also be a poorer design, since the code would be harder to maintain and less flexible than with a library like pandas. — Sameh Farouk, Feb 17 '19 at 20:31
I updated my answer to show you how to use one line of code to apply a function to all values in a column without using loops. — Sameh Farouk, Feb 18 '19 at 06:09
I included a full example of how you could use pandas with regex to achieve exactly what you asked for in only about 7 lines of code. Reputation is appreciated, as I'm new to Stack Overflow and can't post comments anywhere yet. If that was the answer you were looking for, please don't forget to mark my answer as accepted. Thank you. — Sameh Farouk, Feb 18 '19 at 06:56
