you would use pandas library , it is easy-to-use data structures and data analysis tools for the Python.
installation:
on python 2
pip install pandas
on python 3
pip3 install pandas
the code:
this code would read specific columns from your file into pandas dataframe and then apply regex to last column and then save the data to new file.
# importing pandas
import pandas as pd
# import re library
import re
# use read_csv method to read your data file
# delimiter='\t' used if your file is tsp (tsv separated values)
# or delim_whitespace=True if your file use multiple white spaces
# or delimiter=r"[ ]{2,}" to use only more than 2 spaces as your last column uses space inside its value, actually we use regex here.
# usecols=[0,1,3,6] to load those columns only
# optionaly give names to your columns if there is no header in your file names=['colA', 'colB')
df = pd.read_csv('yourfile.txt', delimiter=r"[ ]{2,}", usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD'])
# we make our regex pattern here. thanks to @Kristian
pattern = r"\[([^\|]+)"
# define a simple regex function that will called for every value in your last column. or we could supply lambda to pandas apple method.
def regex_func(value):
return re.findall(pattern, value)
# apply regex to last column values
df['colD'] = df['colD'].apply(regex_func)
# print the results
print(df)
# save your dataframe to new file
# index=false to save df without row names
# header=False to save df without columns names
# sep='\t' to make it tab separated values
df.to_csv('yournewfile.csv', sep='\t', index=False, header=False)
as you see with pandas , you could use only few lines of code, no loops etc. clean and easy to maintain.
test-drive the code:
i'm copy paste the content of output file:
Sun - exst ['STA']
Moon - exst ['SAT']
Mars + exst ['PLAN']
Venus + exst ['PLAN']
Uranus - exst ['UNK', 'SAT', 'BLA']
Mercury + exst ['UNK', 'PLAN']
links:
official pandas docs:
http://pandas.pydata.org/pandas-docs/stable/
pandas Tutorials:
https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
https://www.tutorialspoint.com/python_pandas
update:
i noticed your file not (tab seprated values). it uses multiple white spaces. first i thought i may use delim_whitespace=True in read_csv method
df = pd.read_csv('yourfile.txt', delim_whitespace=True, usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD')
This helps when you have more than one space as delimiter.
but as your last column use single space in its values, and will give unexpected results in the output, so the proper way to parse columns correctly included last column is to use regex with delimiter arg , delimiter=r"[ ]{2,}"
df = pd.read_csv('yourfile.txt', delimiter=r"[ ]{2,}", usecols=[0,1,3,6], names=['colA', 'colB', 'colC', 'colD'])
update2
i'm update the code in my answer to shows how it easy to apply regex to a column when using pandas
simple one line will apply a function to every value from your last column
df['colD'] = df['colD'].apply(regex_func)
i included a regex function in my code for readability ,but it also can be simple lambda call like this
df['colD'] = df['colD'].apply(lambda value: re.findall(r"\[([^\|]+)", value))