Using bash to clean up data formatting

Question

I'm new to bash scripting and need some help with a project I am working on. I am trying to use bash shell scripting to edit a .txt file with data from a database into a more useful format.

The data is currently in the following format (the file has several thousand pieces of data like this one):

DATA:|11.00000|000563784644|7031450|7031450||1.000000|1.000000|0.000000|0.000000|0.000000|21.000000|47.040000|60.480000|0.000000|0.000000|0.000000|0.000000|0.000000|0.000000|1.000000|100.000000

I would like to remove the "DATA:" prefix from each piece of data in the file, add appropriate date information, and reformat parts of the data to be in the following final format:

2017/01/27|0011|000563784644|7031450|7031450||1|1|0|0.00|0.00|21|47.04|60.48|0|0|0|0 |0.00|0.00|1|100

I have figured out how to iterate over each piece of data in the file like this:

    while read p; do
    ...
    done <peptides.txt

But I am struggling with how to modify parts of each 'piece' of data (in a sense, indexing each part by using the '|' as a delimiter).

Would it be best to write a program in C to set each data piece as an array and then work with it, or use bash commands to edit the data strings?

For quick-and-dirty one-time work, you can do that with `awk`. If it is part of an ongoing project that will eventually manipulate the data at a higher level than pure reformatting, then `python` or a similar scripting language might be better. `bash` virtuosos can do that in `bash`, but it’s usually much less efficient. Please specify what you prefer. — Dario, Mar 16 '18 at 17:53
@Dario I am more familiar with Python; however, this is just a one-time project, so I am looking to just work through this data set using bash. — Alex P, Mar 16 '18 at 18:02
Use this to split your line into an array: https://stackoverflow.com/questions/918886/how-do-i-split-a-string-on-a-delimiter-in-bash — Nic3500, Mar 16 '18 at 19:05
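
A minimal sketch of the array-splitting approach from the linked question, assuming bash (arrays and here-strings are not available in plain POSIX sh). `split_record` is a hypothetical helper name used only for illustration:

```shell
# Sketch: split one pipe-delimited record into a bash array and index into it.
# split_record is a hypothetical name, not something from the thread.
split_record() {
    local line=$1
    local -a fields
    # IFS is set only for this read, so the rest of the script is unaffected.
    IFS='|' read -r -a fields <<< "$line"
    printf 'field count: %d\n' "${#fields[@]}"
    printf 'second field: %s\n' "${fields[1]}"
}
```

Because `|` is a non-whitespace delimiter, the empty field between `||` is kept as an empty array element, so the positions of the later fields do not shift.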

Samit · Accepted Answer · 2018-03-16T20:20:06.397

You can use the script below to do this:

    while read -r line; do
        # Mark empty fields so they survive word splitting, then round each numeric field.
        for i in $(echo "$line" | sed "s/||/|empty|/g" | tr '|' '\n'); do
            if [[ $i =~ [0-9] ]]; then printf "%.2f\n" "$i"; else printf '%s\n' "$i"; fi
        done | tr '\n' '|' | sed "s/\.00//g" | sed "s/DATA:/$(date +%F)/g" | tr '-' '/' | sed "s/|empty|/||/g"
        printf "\n"
    done < input.txt > output.txt

I tested the script with input.txt as the input file and output.txt as the final output file.

The contents of the files are shown below:

input.txt

cat input.txt 
DATA:|11.00000|000563784644|7031450|7031450||1.000000|1.000000|0.000000|0.000000|0.000000|21.000000|47.040000|60.480000|0.000000|0.000000|0.000000|0.000000|0.000000|0.000000|1.000000|100.000000
DATA:|31.00000|0005784644|7031450|73333450||1.0340000|1.000000|0.03000|0.000000|0.020000|21.000000|47.040000|60.480000|0.000000|0.000000|0.000000|0.000000|0.000000|0.000000|1.000000|100.000000
DATA:|11.00000|000563784644|7031450|7031450||1.000000|1.000000|0.000000|0.000000|0.200000|21.000000|47.040000|60.480000|0.000000|0.000000|0.000000|0.000000|0.000000|0.000000|1.000000|100.000000
DATA:|11.00200|000563784644|7031450|7031420||1.010000|1.000000|0.000000|0.000000|0.000000|21.000000|47.040000|60.480000|0.000000|0.000000|0.000000|0.000000|0.000000|0.000000|1.000000|100.001000

output.txt

cat output.txt 
2018/03/17|11|563784644|7031450|7031450||1|1|0|0|0|21|47.04|60.48|0|0|0|0|0|0|1|100|
2018/03/17|31|5784644|7031450|73333450||1.03|1|0.03|0|0.02|21|47.04|60.48|0|0|0|0|0|0|1|100|
2018/03/17|11|563784644|7031450|7031450||1|1|0|0|0.20|21|47.04|60.48|0|0|0|0|0|0|1|100|
2018/03/17|11|563784644|7031450|7031420||1.01|1|0|0|0|21|47.04|60.48|0|0|0|0|0|0|1|100|
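
Note that the output above drops the leading zeros from the ID fields and leaves a trailing `|` on every line, which differs from the format requested in the question. As an alternative, here is a minimal `awk` sketch. The per-field rules are assumptions, since the question does not spell them out: field 2 is zero-padded to four digits, fields 3 to 5 are kept verbatim, and every later non-empty field is rounded to two decimals with a trailing `.00` dropped. It does not try to reproduce the mix of `0` and `0.00` in the question's sample. `reformat_records` and the `REPORT_DATE` override are hypothetical names:

```shell
# Sketch only: the field rules below are assumptions, not confirmed by the question.
# REPORT_DATE is a hypothetical override; without it, the current date is used.
reformat_records() {
    awk -F'|' -v OFS='|' -v today="${REPORT_DATE:-$(date +%Y/%m/%d)}" '
    {
        $1 = today                      # replace the "DATA:" prefix with the date
        $2 = sprintf("%04d", $2)        # 11.00000 -> 0011
        for (i = 6; i <= NF; i++) {     # fields 3-5 are left untouched
            if ($i == "") continue      # keep empty fields empty
            $i = sprintf("%.2f", $i)    # round to two decimals
            sub(/\.00$/, "", $i)        # 1.00 -> 1, while 47.04 stays 47.04
        }
        print
    }' "$@"
}
```

It could then be run as `reformat_records input.txt > output.txt`, or with `REPORT_DATE=2017/01/27` set in front to stamp a specific date.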

Hope this fulfills your requirement :)
