Remove extra lines and spaces from xml file using shell script

Question

I have an XML file with a lot of data in it, but the content of some tags is split across several lines instead of sitting on one line. I need to fix this with a shell script.

Input

<lineid>Product 
testing machine 
</lineid>

Expected Output

<lineid>Product testing machine </lineid>

I have added the extra line breaks to the input above deliberately, because otherwise the input shows up looking the same as the output.

The input data is not on a single line and I want it on a single line. I also want the changes to be made in the same file.

The input you've shared does not look like valid XML to me. Also, please share your attempts to resolve the problem — Nico Haase, Aug 28 '23 at 14:10
Yes, I don't know why the tags are not showing after posting; they are there in the actual post — Amey K, Aug 28 '23 at 14:11
Do you need to output XML? Or just extracting `Product testing machine ` would be enough? — Fravadona, Aug 28 '23 at 14:33

score 1 · Answer 1 · edited Aug 30 '23 at 09:34


This should put everything on one line and remove extra spaces. It expects a filename as its argument. If you save this script as formatter.sh and the input file as input.txt, you would call it as:

./formatter.sh input.txt

The output gets saved to the same file, so make sure to try it on a copy!

#!/bin/bash

input_file="$1"  # Path to the input file, passed as the first argument

if [ -f "$input_file" ]; then
    input=$(cat "$input_file")
    # Delete newlines, strip trailing spaces, collapse runs of spaces
    formatted=$(echo "$input" | tr -d '\n' | sed -e 's/ *$//' -e 's/  */ /g')
    echo "$formatted" > "$input_file"
else
    echo "Input file not found: $input_file" >&2
    exit 1
fi
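
One caveat worth illustrating: `tr -d '\n'` removes line breaks without putting anything in their place, so the script above relies on every line ending in a space (as the lines in the question do). The following is a minimal sketch of the difference, not part of the original answer; the function names `join_delete` and `join_space` are hypothetical and chosen only for illustration.

```shell
#!/bin/bash

# Same approach as the script above: newlines are deleted outright.
join_delete() {
    tr -d '\n' | sed -e 's/ *$//' -e 's/  */ /g'
}

# Variant: each newline becomes a space first, then runs of spaces
# are collapsed and trailing spaces are stripped.
join_space() {
    tr '\n' ' ' | sed -e 's/  */ /g' -e 's/ *$//'
}

# Lines without trailing spaces get fused by the first function
# but stay separated with the second one.
printf 'Product\ntesting machine\n' | join_delete; echo
printf 'Product\ntesting machine\n' | join_space; echo
```

If the lines in your file do not all end in whitespace, the second variant is probably the safer choice.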

answered Aug 28 '23 at 14:40 by hermanoff · edited Aug 30 '23 at 09:34 by Adrian Mole

Tried but not working, it is only giving me single line as output – Amey K Aug 28 '23 at 15:32
This script turns multi-line text into a one-line text. I don't understand then...what exactly does this script have to do? – hermanoff Aug 28 '23 at 17:11
it only gave me single tag in output and other xml tags are gone – Amey K Aug 29 '23 at 07:34

theSparky · Answer 2 · 2023-08-30T17:59:45.813

As I understand your request, tags of simple XML can be condensed with something like this:

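#!/bin/bash
# Note: 'sed -i' and the '\r' / '\n' escapes below assume GNU sed.

if [ $# -lt 1 ]; then echo "no file provided" >&2; exit 1; fi
xml_input="$1"
if [ ! -r "${xml_input}" ]; then echo "file not readable" >&2; exit 1; fi
# Use only the base name so an input path containing '/' still gives a valid template
xml_temp="$(mktemp "/tmp/$(basename "${xml_input}").XXXXXXXXX")" || exit 1

tr '\n' ' ' < "${xml_input}" > "${xml_temp}"    # join all lines
sed -i 's/\r/ /g' "${xml_temp}"                 # carriage returns become spaces
sed -i 's/  */ /g' "${xml_temp}"                # collapse runs of spaces
sed -i 's/?> /?>/g' "${xml_temp}"               # no space after the XML declaration
sed -i 's/?>/?>\n/g' "${xml_temp}"              # declaration on its own line
sed -i 's/> </>\n</g' "${xml_temp}"             # one element per line
mv "${xml_temp}" "${xml_input}"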

which will convert:

<?xml version="1.0" encoding="UTF-8"?><root>

<lineid>
     Product  
     testing machine  
     
     </lineid>
                    <lineid>Product testing machine

                    </lineid>
    </root>

to:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<lineid> Product testing machine </lineid>
<lineid>Product testing machine </lineid>
</root>

but a proper shell script to do that for all XML cases would be huge, or just a caller for an actual parser written in another language. There are a lot of good explanations:

https://stackoverflow.com/a/8577108/1919793

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

and many text editors will do this a lot better for you:

How do I format XML in Notepad++?
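
The pipeline from this answer can also be written as a filter that reads standard input and writes standard output, which makes it easy to try without touching the original file. This is only a sketch under stated assumptions: the function name `condense_xml` is hypothetical, GNU sed is assumed (for `\n` in the replacement text), and, like the script above, it only copes with simple XML and knows nothing about comments, CDATA, or whitespace-sensitive content.

```shell
#!/bin/bash

# condense_xml: hypothetical helper, not part of the original answer.
# Reads XML-like text on stdin and writes the condensed form to stdout.
condense_xml() {
    # 1. Turn carriage returns and newlines into spaces.
    # 2. Collapse runs of spaces, then trim the ends of the joined text.
    # 3. Put the XML declaration and each "> <" boundary on its own line.
    tr '\r\n' '  ' \
        | sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//' \
              -e 's/?> */?>\n/g' -e 's/> </>\n</g'
    echo    # the joined text has no trailing newline, so add one
}

# The input from the question, fed through the filter.
printf '<lineid>Product \ntesting machine \n</lineid>\n' | condense_xml
```

To apply it to a file, write to a temporary file first and then move it over the original, for example `condense_xml < input.xml > input.xml.tmp && mv input.xml.tmp input.xml`.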
