Remove newline before a match - Linux

Question

I want to remove the newline before the </script> in my HTML file with a Linux command (sed, awk...).

Sample input:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> 
        </script>
        <script type="text/javascript" src="script2.js" language="javascript"> 
        </script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> 
        </script>
    </body>
</html>

Sample output:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> </script>
        <script type="text/javascript" src="script2.js" language="javascript"> </script>
        <script>
        // script kodumuz buraya yazılacak</script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> </script>
    </body>
</html>

I tried several different approaches, but none of them worked.

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Jul 18 '18 at 17:14
Please avoid *"Give me the codez"* questions. Instead show the script you are working on and state where the problem is. Also see [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/608639) — jww, Jul 18 '18 at 17:18

score 1 · Accepted Answer · answered Jul 19 '18 at 15:27

First of all, as mentioned in the comments: don't parse XML with regex! Never do it, and make it a habit not even to consider it. A task may look simple enough to handle with sed, awk, or any other regex-based tool, but it usually is not.

What you can do instead, if you really want to use sed or awk, is process the file with xmlstarlet first and convert it into the PYX format.

The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. (see ESIS - ISO 8879 Element Structure Information Set spec, ISO/IEC JTC1/SC18/WG8 N931 (ESIS))

So what you really want to do is something like:

$ xmlstarlet pyx file.html | do_your_magic_here | xmlstarlet depyx > file.new.html

In your case this would be something like:

$ xmlstarlet pyx file.html \
  | awk 'c~/^- *\\n *$/&&/^\)script$/{c=$0;next}NR>1{print c}{c=$0}END{print c}' \
  | xmlstarlet depyx

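This will output the following. The first, second, and fourth script elements are collapsed onto one line, while the third is left unchanged because the text before its closing tag is not whitespace-only: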

<html>
    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
        <title>JavaScript Ders 2</title>
        <script language="javascript" src="script1.js" type="text/javascript"></script>
        <script language="javascript" src="script2.js" type="text/javascript"></script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script language="javascript" src="script3.js" type="text/javascript"></script>
    </body>
</html>
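
To see what the awk filter in the pipeline above does to the PYX stream, here is a minimal sketch that runs the same whitespace-dropping logic on a few hand-written PYX lines, so it does not need xmlstarlet. The PYX lines are illustrative and typed by hand, not captured from xmlstarlet, and the function names `pyx_sample` and `drop_ws_before_script_close` are made up for this example. The `NR > 1` guard is a small addition that avoids printing the initially empty buffer.

```shell
#!/bin/sh
# PYX line types: "(name" start tag, "Aname value" attribute,
# "-text" character data (newlines written as a literal \n), ")name" end tag.

# Hand-written PYX for: <script src="script1.js"> NEWLINE INDENT </script>
pyx_sample() {
  printf '%s\n' \
    '(script' \
    'Asrc script1.js' \
    '- \n        ' \
    ')script'
}

# Buffer one line in c. If the buffered line is whitespace-only text
# containing a newline and the current line closes a script element,
# discard the buffered line; otherwise print it and buffer the current one.
drop_ws_before_script_close() {
  awk 'c ~ /^- *\\n *$/ && /^\)script$/ { c = $0; next }
       NR > 1 { print c }
       { c = $0 }
       END { if (NR) print c }'
}

pyx_sample | drop_ws_before_script_close
```

The whitespace-only text line between the attribute and the end tag is dropped, which is why depyx then writes the opening and closing tags on one line.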

score -2 · Answer 2 · answered Jul 18 '18 at 17:51

This might work for you (GNU sed):

sed 'N;s/\n[[:space:]]*\(<\/script>\)/\1/;P;D' file

This keeps a window of two lines throughout the file. If the second line begins with </script> (optionally indented, as in the sample input), the preceding newline and that indentation are removed.
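
The two-line window can be tried on a small indented sample. This is a sketch assuming GNU sed; the sample markup and the function names `sample` and `join_script_close` are hypothetical. Note the `[[:space:]]*` in the pattern, which consumes any indentation in front of `</script>`; without it the substitution only matches a closing tag at the very start of a line.

```shell
#!/bin/sh
# Hypothetical sample that mirrors the question's layout (indented closing tag).
sample() {
  cat <<'EOF'
<head>
    <script src="script1.js">
    </script>
    <title>t</title>
</head>
EOF
}

# N appends the next line to the pattern space, s removes the newline plus
# indentation in front of </script>, P prints up to the first newline, and
# D deletes what was printed and restarts the cycle with the remainder.
join_script_close() {
  sed 'N;s/\n[[:space:]]*\(<\/script>\)/\1/;P;D'
}

sample | join_script_close
```

Like any line-based approach, this only handles a closing tag that sits on the line directly after the text it should join; it knows nothing about the HTML structure.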

potong