I want to remove the newline before each </script> in my HTML file using a Linux command-line tool (sed, awk, ...).

Sample input:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> 
        </script>
        <script type="text/javascript" src="script2.js" language="javascript"> 
        </script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> 
        </script>
    </body>
</html>

Sample output:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> </script>
        <script type="text/javascript" src="script2.js" language="javascript"> </script>
        <script>
        // script kodumuz buraya yazılacak</script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> </script>
    </body>
</html>

I tried several different commands, but none of them worked.

zx485
unh
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jul 18 '18 at 17:14
  • Please avoid *"Give me the codez"* questions. Instead show the script you are working on and state where the problem is. Also see [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/608639) – jww Jul 18 '18 at 17:18

2 Answers

First of all, as mentioned in the comments: don't parse XML with regex! Never do it, never think about it. Make it a habit not to think about it! Sometimes it might look like a simple task that can be performed with sed or awk or any other regex tool, but no ...

What you can do, on the other hand, if you really want to use sed or awk, is to process the file first with xmlstarlet and convert it into the PYX format.

The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. (see ESIS - ISO 8879 Element Structure Information Set spec, ISO/IEC JTC1/SC18/WG8 N931 (ESIS))

So what you really want to do is something like:

$ xmlstarlet pyx file.html | do_your_magic_here | xmlstarlet depyx > file.new.html

In your case this would be something like:

$ xmlstarlet pyx file.html \
  | awk 'c~/^- *\\n *$/&&/^)script$/{c=$0;next}{print c; c=$0}END{print c}' \
  | xmlstarlet depyx

This will output:

<html>
    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
        <title>JavaScript Ders 2</title>
        <script language="javascript" src="script1.js" type="text/javascript"></script>
        <script language="javascript" src="script2.js" type="text/javascript"></script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script language="javascript" src="script3.js" type="text/javascript"></script>
    </body>
</html>
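
To see what the awk step in the pipeline above is doing without needing xmlstarlet, here is a minimal sketch that runs essentially the same filter over a small hand-written PYX fragment. The fragment is an assumption about what the PYX for a `<script>` element roughly looks like (real output usually also carries the indentation after each `\n`), and the function names `pyx_sample` and `strip_ws_before_script_close` are made up for illustration. Compared with the one-liner above, the closing parenthesis is escaped and an `NR>1` guard avoids printing an empty first line.

```shell
#!/bin/sh
# Hand-written PYX approximating part of the sample input (an assumption,
# not captured xmlstarlet output).
# PYX line prefixes: "(" open tag, "A" attribute, "-" text, ")" close tag.
# A newline inside text is written as the two characters \n.
pyx_sample() {
cat <<'EOF'
(head
-\n
(script
Asrc script1.js
- \n
)script
-\n
(script
-\n    // code here\n
)script
-\n
)head
EOF
}

# Hold one line back in c. If the held line is whitespace-only text
# containing \n and the current line closes a script element, replace the
# held line with the current one (dropping the text). Otherwise print the
# held line and hold the current one.
strip_ws_before_script_close() {
  awk 'c~/^- *\\n *$/&&/^\)script$/{c=$0;next}NR>1{print c}{c=$0}END{print c}'
}

pyx_sample | strip_ws_before_script_close
```

Only the whitespace-only text line directly before the first `)script` is dropped. The second script element keeps its text because that line also contains the comment, which is why the third `<script>` block in the output above is unchanged.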
kvantour

This might work for you (GNU sed):

sed 'N;s/\n[[:space:]]*\(<\/script>\)/\1/;P;D' file

Keep a window of two lines throughout the file and, if the second line consists of optional leading whitespace followed by </script>, remove the preceding newline together with that whitespace.
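
As a rough illustration of the two-line window, the sketch below runs it on a tiny stand-in input. The helper names `sample` and `join_script_close` are hypothetical. It uses `$!N` rather than `N` so that it does not depend on GNU behaviour when the last line is reached, and its pattern allows indentation before </script>, since the closing tags in the question's sample are indented.

```shell
#!/bin/sh
# Small stand-in for the HTML file; the closing tags are indented,
# as in the sample input from the question.
sample() {
  printf '%s\n' \
    '<head>' \
    '    <script src="a.js"> ' \
    '    </script>' \
    '    <script>' \
    '        // code' \
    '    </script>' \
    '</head>'
}

# $!N  append the next line unless we are already on the last one
# s    drop a newline (plus any indentation) directly before </script>
# P;D  print and delete the first line of the window, then loop
join_script_close() {
  sed '$!N;s/\n[[:space:]]*\(<\/script>\)/\1/;P;D'
}

sample | join_script_close
```

Each </script> ends up at the end of the line before it, and every other line is passed through unchanged.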

potong