find a pattern in html and replace it with php code

Question

I am looking at finding this pattern

<!-- Footer part at bottom of page-->
<div id="footer">
   <div class="row col-md-2 col-md-offset-5">

    <p class="text-muted">&copy; 2014. Core Team</p>
  </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>

and replacing it with this pattern for a number of .html files

<!-- Footer part at bottom of page-->
<div id="footer">
    <div class="row col-md-2 col-md-offset-5">
       <?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>
    </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>

Note the difference is that this

<p class="text-muted">&copy; 2014. Core Team</p>

is replaced with

       <?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>

I was looking at doing it with sed, but after an initial attempt my difficulty is knowing which characters I might or might not have to escape. I would also like the tabs and newlines in the PHP code to appear exactly as shown here.

There are a number of files to change, so I would like to automate it, although it might be quicker to just do it manually (copy and paste). Maybe sed is the wrong approach in this instance. Can someone kindly point me in the right direction? At this stage I am open to other languages (e.g. PHP, Python, bash) to find a solution.

I would then plan to rename each .html file to .php with the following:

for i in *.html; do mv "$i" "${i%.*}.php"; done;

EDIT1

Based on the awk answer below, I can get it to work under this version:

$ awk -Wversion 2>/dev/null || awk --version
GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2, GNU MP 6.0.0)
Copyright (C) 1989, 1991-2014 Free Software Foundation.

However, on this version I get different output. It seems to print out all 3 files: old, new and file. Is this easily rectified in this version?

root@4461f768e343:/github/find_pattern# awk -Wversion 2>/dev/null || awk --version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

root@4461f768e343:/github/find_pattern#
root@4461f768e343:/github/find_pattern#
root@4461f768e343:/github/find_pattern# awk -v RS='^$' -v ORS= 'ARGIND==1{old=$0;next} ARGIND==2{new=$0;next} s=index($0,old){ $0 = substr($0,1,s-1) new substr($0,s+length(old))} 1' old new file
<!-- Footer part at bottom of page-->
<div id="footer">
   <div class="row col-md-2 col-md-offset-5">

    <p class="text-muted">&copy; 2014. Core Team</p>
  </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div><!-- Footer part at bottom of page-->
<div id="footer">
    <div class="row col-md-2 col-md-offset-5">
       <?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>
    </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>some pile of text
or other
<!-- Footer part at bottom of page-->
<div id="footer">
   <div class="row col-md-2 col-md-offset-5">

    <p class="text-muted">&copy; 2014. Core Team</p>
  </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>
and more maybe.root@4461f768e343:/github/find_pattern#

I suggest using an XML/HTML parser (xmllint, xmlstarlet ...) because [only Chuck Norris can parse HTML with regex](http://stackoverflow.com/a/1732454/3776858). – Cyrus, Aug 11 '16 at 04:15

score 2 · Answer 1 · answered Aug 11 '16 at 04:17

You can use Python's str.replace.

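# -*- coding: utf-8 -*-
# (the declaration above is needed on Python 2 because of the non-ASCII © below)
html_files = ['a.html', ...]  # list the .html files to convert
copyright = '<p class="text-muted">&copy; 2014. Core Team</p>'
new_copyright = """       <?php
        $year = date("Y");
        echo "<p class='text-muted'>© $year. Core Team</p>";
    ?>"""
for html_file_path in html_files:
    # read the whole page into memory
    with open(html_file_path) as html_file:
        html = html_file.read()

    if copyright in html:
        # write the result to a new .php file, leaving the .html file untouched
        php_file_path = html_file_path.replace('.html', '.php')
        with open(php_file_path, "w") as php_file:
            php = html.replace(copyright, new_copyright)
            php_file.write(php)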

Note that this will not overwrite your HTML files, which is useful if the script has an error.
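To run the same replacement over every .html file in a directory rather than a hand-written list, something like the following might work. It is only a sketch, assuming Python 3 and UTF-8 encoded files; `convert_directory`, `OLD_COPYRIGHT` and `NEW_COPYRIGHT` are names made up for this illustration.

```python
from pathlib import Path

# The exact line to look for, and the PHP snippet that should replace it.
OLD_COPYRIGHT = '<p class="text-muted">&copy; 2014. Core Team</p>'
NEW_COPYRIGHT = """<?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>"""


def convert_directory(directory):
    """Write a .php copy of every .html file in `directory` that contains OLD_COPYRIGHT.

    Returns the list of .php paths written. The .html originals are not modified.
    """
    written = []
    for html_path in sorted(Path(directory).glob("*.html")):
        # Explicit encoding avoids depending on the platform default.
        html = html_path.read_text(encoding="utf-8")
        if OLD_COPYRIGHT not in html:
            continue  # nothing to replace in this file
        php_path = html_path.with_suffix(".php")
        php_path.write_text(html.replace(OLD_COPYRIGHT, NEW_COPYRIGHT), encoding="utf-8")
        written.append(php_path)
    return written
```

Calling `convert_directory(".")` would process the current directory. Files that do not contain the old line are skipped, so they would still need renaming separately if every page is meant to become .php.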

answered Aug 11 '16 at 04:17

Doron Cohen


tried that but got this ```$ python find_pattern.py File "find_pattern.py", line 5 SyntaxError: Non-ASCII character '\xc2' in file find_pattern.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details``` – HattrickNZ Aug 12 '16 at 00:08
Thanks. I had to add `# -*- coding: utf-8 -*-` at the top of the file. – HattrickNZ Aug 12 '16 at 02:48

Ed Morton · Answer 2 · 2016-08-12T03:20:23.093

2

sed is for simple substitutions on individual lines so your task is certainly not a job for sed. You could use awk if your files are all that well formatted:

$ cat old
<!-- Footer part at bottom of page-->
<div id="footer">
   <div class="row col-md-2 col-md-offset-5">

    <p class="text-muted">&copy; 2014. Core Team</p>
  </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>

.

$ cat new
<!-- Footer part at bottom of page-->
<div id="footer">
    <div class="row col-md-2 col-md-offset-5">
       <?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>
    </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>

.

$ cat file
some pile of text
or other
<!-- Footer part at bottom of page-->
<div id="footer">
   <div class="row col-md-2 col-md-offset-5">

    <p class="text-muted">&copy; 2014. Core Team</p>
  </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>
and more maybe.

.

$ awk -v RS='^$' -v ORS= 'ARGIND==1{old=$0;next} ARGIND==2{new=$0;next} s=index($0,old){ $0 = substr($0,1,s-1) new substr($0,s+length(old))} 1' old new file
some pile of text
or other
<!-- Footer part at bottom of page-->
<div id="footer">
    <div class="row col-md-2 col-md-offset-5">
       <?php
            $year = date("Y");
            echo "<p class='text-muted'>© $year. Core Team</p>";
        ?>
    </div>

    <div id="downloadlinks">
    <!-- downloadlinks go here-->
    </div>
</div>
and more maybe.

The above uses GNU awk for multi-char RS and ARGIND. If you want to do it for many files you could use:

find . -type f -name '*.php' -exec awk -i inplace -v RS='^$' -v ORS= 'ARGIND==1{old=$0;print;next} ARGIND==2{new=$0;print;next} s=index($0,old){ $0 = substr($0,1,s-1) new substr($0,s+length(old))} 1' old new {} \;

or similar.
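For readers who find the awk one-liner dense, its core step can be sketched in Python: read `old`, `new` and the target file whole, find the first exact occurrence of the old block, and splice the new block in its place. This is an illustration of the same idea, not a replacement for the command above; `replace_block` and `replace_in_file` are hypothetical names, and UTF-8 encoding is an assumption.

```python
def replace_block(text, old, new):
    """Replace the first exact occurrence of `old` in `text` with `new`.

    Same splice as the awk script: text before the match + new + text after it.
    """
    start = text.find(old)  # -1 when `old` is absent (awk's index() returns 0)
    if start == -1:
        return text  # no match: leave the text unchanged
    return text[:start] + new + text[start + len(old):]


def replace_in_file(old_path, new_path, target_path):
    """Apply replace_block to `target_path` in place, using the two template files."""
    with open(old_path, encoding="utf-8") as f:
        old = f.read()
    with open(new_path, encoding="utf-8") as f:
        new = f.read()
    with open(target_path, encoding="utf-8") as f:
        text = f.read()
    with open(target_path, "w", encoding="utf-8") as f:
        f.write(replace_block(text, old, new))
```

With the three files from this answer, `replace_in_file("old", "new", "file")` would rewrite `file` in place, so it is worth trying on a copy first.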

edited Aug 12 '16 at 03:20

answered Aug 11 '16 at 05:19

Ed Morton


That's a bit advanced awk for me, but thanks. I am currently looking further into RS, ORS and ARGIND [here](http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/?ref=binfind.com/web) and [here](http://www.delorie.com/gnu/docs/gawk/gawk_117.html) – HattrickNZ Aug 11 '16 at 21:35
1. For doing multiple files, do I just need the `old` and `new` files? – HattrickNZ Aug 11 '16 at 21:36
2. How sensitive is this? For instance, does it allow for spaces slightly off or other slight differences? – HattrickNZ Aug 11 '16 at 21:36
3. Could you give a brief explanation of how this works? – HattrickNZ Aug 11 '16 at 21:36
Can you see my EDIT1? I am getting different output with a different awk version; is this easily fixed? – HattrickNZ Aug 11 '16 at 23:45
Yes but why bother since you have gawk? I added an example anyway. – Ed Morton Aug 11 '16 at 23:55
I have gawk on Cygwin on my Windows PC, but I wanted to be able to do it in a container I have, which has the different version. Dealing with the different versions is painful for me, but it is something I would like to get a better handle on. Any advice is welcome. And is gawk better? Thanks – HattrickNZ Aug 12 '16 at 00:20
Yes, gawk has far more useful extensions, is extremely well documented and supported, and has a large user base. – Ed Morton Aug 12 '16 at 00:21
I can't get the example you added to work (it seems to just print the original file). I've tried changing the parentheses. Thanks – HattrickNZ Aug 12 '16 at 03:13
Yeah, it'd need more work: you need to build up the record one line at a time for every file, not just the last one. Or you could pick some control character that you know won't be in the file for the RS instead of `^$` and stick closer to the gawk code; some people rely on `RS='\0'` but YMMV. I'll delete that example as I'm not interested in trying to make it work in non-gawk given it's not as simple as I first thought. If you try it and have any specific questions, feel free to post a follow-up and I'll try to help you. – Ed Morton Aug 12 '16 at 03:19
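On the question above about sensitivity: `index()` is an exact substring match, so a footer whose indentation or line breaks differ from the `old` file would not be found. If that turns out to matter, one possible workaround is to build a regular expression from the old block that accepts any run of whitespace wherever the block has whitespace. The sketch below is an assumption-laden illustration in Python, not something from the answer; `replace_block_loosely` and `whitespace_tolerant_pattern` are hypothetical names.

```python
import re


def whitespace_tolerant_pattern(old):
    """Compile a regex matching `old` with any whitespace run between its tokens."""
    tokens = old.split()  # split on whitespace; leading/trailing whitespace is dropped
    return re.compile(r"\s+".join(re.escape(token) for token in tokens))


def replace_block_loosely(text, old, new):
    """Replace the first whitespace-insensitive match of `old` in `text` with `new`."""
    if not old.split():
        return text  # an empty pattern would match everywhere; do nothing instead
    pattern = whitespace_tolerant_pattern(old)
    # A function replacement keeps backslashes and "$" in `new` literal.
    return pattern.sub(lambda match: new, text, count=1)
```

This only relaxes whitespace. Any other difference, such as a changed attribute or year, would still prevent a match, which is where an HTML parser becomes the safer tool.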
