Shell or PHP: remove an HTML comment if one of its words matches in the same paragraph

Question

I need to check whether any of the words in an HTML comment also appear in the same line. If so, the comment should be deleted; otherwise, it should be kept.

At the same time, the script needs to ignore pronouns, adverbs, and articles. I already have a list of more than a hundred such words, like this:

"the", "this", "I", "me", "you", "she", "her", "he", "him", "it", "they", "them", "that", "which", etc...

This is an example of one line:

text <!-- They are human # life --> text text <!-- the rights --> text the human text

After running the script:

text text text <!-- the rights --> text the human text

Summary:

There can be many comments in the same line, not only one.
The script needs to ignore my list of pronouns, adverbs, etc.
The script needs to ignore words that appear in other comments.
Matching is case-insensitive.
The files have over one thousand lines.
The comments usually contain the character # (I hope that is not a problem).

score 1 · Accepted Answer · answered Sep 18 '19 at 16:05

As others have mentioned, you should show some research, tell what you've tried and why it didn't work.

That being said, I found this to be a fun little challenge, so I decided to give it a go.

I assumed there are two files, "file.html" which we want to modify, and "words.txt" which lists the words to ignore separated by newlines (\n). This script should do the trick:

#!/bin/bash

FILE="file.html"
WORDS="words.txt"

#Set array delimiter to '\n':
IFS=$'\n'

#Find all comments within the file:
comments="$(cat $FILE | grep -oP '<!--[^<]+-->' | sort | uniq)"

for comment in $comments; do

  #Words In Comment. Gets all words in the comment.
  wic="$(echo $comment | head -1 | grep -oP '[^\s]+' | grep -v '<' | grep -v '>')"

  words="$(cat $WORDS)"

  #Filtered Words. It's $wic without any of the words in words.txt
  fw="$(echo $wic $words $words | tr ' ' '\n' | sort | uniq -u)"

  #if any remain
  if [ ! -z "$fw" ]
  then

    for word in $fw; do
      #Gets all lines with both the comment and the word outside the comment 
      lines="$(cat $FILE | grep -P "$comment.+$word|$word.+$comment")"

      #If it finds any
      if [ ! -z "$lines" ]
      then
        for line in $lines; do

          #Generate the replacement line
          replace="$(echo $line | sed "s/$comment//g")"

          #Replace the line with the replacement in the file
          sed -i "s/$line/$replace/g" $FILE

        done
      fi
    done
  fi
done

It's not perfect, but it gets the job done. I tested it on a file with the following contents:

text <!-- foo # --> foo
text <!-- bar # --> foo
text <!-- bar # --> bar
text <!-- bar # --> text <!-- something # --> something bar
text <!-- foo # --> text <!-- bar # --> text foo bar

Using the following words.txt:

foo

And got the expected result:

text <!-- foo # --> foo
text <!-- bar # --> foo
text  bar
text  text  something bar
text <!-- foo # --> text  text foo bar
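
The filtering step in the script relies on a small trick: the ignore list is passed twice, so every ignored word occurs at least twice and `uniq -u` (which prints only lines that occur exactly once) drops it. Here is a minimal sketch of that idea in isolation. The helper name `filter_words` and the sample inputs are made up for illustration, and unlike the script it deduplicates the comment words first, so a word repeated inside a comment is not dropped by accident.

```shell
#!/bin/bash
# filter_words is a hypothetical helper, not part of the script above.
# $1: words found in a comment, $2: words to ignore (both space-separated).
filter_words() {
  {
    # Comment words, one per line, each listed once.
    printf '%s\n' "$1" | tr -s ' ' '\n' | LC_ALL=C sort -u
    # Ignore words listed twice, so they can never be unique.
    printf '%s\n%s\n' "$2" "$2" | tr -s ' ' '\n'
  } | LC_ALL=C sort | uniq -u
}

# Prints "#", "human" and "life", one per line.
filter_words "they are human # life" "they are"
```

Note that `#` survives the filter, so it may be worth adding it to the ignore list if the comments contain it often.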

Sorry for not following the rules. The script looks amazing, but when I try to run it on my Mac it doesn't work. I read other posts and tried changing the quotes and other things, but it is still not working. One of the errors said: usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] and, after I changed the single quotes to double quotes, it said: line 10: !--[^: No such file or directory — fireDevelop.com, Sep 18 '19 at 20:00
Yeah, I did this on Linux. Apparently grep works slightly differently on mac. This post should help: https://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches — 3snoW, Sep 19 '19 at 10:22
Thanks, 3snoW, your code is amazing. There is also a solution in PHP for other operating systems. I have not managed to convert the Linux grep calls to macOS grep. — fireDevelop.com, Sep 20 '19 at 23:49
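
Regarding the macOS error reported in the comments: `-P` (Perl-compatible regular expressions) is a GNU grep option that the grep shipped with macOS does not accept. Assuming the two patterns are the only obstacle, they can be written as POSIX extended regular expressions with `-E`, which both GNU and BSD grep understand. The sketch below has not been tested on macOS, and the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers showing -E equivalents of the two -P patterns.

# Print every HTML comment found on standard input, one per line.
list_comments() {
  grep -oE '<!--[^<]+-->'
}

# Print the words of one comment, without the <!-- and --> markers.
comment_words() {
  # [^[:space:]]+ is the POSIX spelling of the Perl-style [^\s]+
  printf '%s\n' "$1" | grep -oE '[^[:space:]]+' | grep -v '[<>]'
}

printf '%s\n' 'text <!-- They are human # life --> text <!-- the rights --> text' | list_comments
comment_words '<!-- the rights -->'
```

The rest of the script (the `sed -i` call in particular) may need further changes on macOS; this only covers the grep patterns.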

score 0 · Answer 2 · answered Sep 20 '19 at 23:43

Here is a solution in PHP:

#!/usr/bin/env php
<?php
/** usage from command line:
 *          php index.php input.html words.txt
 *  where   input.html is the book file
 *  and     words.txt is a file with excluded words (one on each line)
 *
 *  result will be in file out_input.html
 */

$transforming = false;

// input and excluded words must be submitted
if (isset($argv[1]) && isset($argv[2])) {
    $transforming = true;

    $inputFilename = $argv[1];
    $inputFile = fopen($inputFilename, "r") or die('Input file not found');

    $excludedWordFilename = $argv[2];
    $excludedWordsFile = fopen($excludedWordFilename, 'r') or die('Excluded words file not found');
    // load excluded words, stripping the line ending that fgets keeps
    $excludedWords = [];
    while (($excludedWord = fgets($excludedWordsFile)) !== false) {
        $excludedWords[] = trim($excludedWord);
    }

    $outputLines = [];

    // read input file line by line; fgets keeps each line's own line ending
    while (($line = fgets($inputFile)) !== false) {
        $outputLines[] = process($line, $excludedWords);
    }
    // write result to file; the lines already end with a newline, so join without a separator
    $outputFile = implode('', $outputLines);
    $outputFilename = 'out_'.$inputFilename;
    file_put_contents($outputFilename, $outputFile);

} else {
    echo 'no file, please use this format: php index.php "inputfile.html" "excludedwords.txt"';
}


function process($line, $excludedWords)
{
    // splits the line into comments and non-comment parts
    $lineParts = preg_split('/(<!--.+?-->)/msi', $line, 0, PREG_SPLIT_NO_EMPTY + PREG_SPLIT_DELIM_CAPTURE);
    // extract all comments from the line
    $lineComments= preg_grep('/<!--.+?-->/', $lineParts);
    // And keep the non comment part of the line for word comparison
    $lineText = implode(' ', preg_grep('/<!--.+?-->/', $lineParts, PREG_GREP_INVERT));

    // get the original comment tags and trimmed comment words within it
    preg_match_all('/<!--[\s](.+?)[\s]-->/msi', implode(' ', $lineComments), $comments);
    list($commentTags, $commentTexts) = $comments;
    $comments = array_combine($commentTags, $commentTexts);

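    // explode each comment into words and skip the excluded ones
    // (strtolower/stripos fold ASCII case only; accented capitals would need the mbstring functions)
    $excludedLower = array_map('strtolower', array_map('trim', $excludedWords));
    foreach ($comments as $tag => $words) {
        $moreWordsToCheck = preg_split('`[\s,#]+`', $words, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($moreWordsToCheck as $wordToCheck) {
            // case-insensitive check against the exclude list
            if (! in_array(strtolower($wordToCheck), $excludedLower, true)) {
                // stripos returns 0 for a match at the very start, so compare against false
                if (stripos($lineText, $wordToCheck) !== false) {
                    $line = str_replace($tag, '', $line);
                }
            }
        }
    }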

    return $line;
}
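
The PHP function compares the comment words only against the text outside the comments, which is what keeps words in other comments from counting as matches. For readers who prefer the shell, the same idea can be sketched with sed and grep. The helper names are hypothetical, and the sketch assumes comments never contain a `<` character (the bash answer makes the same assumption with its `[^<]+` pattern).

```shell
#!/bin/bash
# Hypothetical helpers; they assume a comment never contains '<'.

# Print the line with every HTML comment replaced by a space.
text_outside_comments() {
  printf '%s\n' "$1" | sed -E 's/<!--[^<]*-->/ /g'
}

# Succeed if word $2 occurs, as a whole word and ignoring case,
# in the part of line $1 that is outside the comments.
word_outside_comments() {
  text_outside_comments "$1" | grep -qiwF -- "$2"
}

line='text <!-- They are human # life --> text <!-- the rights --> text the human text'
word_outside_comments "$line" "human" && echo "human: found outside comments"
word_outside_comments "$line" "rights" || echo "rights: only inside a comment"
```

One design difference: `grep -w` matches whole words only, which is stricter than `stripos`, since `stripos` also matches inside longer words.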

Here is also an example of the words.txt file for Spanish speakers, with most of the pronouns, adverbs, and so on:

a
a cuál
a cuáles
a lo mejor
a qué
a quién
a quiénes
acaso
además
ahí
ahora
algo
algún
alguna
algunas
alguno
algunos
allí
alrededor
ante
anteayer
antes
aparte
aquel
aquella
aquellas
aquello
aquellos
aquí
así
asimismo
aún
ayer
bajo
bastante
bastantes
bien
cabe
cada
casi
cerca
como
con
contra
cuál
cuáles
cuanta
cuánta
cuantas
cuántas
cuanto
cuánto
cuantos
cuántos
cuya
cuyas
cuyo
cuyos
de
debajo
delante
demasiado
dentro
deprisa
desde
despacio
después
detrás
durante
el
él
el cual
el mío
el nuestro
el que
el suyo
el tuyo
el vuestro
ella
ellas
ellos
en
encima
entre
esa
esas
ese
eso
esos
esta
estas
este
esto
estos
fuera
hacia
hasta
hoy
incluso
jamás
la
la cual
la mía
la nuestra
la que
la suya
la tuya
la vuestra
las
las cuales
las mías
las nuestras
las que
las suyas
las tuyas
las vuestras
le
lejos
les
lo
los
los cuales
los míos
los nuestros
los que
los suyos
los tuyos
los vuestros
luego
mal
más
me
mediante
medio
menos
mi
mía
mías
mío
míos
mis
mucho
muy
nada
ningún
ninguna
ningunas
ninguno
ningunos
no
nos
nosotras
nosotros
nuestra
nuestras
nuestro
nuestros
nunca
os
otra
otras
otro
otros
para
poco
por
pronto
que
qué
quien
quién
quienes
quiénes
quizá
quizás
se
según
sendas
sendos
sí
sin
so
sobre
su
sus
suya
suyas
suyo
suyos
tal vez
también
tampoco
tanta
tantas
tanto
tantos
tarde
te
temprano
toda
todas
todavía
todo
todos
tras
tu
tú
tus
tuya
tuyas
tuyo
tuyos
un
una
unas
unos
usted
ustedes
varias
varios
versus
vía
vos
vosotras
vosotros
vuestra
vuestras
vuestro
vuestros
ya
yo
