I need to remove blocks of text from thousands of HTML files, using a string match for the start of each block and a specific tag sequence for the end. The start string is id="123" in the line <div class="list-group-item" id="123">, and the end tag is </div>#13#10</div>, where #13#10 is a carriage return followed by a line feed. Every file will have a different search string, but all blocks end with the same end tag. I read the search strings (123, 456, 789) from another text file, where each search string is on its own line. Can someone please suggest the most efficient way of doing this?

John Kugelman
  • Am I right in thinking you want to remove the entire element: the opening tag with the specified id and its matching closing tag? If so, you're honestly going to be better off using an HTML parser, to prevent potential issues with finding the matching closing tag. – SEoF Jan 21 '22 at 14:16
  • I want to remove a block of code. And it must be done with php. – harry_belefonte Jan 21 '22 at 14:22
  • How many ids do you have in your list of ids to delete? If it's a small list then we can use one regex to search for them, such as `<div class="list-group-item" id="(123|456|789)">.*?</div>\r\n</div>`. If it's a big list, we might have to do several groups of searches. Are your HTML files stored in the same directory or in several sub-directories? If it's flat then a simple `glob('*.htm')` can be used. If not, you'll have to write a [recursive search in the sub-folders](https://stackoverflow.com/questions/17160696/php-glob-scan-in-subfolders-for-a-file). – Patrick Janser Jan 21 '22 at 14:40
  • There are thousands of removal IDs and all in different directories. We have a file with an ID and the file name. – harry_belefonte Jan 21 '22 at 14:52
  • Could you provide us a part of the **content of this file?** Add it to your question so that we know what it looks like. The same goes for an **example HTML file**. This will help us see what's in the HTML and whether a regular expression will do or we need a DOM parser. – Patrick Janser Jan 21 '22 at 14:59
  • The file contains lines with an ID number and the name of the file to remove the code block from, for example ID123:/project/bridge/file1.html and ID456:/project/car/file114.html. The files are simple HTML files with list group items, and each list group item block has an ID. We need to remove some of these entries using the ID number, so the line containing id="123" is the start of the block and the end is two /DIV lines. – harry_belefonte Jan 21 '22 at 15:03
  • Be wary of removing HTML with regular expressions - you'll need to make sure you have the right closing tag to match the start tag, or risk breaking the HTML. Hence, I strongly recommend using an HTML DOM parser. – SEoF Jan 21 '22 at 15:58
  • Please share more details. What have you tried so far? Where are you stuck? If you are really working with **thousands** of files, why not use something other than PHP? – Nico Haase Jan 21 '22 at 16:14

1 Answer

With the help of your comments, I understand that you have a file listing each id to search for together with the HTML file to remove it from.

Step 1: Read your input file with the ids and filenames

Let's say that "ids_and_files.txt" contains this:

ID123:/project/bridge/file1.html
ID456:/project/car/file114.html
ID789:/project/tunnel/file1.html
ID789:/project/bridge/file1.html

In PHP, I would read the file with the file() function, which returns an array of lines. Then, for each line, a regex extracts the id and the filename. Several ids may target the same file, so I fill a dictionary where the key is the filename and the value is the list of ids to remove:

<?php

define('LIST_FILENAME', 'ids_and_files.txt');

header('Content-Type: text/plain');

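$list_lines = file(LIST_FILENAME, FILE_IGNORE_NEW_LINES);
// Compare strictly with false: an empty file yields an empty array, which "or die" would treat as a failure.
if ($list_lines === false) {
    die('Cannot read ids file!');
}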

$files_and_ids = [];

foreach ($list_lines as $line) {
    if (preg_match('/^ID(\d+):(.*)$/', $line, $matches)) {
        // If the file doesn't exist in the list then create the array of ids.
        if (!isset($files_and_ids[$matches[2]])) {
            $files_and_ids[$matches[2]] = [
                $matches[1]
            ];
        } else {
            // The file already has an entry, so append this id to its list.
            $files_and_ids[$matches[2]][] = $matches[1];
        }
    } else {
        print "ERROR: Could not extract id and filename from '$line'\n";
    }
}

print var_export($files_and_ids, true) . "\n";

This will print the following dictionary:

array (
  '/project/bridge/file1.html' => array (
    0 => '123',
    1 => '789',
  ),
  '/project/car/file114.html' => array (
    0 => '456',
  ),
  '/project/tunnel/file1.html' => array (
    0 => '789',
  ),
)

Step 2: Find and replace in the HTML

  • Loop over all the files from our dictionary.
  • For each file, build the regular expression based on the list of ids to search and remove.
  • Read the HTML file into a string (we have enough memory).
  • Do the preg_replace() operation to get the new HTML.
  • Write it back to the file if some replacements have been done.
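
The pattern-building step from the list above could be isolated in a small function. This is only a sketch: `build_pattern` is a hypothetical helper name, not part of the original script, and the `preg_quote()` call is a precaution that changes nothing as long as the ids are digits only, which the `\d+` in step 1 guarantees.

```php
<?php
// Hypothetical helper: builds the removal pattern for the ids of one file.
function build_pattern(array $ids): string
{
    // preg_quote() escapes regex metacharacters and the '#' delimiter.
    // With digit-only ids it returns them unchanged.
    $quoted = array_map(function ($id) {
        return preg_quote((string) $id, '#');
    }, $ids);

    return '#<div\s+class="list-group-item"\s+id="(?:'
        . implode('|', $quoted)
        . ')">.*?</div>\r?\n</div>#s';
}

print build_pattern(['123', '789']) . "\n";
```

Keeping this in one place makes it easier to adjust the pattern later, for instance if the attribute order differs in some files.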

The regular expression pattern

#<div\s+class="list-group-item"\s+id="(?:123|789)">.*?</div>\r?\n</div>#s

In PHP, you can use almost any non-alphanumeric character as the pattern delimiter, followed by the modifiers. I used # instead of the conventional / (typical in JS) so that I don't have to escape the slash in </div>.

I used \s+ instead of a literal space in case the whitespace varies. I also used \r?\n in case some files use Unix line endings (LF) instead of Windows ones (CRLF).

The s modifier makes the . match any char, including new lines.
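
To check how the lazy .*? and the optional \r behave, here is a small self-contained test. The sample markup and variable names are made up for illustration; only the pattern itself comes from this answer.

```php
<?php
// Hypothetical sample: a list with two items, using Windows (CRLF) line endings.
$html = "<div class=\"list-group\">\r\n"
      . "<div class=\"list-group-item\" id=\"123\">\r\n<div>First</div>\r\n</div>\r\n"
      . "<div class=\"list-group-item\" id=\"456\">\r\n<div>Second</div>\r\n</div>\r\n"
      . "</div>\r\n";

$pattern = '#<div\s+class="list-group-item"\s+id="(?:123)">.*?</div>\r?\n</div>#s';

// Remove the block with id 123 from the CRLF version.
$result = preg_replace($pattern, '', $html, -1, $count);

// Same content converted to Unix (LF) line endings: the pattern still matches.
$unix_html = str_replace("\r\n", "\n", $html);
$unix_result = preg_replace($pattern, '', $unix_html, -1, $unix_count);

print "CRLF replacements: $count, LF replacements: $unix_count\n";
print $result;
```

Both counts are 1 and the item with id 456 is kept. A blank line remains where the block was, because the line break after the final </div> is not part of the match. Also note that the lazy match stops at the first `</div>`, line break, `</div>` sequence it meets, so a block that contains that sequence internally would be cut short.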

The PHP code

foreach ($files_and_ids as $filename => $ids) {
    $pattern = '#<div\s+class="list-group-item"\s+id="(?:' . implode('|', $ids) . ')">.*?</div>\r?\n</div>#s';

    print "\nSearch in $filename\n";
    print "Pattern: $pattern\n";

    $html = @file_get_contents($filename); // @ suppresses the warning; the return value is checked below.
    if ($html === false) {
        print "ERROR: Could not open file '$filename'\n";
    } else {
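        $new_html = preg_replace($pattern, '', $html, -1, $count);
        if ($new_html === null) {
            // preg_replace() returns null on a PCRE error (e.g. backtrack limit): never write that back.
            print "ERROR: Regex failed on '$filename' (PCRE error code " . preg_last_error() . ")\n";
        } elseif ($count > 0) {
            print "$count replacements in '$filename'\n";
            // file_put_contents() returns false on failure; 0 bytes written is not an error.
            if (file_put_contents($filename, $new_html) === false) {
                print "ERROR: Could not write file '$filename'\n";
            }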
        } else {
            printf("ERROR: IDs %s could not be found in '%s'\n", implode(', ', $ids), $filename);
        }
    }
}

Example of output where I tested the error handling:

Search in file1.html
Pattern: #<div\s+class="list-group-item"\s+id="(?:123|789)">.*?</div>\r?\n</div>#s
2 replacements in 'file1.html'

Search in file114.html
Pattern: #<div\s+class="list-group-item"\s+id="(?:456)">.*?</div>\r?\n</div>#s
ERROR: Could not open file 'file114.html'

Search in file2.html
Pattern: #<div\s+class="list-group-item"\s+id="(?:789)">.*?</div>\r?\n</div>#s
ERROR: IDs 789 could not be found in 'file2.html'
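
Several comments recommend an HTML parser instead of a regex. A minimal sketch of that alternative follows. It is a different technique from the regex above, it assumes PHP's `dom` extension is available, and `remove_items_by_id` is a hypothetical helper name with made-up sample markup.

```php
<?php
// Hypothetical helper: removes <div class="list-group-item" id="..."> elements with DOM.
// Returns the new HTML and the number of removed elements.
function remove_items_by_id(string $html, array $ids): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $removed = 0;
    foreach ($ids as $id) {
        // Ids are digits only (see the \d+ in step 1), so embedding them in the query is safe.
        $query = sprintf('//div[@class="list-group-item" and @id="%s"]', $id);
        // Copy the result to an array before modifying the tree.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }

    return [$doc->saveHTML(), $removed];
}

// Sample with nested divs inside the item to remove.
$html = '<html><body><div class="list-group">'
      . '<div class="list-group-item" id="123"><div><div>nested</div></div></div>'
      . '<div class="list-group-item" id="456"><div>kept</div></div>'
      . '</div></body></html>';

[$new_html, $removed] = remove_items_by_id($html, ['123']);
print "$removed removed\n";
```

The parser finds the matching closing tag itself, so nested divs inside the block are not a problem. The trade-off is that saveHTML() re-serializes the whole document, so whitespace and other formatting outside the removed block can change, and loadHTML() may misread the encoding of files that declare no charset. If the rest of each file must stay byte-for-byte identical, the regex approach touches less of it.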
Patrick Janser