0

I need to remove the first instance of the <P> and </P> tags in multiple .htm files, all in a single directory, using a batch command. Any suggestions?

Edit - I just realized that there may be multiple DIVs in the .htm files, and so I would need to remove only the 1st instance of the <P> and </P> tags in each DIV (if any). And to clarify, I only want the tags removed, but do want the content/text in between the tags to remain. Thanks for the answers/comments thus far!!!

As for why: long story, but I work for an agency that has a contract with a vendor who did not test the version we paid for with IE11. As a result, whenever a page has more than one paragraph, the first paragraph tag makes all of the text display 15 pixels lower than expected. I cannot change the vendor's code; however, I can modify the output after the e-learning course has been exported, which is what I need this batch file for. If I remove only the first instance of the paragraph tag on each page, the entire text displays as expected.

Kelly

6 Answers

3

The safest solution (albeit perhaps the slowest and most complicated) would be to parse your HTML files as HTML and remove the first paragraph from the DOM. This would give you the benefit of not being restricted to any sort of dependable formatting of the HTML source. Comments are properly skipped, line breaks are handled correctly, and life is all sunshine and daisies. Parsing the HTML DOM can be done using an InternetExplorer.Application COM object. Here's a batch / JScript hybrid example:

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in (*.html) do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo(WSH.Arguments(0));

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0));

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

var p = IE.document.getElementsByTagName('p');

if (p && p[0]) {

    /* If you want to remove the surrounding <p></p> only
    while keeping the paragraph's inner content, uncomment
    the following line: */

    // while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

    p[0].parentNode.removeChild(p[0]);
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE html>\n'
        + '<html>\n'
        + IE.document.documentElement.innerHTML
        + '\n</html>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}

And because you're working with the DOM, additional tweaks are made easier. To delete the first <p> element within each <div> element (just as a wild example, not that anyone would ever want this), navigate the DOM as you would in browser-based JavaScript.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    echo Batch section: "%%~fI"
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo('JScript section: "' + WSH.Arguments(0) + '"');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0)),
    changed;

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

for (var d = IE.document.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
        + '<HTML xmlns:t= "urn:schemas-microsoft-com:time" xmlns:control>\n'
        + IE.document.documentElement.innerHTML
        + '\n</HTML>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}
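
To see what the "move contents up, then delete the empty node" step does without launching IE, here is a minimal sketch in plain JavaScript on a made-up miniature tree. The names `node`, `firstDescendant`, `unwrap`, `render` and `unwrapFirstP` are hypothetical helpers for illustration only; they are not part of the DOM or of WSH. The search visits nodes in document order, which is the order `getElementsByTagName` reports them in.

```javascript
// Hypothetical miniature tree: a node is { tag, children }, text is a plain string.
function node(tag, children) {
    return { tag: tag, children: children || [] };
}

// Find the first descendant with the given tag, in document order.
// Returns { parent, index } or null when there is none.
function firstDescendant(parent, tag) {
    for (var i = 0; i < parent.children.length; i++) {
        var child = parent.children[i];
        if (typeof child === 'string') continue;
        if (child.tag === tag) return { parent: parent, index: i };
        var found = firstDescendant(child, tag);
        if (found) return found;
    }
    return null;
}

// Replace the node at parent.children[index] with its own children.
function unwrap(hit) {
    var target = hit.parent.children[hit.index];
    var args = [hit.index, 1].concat(target.children);
    Array.prototype.splice.apply(hit.parent.children, args);
}

// Serialize back to markup so the result is easy to inspect.
function render(n) {
    if (typeof n === 'string') return n;
    var inner = '';
    for (var i = 0; i < n.children.length; i++) inner += render(n.children[i]);
    return '<' + n.tag + '>' + inner + '</' + n.tag + '>';
}

// Unwrap the first <p> below a div; report whether anything changed.
function unwrapFirstP(div) {
    var hit = firstDescendant(div, 'p');
    if (hit) unwrap(hit);
    return !!hit;
}
```

Because the search covers all descendants and not just direct children, a `<p>` inside a nested `<div>` counts as the outer div's first paragraph too. That is worth keeping in mind if the exported pages nest their divs.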
rojo
  • Looks and sounds like a sturdy solution - have my vote! Would you care to explain the hybrid aspect and whether you save it as a `.BAT` file and how you run it please? – Mark Setchell Feb 11 '15 at 10:13
  • Yep, you save it as a `.bat` file and run it as you would any other `.bat` script. Batch evaluates the first line as false and continues to the next line. When the batch thread reaches the `cscript` line, the script is re-evaluated as JScript. JScript evaluates the first line as false and skips to `@end`. When the JScript reaches the end of the script, it returns control to batch. And then batch performs the next loop iteration. [See this page](https://gist.github.com/davidruhmann/5199433) for more examples of hybrid batch scripts. – rojo Feb 11 '15 at 12:04
  • Which version of Windows are you running this on? I run Win7 x64 at home and at work; with IE9 installed at work, and IE11 installed at home. It runs without error in both places. Does the path to your html files include any weird characters? I wouldn't think that would matter, but I'd like to rule it out as a cause anyway. – rojo Feb 11 '15 at 19:58
  • I did change your code to search out .htm files instead of .html, as they are all .htm files only. I also did uncomment your code to keep the text in between the tags. However, I also realized that some pages have multiple DIVs, and I need to remove only the first paragraph tags in each DIV in each .htm page. – Kelly Feb 11 '15 at 20:06
  • Try enclosing the `while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);` within a `try { while (IE.Busy etc... }` block, and on the next line put `catch(e) { i=0; while (!(IE.document.getElementsByTagName('p') || [0])[0] && ++i < 200) WSH.Sleep(25); }`. I don't understand why `IE.Busy` or `IE.ReadyState` would cause the script to choke in that environment, regardless of security policy. That's what's on line 24, where the error occurs anyway. – rojo Feb 11 '15 at 20:22
  • And are you saying you want not just the first paragraph tag deleted, but the first paragraph tag of every div? That's quite a departure from the spirit of the original question at the top of the page, wouldn't you say? I really think your self-prescribed solution needs to be reinspected. The correct fix should be to fix the CSS, not to hack the page structure. – rojo Feb 11 '15 at 20:24
  • Try right-clicking the script and choosing "Run as administrator". See if that makes a difference. – rojo Feb 11 '15 at 20:29
  • And yes, that's precisely what I'm saying. And I agree, it's a horrible hack job. However, the vendor will no longer support our current version of this software, and I can't find anywhere that is causing only the first paragraph of each DIV to drop 15 pixels. The other oddity, is in IE8, this doesn't happen. And yes, I've got compatibility mode turned on in IE11, doesn't help. We have thousands of employees who will be taking this training and IE11 is being rolled out to our agency very soon. We can't have text overlaying graphics or other text on the screen, which is what's happening. – Kelly Feb 11 '15 at 20:29
  • Good times. You could add a CSS rule for `div > p:first-of-type { margin-top: -15px }` or similar. :) Before the line where the error occurs, try `WSH.Echo(htmlfile)` and see whether you notice anything odd about the path name that's received by `IE.Navigate` – rojo Feb 11 '15 at 20:31
  • What doctype is defined in the htm files? – rojo Feb 11 '15 at 20:34
  • Here's the doctype in the .htm files... – Kelly Feb 11 '15 at 20:40
  • Now we're getting somewhere. `WSH.Arguments(0)` is apparently empty. When you changed `*.html` to `*.htm`, did you happen to remove the `"%%~fI"` argument from the end of the `cscript` line? – rojo Feb 11 '15 at 20:40
  • The first line of the JScript portion says `WSH.Echo(WSH.Arguments(0));`. Does that echo the filename? Or is the *only* output of the script the error on line 24? In the batch section, what happens if you `echo %%~fI` within the `for` loop just above `cscript`? What happens if you enclose `"*.htm"` within quotes, so it reads `for %%I in ("*.htm") do (`? – rojo Feb 11 '15 at 20:49
  • No, there is no filename displayed, just the error message. Though when I'm inside the for %%I in the batch part, and put in the echo, it does show the actual .htm filename. However, if I echo outside the for $$I, I get nothing to echo. Also, no change if I put *.htm in quotes. – Kelly Feb 11 '15 at 20:53
  • I just noticed: Transitional DTD? No wonder you've got browser inconsistency. :P Alright. So you haven't messed with your `cscript` line, right? It's still `cscript /nologo /e:JScript "%~f0" "%%~fI"` character-for-character? Does the path containing your HTM files, or the HTM file names themselves, contain any odd characters? I'm hoping we can get this issue figured out, as modifying my script to remove the first `<p>` tag of every `<div>` shouldn't be too difficult to add. – rojo Feb 11 '15 at 21:09
  • What? There's no `!` anywhere in the `IE.Navigate` line. Anyway, I just updated my answer to include code for removing the first `<p>` of every `<div>`. Give that a shot and see what happens. – rojo Feb 11 '15 at 21:32
  • By the way, I hope you're testing this on copies and not your production `htm` files, just in case something goes egregiously wrong. – rojo Feb 11 '15 at 21:39
  • So, I realize now what was happening... When I copied and pasted, it was breaking the Navigate line and adding the ! between htmlfile. and replace. So now when I ran it, it opened all 68 pages in IE11 and froze up my laptop. – Kelly Feb 11 '15 at 21:39
  • Nice to have the mystery solved. Anyway, I'mma peace out for now. Good luck! If my answer was helpful, please consider marking it as accepted. [See this page](http://meta.stackexchange.com/q/5234/275822) for an explanation of why this is important. Re: opening 68 pages, try closing IE before running the script and see if it makes a difference. – rojo Feb 11 '15 at 21:45
  • Even with closed, it is still opening all the files in the folder as separate windows. In the background, the dos command shows the Batch and JScript sections with the path and filename, but then says there's an error on the while (IE.Busy... line (26,1) (null) and states "Unspecified error". – Kelly Feb 11 '15 at 21:52
  • Did the first few htm files get modified? Try adding `WSH.Sleep(5000)` as the last line of the script to force a 5-second pause between loop iterations. Same result? – rojo Feb 11 '15 at 22:01
  • Still opening IE11, but IE says "IE restricted this webpage from running scripts or ActiveX controls." and a button to allow blocked content. I'm not sure why IE needs to open at all. Shouldn't it be static and just edit the .htm file without the need to actually open it? Also in the Dos command, there's a new error message that states, "The object invoked has disconnected from its clients. Still at the (26,1) line. – Kelly Feb 11 '15 at 22:11
  • And no, the files are not being modified. – Kelly Feb 11 '15 at 22:11
  • I had a thought this evening. Actually, you gave me the idea when you asked, "Shouldn't it be static and just edit the .htm file without the need to actually open it?" Yeah, the `InternetExplorer.Application` COM object is not very, but there aren't very many programmatic ways to evaluate the DOM without it. However, the DOM applies to XML as well. [See this answer](http://stackoverflow.com/a/28468671/1683264) for a similar but different direction. – rojo Feb 12 '15 at 03:21
2

The solution you were probably expecting, a pure batch solution, would involve a bunch of for loops. This example will strip the entire line(s) from the first <p> to the first </p>.

I'm sure npocmaka, MC ND, Aacini, jeb or dbenham can accomplish this with half the code and ten times the efficiency. *shrug*

This is the middle-of-the-road solution, offering more tolerance for line breaks within your <p> tag than the PowerShell regexp replacement, but not quite as safe as the InternetExplorer.Application COM object JScript hybrid.

@echo off
setlocal

for %%I in (*.html) do (

    set p_on_line=

    rem // get line number of first <p> tag
    for /f "tokens=1 delims=:" %%n in (
        'findstr /i /n "<p[^ar]" "%%~fI"'
    ) do if not defined p_on_line set "p_on_line=%%n"

    if defined p_on_line (

        rem // process file line-by-line
        setlocal enabledelayedexpansion
        for /f "delims=" %%L in ('findstr /n "^" "%%~fI"') do (
            call :split num line "%%L"

            rem // If <p> has not yet been reached, copy line to new file
            if !num! lss !p_on_line! (
                >>"%%~dpnI.new" echo(!line!
            ) else (
                rem // If </p> has been reached, resume writing.
                if not "!line!"=="!line:</p>=!" set p_on_line=2147483647
            )
        )
        endlocal
        if exist "%%~dpnI.new" move /y "%%~dpnI.new" "%%~fI" >NUL
    )
)

goto :EOF

:split <num_var> <line_var> <string>
setlocal disabledelayedexpansion
set "line=%~3"
for /f "tokens=1 delims=:" %%I in ("%~3") do set "num=%%I"
set "line=%line:*:=%"
endlocal & set "%~1=%num%" & set "%~2=%line%"
goto :EOF
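
For anyone who finds the nested `for` loops hard to follow, the same line-dropping idea can be sketched in a few lines of plain JavaScript. This is an illustration of the logic only, not a port of the batch file; the function name `stripFirstParagraphLines` is made up. It reuses the `<p[^ar]` pattern from the `findstr` call so that `<param` and `<pre` are not mistaken for a paragraph.

```javascript
// Drop every line from the first one that opens a <p> tag through the
// first line (possibly the same one) that contains </p>. Whole lines are
// removed, so any other markup sharing those lines is lost as well.
function stripFirstParagraphLines(text) {
    var lines = text.split(/\r?\n/);
    var out = [];
    var state = 'before';               // 'before' -> 'inside' -> 'after'
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (state === 'before' && /<p[^ar]/i.test(line)) state = 'inside';
        if (state === 'inside') {
            // the line holding </p> is itself dropped, then copying resumes
            if (/<\/p>/i.test(line)) state = 'after';
            continue;
        }
        out.push(line);
    }
    return out.join('\n');
}
```

Note that this removes the paragraph's text along with the tags, exactly as described above, so it does not cover the "keep the content" requirement from the question's edit.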
rojo
1

The shortest solution would be to use a PowerShell one-liner.

powershell -command "gci '*.html' | %{ ([regex]'(?i)<p\W.*?</p>').replace([IO.File]::ReadAllText($_),'',1) | sc $_ }"

Please note that this will only work if there are no line breaks within the first paragraph. If there's a line break between <p> and </p> this will keep searching until it finds a paragraph that doesn't have a line break. You might be better off trying to fix the vendor's broken CSS than this hackish workaround.

Anyway, the command above roughly translates thusly:

  • In the current directory, get child items matching *.html
  • For each matching html file (the % is an alias for foreach-object):

    • Create a regex object matching from <p to shining </p>
    • Call that regex object's replace method with the following params:

      • use the HTML file contents as the haystack,
      • replace the needle with nothing,
      • and do this 1 time.
    • Set the content of the HTML file to be the result.

I used [IO.File]::ReadAllText($_) rather than gc $_ to preserve line breaks. Using get-content with [regex].replace mashes everything together into one line. I used a [regex] object rather than a simpler -replace switch because -replace is global.
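
The line-break caveat is easy to reproduce. Here is a small sketch in JavaScript rather than PowerShell; the two regex dialects differ in places, but in both of them `.` refuses to match a newline unless an option says otherwise. The function names are made up for the demonstration, and the `i` flag is there to cope with the uppercase `<P>` tags mentioned in the question.

```javascript
// "." stops at a line break, so a paragraph that spans lines can never
// match and the search moves on to the next candidate.
function removeFirstParagraph(html) {
    return html.replace(/<p\W.*?<\/p>/i, '');
}

// Variant: [\s\S] matches any character, line breaks included.
function removeFirstParagraphMultiline(html) {
    return html.replace(/<p\W[\s\S]*?<\/p>/i, '');
}
```

Given `'<p>one\ntwo</p><p>three</p>'`, the first function removes the second paragraph and the second function removes the first one. In .NET the equivalent of the second variant would be the `(?s)` inline option. Either way the paragraph's text goes with the tags.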

rojo
1
@ECHO Off
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "destdir=U:\destdir"
PUSHD "%sourcedir%"
FOR /f "delims=" %%f IN ('dir /b /a-d "q28443084*" ') DO ((
 SET "zap=<P>"
 FOR /f "usebackqdelims=" %%a IN ("%%f") DO (
  IF DEFINED zap (
   SET "line=%%a"
   CALL :process
   IF DEFINED keep (ECHO(%%a) ELSE (iF DEFINED line CALL ECHO(%%line%%)
  ) ELSE (ECHO(%%a)
 )
 )>"%destdir%\%%f"
)
popd

GOTO :EOF

:process
SET "keep="
CALL SET "line2=%%line:%zap%=%%"
IF "%line%" equ "%line2%" SET "keep=y"&GOTO :EOF
SET "line=%line2%"
IF "%zap%"=="</P>" SET "zap="&GOTO :EOF 
SET "zap=</P>"
IF NOT DEFINED line GOTO :EOF 
SET "line=%line2:</P>=%"
IF "%line%" neq "%line2%" SET "zap="
GOTO :eof

This may work - it will suppress empty lines.

I chose to process files matching the mask q28443084* in directory u:\sourcedir to matching filenames in u:\destdir - you would need to change these settings to suit.

The process revolves around the setting of zap, which may be set to either <P>, </P> or nothing. The incoming line is examined, and is either kept as-is if it does not contain zap, or is output in modified form with zap adjusted to the next value. If zap is nothing, the input is simply reproduced to the output.
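
The zap idea is a three-state machine: hunting `<P>`, hunting `</P>`, done. A rough JavaScript sketch of that idea follows. It is not a line-for-line port of the batch file: the function name `removeFirstTagPair` is made up, it removes exactly one occurrence of each tag, and unlike the batch version it keeps lines that end up empty.

```javascript
// Remove the first "<P>" and the first "</P>" that follows it, keeping the
// text in between. `zap` is the pattern currently being hunted and becomes
// null once both tags are gone; after that, lines pass through untouched.
function removeFirstTagPair(lines) {
    var zap = /<p>/i;
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        var from = 0;                   // never look left of the last removal
        while (zap !== null) {
            var m = zap.exec(line.slice(from));
            if (!m) break;              // tag is not on this line
            var at = from + m.index;
            line = line.slice(0, at) + line.slice(at + m[0].length);
            from = at;
            // an opening tag was removed: hunt the closing tag next
            zap = (m[0].charAt(1) === '/') ? null : /<\/p>/i;
        }
        out.push(line);
    }
    return out;
}
```

Like the batch version, this only recognizes a bare `<P>`; a tag with attributes such as `<P class="x">` would slip past it.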

Magoo
1

Here's a similar solution to the HTML DOM answer. If your HTML is valid, you could try to parse it as XML. The advantage here is, where the InternetExplorer.Application COM object loads an entire fully-bloated instance of Internet Explorer for each page load, instead you're loading only a dll (msxml3.dll). This should hopefully handle multiple files more efficiently. The down side is that the XML parser is finicky about the validity of your tag structure. If, for example, you have an unordered list where the list items are not closed:

<ul>
    <li>Item 1
    <li>Item 2
</ul>

... a web browser would understand that just fine, but the XML parser will probably error. Anyway, it's worth a shot. I just tested this on a directory of 500 identical HTML files, and it worked through them in less than a minute.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write('Checking ' + WSH.Arguments(0) + '... ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('Microsoft.XMLDOM'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll().split(/<\/head\b.*?>/i),  
    head = html[0] + '</head>',
    body = html[1].replace(/<\/html\b.*?>/i,''),
    changed;

htmlfile.Close();

// attempt to massage body string into valid XHTML
var self_closing_tags = ['area','base','br','col',
    'command','comment','embed','hr','img','input',
    'keygen','link','meta','param','source','track','wbr'];

body = body.replace(/<\/?\w+/g, function(m) { return m.toLowerCase(); }).replace(
    RegExp([    // should match <br>
        '<(',
            '(' + self_closing_tags.join('|') + ')\\b',    // \b keeps <col> from matching <colgroup>
            '([^>]+[^\/])?',    // for tags with properties, tag is unclosed
        ')>'
    ].join(''), 'ig'), "<$1 />"
);

// set async before loading so parsing has finished when loadXML returns
DOM.async = false;
DOM.loadXML(body);

if (DOM.parseError.errorCode) {
   WSH.Echo(DOM.parseError.reason);
   WSH.Quit(0);
}

for (var d = DOM.documentElement.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

html = head + DOM.documentElement.xml + '</html>';

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(html);
    htmlfile.Close();
    WSH.Echo('Fixed!');
}
else WSH.Echo('Nothing to change.');
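
The "massage into XHTML" step is the fragile part, so here it is in isolation as a plain JavaScript sketch that can be tried on sample strings without MSXML. The names `VOID_TAGS` and `toXmlFriendly` are made up, the tag list is a subset of the one in the script, and the sketch puts a `\b` after the tag name so that `<colgroup>` is not mistaken for `<col>`. Treat it as an approximation: an attribute value containing `>` would still trip it up.

```javascript
// Tags that take no closing tag in HTML (subset, for illustration).
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Lower-case the tag names, then turn unclosed void tags such as <br> or
// <img src="x"> into <br /> and <img src="x" />. Tags that are already
// self-closed (<hr/>) are left alone.
function toXmlFriendly(html) {
    var lowered = html.replace(/<\/?\w+/g, function (m) { return m.toLowerCase(); });
    var voidTag = new RegExp(
        '<((?:' + VOID_TAGS.join('|') + ')\\b(?:[^>]*[^>\\/])?)>', 'g');
    return lowered.replace(voidTag, '<$1 />');
}
```

Only tag names are lower-cased; attribute names keep their original case, which matters because XML is case-sensitive about both.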
rojo
  • @Kelly Try this edit. That "a name was started with an invalid character" is an error I was getting before I stripped the useless xmlns garbage out of the `<html>` tag in my tests. I decided there's no reason to attempt to parse anything before `<body>` anyway, so this revision passes only the contents of `<body>` into the XML parser. See whether you have better luck with this. If this doesn't work, I may need you to pastebin an example `.htm` file so I can try to figure out what specifically is choking the XML parser, and massage it out. – rojo Feb 12 '15 at 15:01
  • @Kelly Dude, this code is a mess. I'm having a hard time fitting this round peg into the XML parser's square hole. But the `InternetExplorer.Application` script above worked perfectly. I copied the sample htm file 500 times in a folder and ran the script, and it did all 500 in about a minute. It was slower than the XML parser, but not painfully so, and I didn't get any ActiveX warnings or have any IE windows display at all. Is there any chance you could give yesterday afternoon's script another shot on a different computer? – rojo Feb 12 '15 at 17:26
  • LOL - yes, I know the code is a mess. Gotta love vendors. Anyways, I retried the IE application script above, and I keep getting 60 IE windows opening up and error at line 26,1. It says (null): Unspecified error – Kelly Feb 12 '15 at 17:51
  • Try running it with only one htm file in the directory. Does the IE window still open? Is it asking for permission or complaining of something being blocked? Does the error in the console still say "Unspecified error?" – rojo Feb 12 '15 at 17:53
  • So does IE complain of something being blocked? I bet if you did the same on your home computer, it'd work as intended. I'm struggling to understand what's different about your computer's operating environment from mine. As I said yesterday, that script works both on my domain-connected Windows 7 Enterprise w/ IE9 and my home Windows 7 Home Edition w/ IE11. Oh well, let me catch up on some emails and maybe I'll try to rewrite the XML parser only to load and fix the div tags themselves. (Split the html at `/<\/?div\b.*?>/g` or something.) – rojo Feb 12 '15 at 18:06
  • If you didn't have to support IE8 you could probably do a CSS declaration of `div > p:first-of-type { display: inline }`. But `first-of-type` is IE9+. – rojo Feb 12 '15 at 18:17
  • When IE opens, it wants me to click to allow blocked content. When I click to allow, it opens the page, but that's it. I tried looking at my IE security settings, and am unable to select "Allow active content from CDs to run on My Computer" or "Allow software to run or install even if the signature is invalid". Both of these options are greyed out on all of our laptops. – Kelly Feb 12 '15 at 18:44
0

For posterity, I found another solution. O.P. was having problems with browser security and group policy restrictions preventing the InternetExplorer.Application COM object from behaving as expected, and the HTML he's fixing cannot reasonably be massaged into valid XML for the Microsoft.XMLDOM parser. But I'm optimistic that the htmlfile COM object won't suffer from these same infirmities.

As I emailed the O.P.:

Peppered around Google searches I found occasional references to a mysterious COM object called "htmlfile". It appears to be a way to build and interact with the HTML DOM without using the IE engine. I can't find any documentation on it on MSDN, but I managed to scrape together enough methods and properties from trial and error to make the script work.

I've since discovered that there's more to the htmlfile COM object than meets the eye -- htmlfileObj.parentWindow.clipboardData for example (MSDN reference).

Anyway, I was most optimistic about this solution, but O.P. has stopped returning my emails. Perhaps it'll be useful to someone else though.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do cscript /nologo /e:JScript "%~f0" "%%~fI"

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write(WSH.Arguments(0) + ': ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('htmlfile'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll(),
    head = html.split(/<body\b.*?>/i)[0],
    bodyTag = html.match(/<body\b.*?>/i)[0],
    changed;

DOM.write(html);
htmlfile.Close();

if (DOM.getElementsByName('p_tag_fixed').length) {
    WSH.Echo('fix already applied.');
    WSH.Quit(0);
}

for (var d = DOM.body.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);

        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(
        head
        + '<meta name="p_tag_fixed" />'
        + bodyTag
        + DOM.body.innerHTML
        + '</body></html>'
    );
    htmlfile.Close();
    WSH.Echo('Fixed!')
}
else WSH.Echo('unchanged.');
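
The part of this script that is independent of the `htmlfile` object is the bookkeeping: split the page at the `<body>` tag, stamp it with a marker, and refuse to run twice. A minimal sketch of just that bookkeeping in plain JavaScript is below. `MARKER`, `alreadyFixed` and `rebuild` are made-up names, and the marker check here is a simple string search rather than a DOM lookup, which is a simplification on my part.

```javascript
var MARKER = '<meta name="p_tag_fixed" />';

// True when a previous run has already stamped the page.
function alreadyFixed(html) {
    return html.indexOf(MARKER) !== -1;
}

// Reassemble a page from everything before its <body ...> tag, the marker,
// the original <body ...> tag and a new body. Returns null when the page
// has no <body> tag, so the caller can skip the file instead of crashing.
function rebuild(html, newBodyInner) {
    var bodyTag = html.match(/<body\b.*?>/i);
    if (!bodyTag) return null;
    var head = html.slice(0, bodyTag.index);
    return head + MARKER + bodyTag[0] + newBodyInner + '</body></html>';
}
```

The marker makes the fix safe to re-run over a folder where some files were already processed, which matters here because a second pass would otherwise unwrap the next paragraph in each div.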
rojo