
This question has been asked a lot on Stack Overflow, but I can't get it to work. Any hints appreciated. Here is a text file (extension .mpl) containing offending text that needs to be removed:

plotsetup('ps', 'plotoutput = "plotfile.eps"', 'plotoptions' = "color=rgb,landscape,noborder");
print(PLOT3D(MESH(Array(1..60, 1..60, 1..3, [[[.85840734641021,0.,-0.],
[HFloat(undefined),HFloat(undefined),HFloat(undefined)],[.857971665313419,.0917163905694189,-.16720239349226],
... more like that ...
[.858407346410207,-3.25992468340355e-015,5.96532373555817e-015]]], datatype = float[8], order = C_order)),SHADING(ZHUE),STYLE(PATCHNOGRID),TRANSPARENCY(.3),LIGHTMODEL(LIGHT_4),ORIENTATION(35.,135.),SCALING(CONSTRAINED),AXESSTYLE(NORMAL)));

I want to remove every instance of:

[HFloat(undefined),HFloat(undefined),HFloat(undefined)],

and there are thousands of such instances! Note: the square brackets and the trailing comma are to be removed as well. There is no space between instances, so I have pages and pages of:

[HFloat(undefined),HFloat(undefined),HFloat(undefined)],   
[HFloat(undefined),HFloat(undefined),HFloat(undefined)],   
[HFloat(undefined),HFloat(undefined),HFloat(undefined)],

I won't list here all my failed attempts. Below is the closest I've come:

@echo off

SetLocal 
cd /d %~dp0

if exist testCleaned.mpl del testCleaned.mpl

SetLocal EnableDelayedExpansion

Set OldString=[HFloat(undefined),HFloat(undefined),HFloat(undefined)],
Set NewString=

pause

FOR /F "tokens=* delims= " %%I IN (test.mpl) DO (
    set str=%%I
    set str=!str:OldString=NewString!
    echo !str! >> testCleaned.mpl
    endlocal
)

EndLocal

The above was strung together, as it were, from pieces of code found on the web, especially at stackoverflow, e.g. Problem with search and replace batch file

What it does is produce a truncated file, as follows:

plotsetup('ps', 'plotoutput = "plotfile.eps"', 'plotoptions' = "color=rgb,landscape,noborder"); 
!str! 

Please don't hesitate to request clarifications. Apologies if you feel that this question has already been answered. I would very much appreciate if you would copy-paste the relevant code for me, as I have tried for several hours.

Bonus: can this automatic naming be made to work? "%%~nICleaned.mpl"

PatrickT
  • have you tried any other tools rather than DOS script? – kev Dec 21 '11 at 11:35
  • I'd recommend using C# script: http://www.csscript.net/ – Hybrid Dec 21 '11 at 11:55
  • kev, no I haven't tried other scripts. The content of the file I need to modify is a messed-up postscript file created with Maple (Maplesoft), I tried to fix it with Maple's StringTools, but while it works for small files, it doesn't for large files (for some reason). After several hours of unsuccessful debugging I thought I'd google for something else, and dos script is one of the most common hits. I'd be happy to try something else, if it doesn't involve hours of installations of new softs. I've got a fully working python system on ubuntu, if that helps. Thanks! – PatrickT Dec 21 '11 at 14:54
  • Hybrid, thanks for your suggestion. I've just received several answers, so I'll need a few minutes to process them, but I'll follow up on your link shortly. Is that a script you could help me with? ;-) – PatrickT Dec 21 '11 at 14:55
  • And one big thank you to the administrator who reformatted my question. Very decent of you. I'd made a mess of it. You rock. – PatrickT Dec 21 '11 at 14:57
  • correction, kev: this is not a postscript file, obviously, it's a file that will be processed by Maple to make it into a postscript file. The presence of the invalid entries mentioned in the question prevents Maple from executing properly. If the offending entries are manually removed, Maple will execute properly. Hence my search for a way to automate the removal of offending entries (I have a bunch of such files to process, it's not just a one off). – PatrickT Dec 21 '11 at 15:55
  • no-one has mentioned the power shell, any reason not to try that? I was googling sed, C# script, and so on, and came across this power shell. Version 3 for Windows 8 is out. It seems to have received good reviews. I couldn't get it to work either... $original_file = 'C:\test\test.mpl' $destination_file = 'C:\test\testClean.mpl' (Get-Content $original_file) | Foreach-Object { $_ -replace '[HFloat(undefined),HFloat(undefined),HFloat(undefined)],', '' } | Set-Content $destination_file – PatrickT Dec 21 '11 at 19:11
  • I figured out a way to do (sort of) it but it leaves blank lines. Is that ok or do you need to remove the lines as well? – MaskedPlant Dec 21 '11 at 19:50

4 Answers


The biggest problem with your existing code is that the SetLocal enableDelayedExpansion is misplaced - it should be within the loop, after set str=%%I.

Other problems:

  • will strip lines beginning with ;
  • will strip leading spaces from each line
  • will strip blank (empty) lines
  • will print ECHO is off if any line becomes empty or contains only spaces after substitution
  • will add extra space at end of each line (didn't notice this until I read jeb's answer)

Optimization issue - using >> inside the loop can be relatively slow, because the output file is reopened for every line. It is faster to enclose the whole loop in () and redirect once with >

Below is about the best you can do with Windows batch. I auto-named the output as requested and went one better: the script also preserves the extension of the original name.

@echo off
SetLocal
cd /d %~dp0
Set "OldString=[HFloat(undefined),HFloat(undefined),HFloat(undefined)],"
Set "NewString="
set file="test.mpl"
for %%F in (%file%) do set outFile="%%~nFCleaned%%~xF"
pause
(
  for /f "skip=2 delims=" %%a in ('find /n /v "" %file%') do (
    set "ln=%%a"
    setlocal enableDelayedExpansion
    set "ln=!ln:*]=!"
    if defined ln set "ln=!ln:%OldString%=%NewString%!"
    echo(!ln!
    endlocal
  )
)>%outFile%

Known limitations

  • limited to slightly under 8k per line, both before and after substitution
  • search string cannot include = or !, nor can it start with * or ~
  • replacement string cannot include !
  • search part of search and replace is case insensitive
  • last line will always end with newline <CR><LF> even if original did not

All but the first limitation could be eliminated, but it would require a lot of code, and would be horrifically slow. The solution would require a character by character search of each line. The last limitation would require some awkward test to determine if the last line was newline terminated, and then last line would have to be printed using <nul SET /P "ln=!ln!" trick if no newline wanted.

Interesting feature (or limitation, depending on perspective)

  • Unix style files ending lines with <LF> will be converted to Windows style with lines ending with <CR><LF>

There are other solutions using batch that are significantly faster, but they all have more limitations.

Update - I've posted a new pure batch solution that is able to do case sensitive searches and has no restrictions on search or replacement string content. It does have more restrictions on line length, trailing control characters, and line format. Performance is not bad, especially if the number of replacements is low. http://www.dostips.com/forum/viewtopic.php?f=3&t=2710

Addendum

Based on the comments below, a batch solution will not work for this particular problem because of the line length limitation.

But this code is a good basis for a batch based search and replace utility, as long as you are willing to put up with the limitations and relatively poor performance of batch.
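
The line length limit is what rules batch out here, since the asker's file turns out to consist of only a handful of enormous lines. As an illustration outside batch, here is a minimal Python sketch of a literal search and replace that reads fixed-size chunks instead of lines, so line length plays no role. The function name `replace_in_stream`, the chunk size, and the file names in the trailing comment are assumptions made for this example, not part of any existing tool.

```python
def replace_in_stream(src, dst, old, new="", chunk_size=64 * 1024):
    """Copy text from src to dst, replacing every literal `old` with `new`.

    Input is read in fixed-size chunks, so the result does not depend on
    how long the individual lines are.
    """
    if not old:
        raise ValueError("search string must not be empty")
    keep = len(old) - 1   # longest partial match that can sit at a chunk end
    buf = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pieces = []
        pos = 0
        while True:
            hit = buf.find(old, pos)
            if hit == -1:
                break
            pieces.append(buf[pos:hit])
            pieces.append(new)
            pos = hit + len(old)
        # Hold back the last `keep` characters: they may be the start of a
        # match that is only completed by the next chunk.
        safe_end = max(pos, len(buf) - keep)
        pieces.append(buf[pos:safe_end])
        dst.write("".join(pieces))
        buf = buf[safe_end:]
    dst.write(buf)        # leftover is shorter than `old`, so it cannot match


# Possible use on real files (names are placeholders); newline="" keeps the
# original line endings untouched:
# with open("test.mpl", newline="") as src, \
#         open("testCleaned.mpl", "w", newline="") as dst:
#     replace_in_stream(src, dst,
#                       "[HFloat(undefined),HFloat(undefined),HFloat(undefined)],")
```

Only the unprocessed tail of each chunk is carried forward, never already-replaced output, so the result should match a single-pass replacement over the whole file.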

There are much better text processing tools available, though they are not standard with Windows. My favorite is sed within the GNU Utilities for Win32 package. The utilities are free, and do not require any installation.

Here is a sed solution for Windows using GNU utilities

@echo off
setlocal
cd /d %~dp0
Set "OldString=\[HFloat(undefined),HFloat(undefined),HFloat(undefined)\],"
Set "NewString="
set file="test.mpl"
for %%F in (%file%) do set outFile="%%~nFCleaned%%~xF"
pause
sed -e"s/%OldString%/%NewString%/g" <%file% >%outfile%


Update 2013-02-19

sed may not be an option if you work at a site that has rules forbidding the installation of executables downloaded from the web.

JScript has good regular expression handling, and it is standard on all modern Windows platforms, including XP. It is a good choice for performing search and replace operations on Windows platforms.

I have written a hybrid JScript/Batch search and replace script (REPL.BAT) that is easy to call from a batch script. A small amount of code gives a lot of powerful features; not as powerful as sed, but more than enough to handle this task, as well as many others. It is also quite fast, much faster than any pure batch solution. It also does not have any inherent line length limitations.

Here is a batch script that uses my REPL.BAT utility to accomplish the task.

@echo off
setlocal
cd /d %~dp0
Set "OldString=[HFloat(undefined),HFloat(undefined),HFloat(undefined)],"
Set "NewString="
set file="test.mpl"
for %%F in (%file%) do set outFile="%%~nFCleaned%%~xF"
pause
call repl OldString NewString le <%file% >%outfile%

I use the L option to specify a literal search string instead of a regular expression, and the E option to pass the search and replace strings via environment variables by name, instead of using string literals on the command line.
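
The difference between a literal search and a regular expression search matters for this particular string, because in most regular expression dialects square brackets delimit a character class and parentheses form a group. The following Python sketch is only an illustration of that distinction, using Python's `re` module rather than JScript; the sample data and the function names are made up for the example.

```python
import re

OLD = "[HFloat(undefined),HFloat(undefined),HFloat(undefined)],"
SAMPLE = "[[.85,0.,-0.],\n" + OLD + "[.86,.09,-.17]]"
EXPECTED = "[[.85,0.,-0.],\n[.86,.09,-.17]]"


def remove_as_regex(text):
    # Unescaped: "[...]" is parsed as a character class followed by a comma,
    # so the pattern does not match the offending string as a whole.
    return re.sub(OLD, "", text)


def remove_as_literal(text):
    # re.escape() backslash-escapes the metacharacters, which turns the
    # pattern into a literal match of the full string.
    return re.sub(re.escape(OLD), "", text)


if __name__ == "__main__":
    print("regex   gives expected result:", remove_as_regex(SAMPLE) == EXPECTED)
    print("literal gives expected result:", remove_as_literal(SAMPLE) == EXPECTED)
```

This is the same reason the sed version above writes the brackets as \[ and \].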

Here is the REPL.BAT utility script that the above code calls. Full documentation is included within the script.

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::************ Documentation ***********
:::
:::REPL  Search  Replace  [Options  [SourceVar]]
:::REPL  /?
:::
:::  Performs a global search and replace operation on each line of input from
:::  stdin and prints the result to stdout.
:::
:::  Each parameter may be optionally enclosed by double quotes. The double
:::  quotes are not considered part of the argument. The quotes are required
:::  if the parameter contains a batch token delimiter like space, tab, comma,
:::  semicolon. The quotes should also be used if the argument contains a
:::  batch special character like &, |, etc. so that the special character
:::  does not need to be escaped with ^.
:::
:::  If called with a single argument of /? then prints help documentation
:::  to stdout.
:::
:::  Search  - By default this is a case sensitive JScript (ECMA) regular
:::            expression expressed as a string.
:::
:::            JScript syntax documentation is available at
:::            http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx
:::
:::  Replace - By default this is the string to be used as a replacement for
:::            each found search expression. Full support is provided for
:::            substitution patterns available to the JScript replace method.
:::            A $ literal can be escaped as $$. An empty replacement string
:::            must be represented as "".
:::
:::            Replace substitution pattern syntax is documented at
:::            http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx
:::
:::  Options - An optional string of characters used to alter the behavior
:::            of REPL. The option characters are case insensitive, and may
:::            appear in any order.
:::
:::            I - Makes the search case-insensitive.
:::
:::            L - The Search is treated as a string literal instead of a
:::                regular expression. Also, all $ found in Replace are
:::                treated as $ literals.
:::
:::            E - Search and Replace represent the name of environment
:::                variables that contain the respective values. An undefined
:::                variable is treated as an empty string.
:::
:::            M - Multi-line mode. The entire contents of stdin is read and
:::                processed in one pass instead of line by line. ^ anchors
:::                the beginning of a line and $ anchors the end of a line.
:::
:::            X - Enables extended substitution pattern syntax with support
:::                for the following escape sequences:
:::
:::                \\     -  Backslash
:::                \b     -  Backspace
:::                \f     -  Formfeed
:::                \n     -  Newline
:::                \r     -  Carriage Return
:::                \t     -  Horizontal Tab
:::                \v     -  Vertical Tab
:::                \xnn   -  Ascii (Latin 1) character expressed as 2 hex digits
:::                \unnnn -  Unicode character expressed as 4 hex digits
:::
:::                Escape sequences are supported even when the L option is used.
:::
:::            S - The source is read from an environment variable instead of
:::                from stdin. The name of the source environment variable is
:::                specified in the next argument after the option string.
:::

::************ Batch portion ***********
@echo off
if .%2 equ . (
  if "%~1" equ "/?" (
    findstr "^:::" "%~f0" | cscript //E:JScript //nologo "%~f0" "^:::" ""
    exit /b 0
  ) else (
    call :err "Insufficient arguments"
    exit /b 1
  )
)
echo(%~3|findstr /i "[^SMILEX]" >nul && (
  call :err "Invalid option(s)"
  exit /b 1
)
cscript //E:JScript //nologo "%~f0" %*
exit /b 0

:err
>&2 echo ERROR: %~1. Use REPL /? to get help.
exit /b

************* JScript portion **********/
var env=WScript.CreateObject("WScript.Shell").Environment("Process");
var args=WScript.Arguments;
var search=args.Item(0);
var replace=args.Item(1);
var options="g";
if (args.length>2) {
  options+=args.Item(2).toLowerCase();
}
var multi=(options.indexOf("m")>=0);
var srcVar=(options.indexOf("s")>=0);
if (srcVar) {
  options=options.replace(/s/g,"");
}
if (options.indexOf("e")>=0) {
  options=options.replace(/e/g,"");
  search=env(search);
  replace=env(replace);
}
if (options.indexOf("l")>=0) {
  options=options.replace(/l/g,"");
  search=search.replace(/([.^$*+?()[{\\|])/g,"\\$1");
  replace=replace.replace(/\$/g,"$$$$");
}
if (options.indexOf("x")>=0) {
  options=options.replace(/x/g,"");
  replace=replace.replace(/\\\\/g,"\\B");
  replace=replace.replace(/\\b/g,"\b");
  replace=replace.replace(/\\f/g,"\f");
  replace=replace.replace(/\\n/g,"\n");
  replace=replace.replace(/\\r/g,"\r");
  replace=replace.replace(/\\t/g,"\t");
  replace=replace.replace(/\\v/g,"\v");
  replace=replace.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
    function($0,$1,$2){
      return String.fromCharCode(parseInt("0x"+$0.substring(2)));
    }
  );
  replace=replace.replace(/\\B/g,"\\");
}
var search=new RegExp(search,options);

if (srcVar) {
  WScript.Stdout.Write(env(args.Item(3)).replace(search,replace));
} else {
  while (!WScript.StdIn.AtEndOfStream) {
    if (multi) {
      WScript.Stdout.Write(WScript.StdIn.ReadAll().replace(search,replace));
    } else {
      WScript.Stdout.WriteLine(WScript.StdIn.ReadLine().replace(search,replace));
    }
  }
}
dbenham
  • dbenham, thanks a lot! I'm particularly grateful for your detailed explanations. No doubt this will be useful to others as well. So I tested your solution on the example file I posted in my question, and it worked. Lucky that the file does not contain these = ! * ~ Sadly it doesn't work on my real-life file. And I think the problem is one of size. Perhaps my file violates the 8k limitation or some other size limitation? As there is a limit to the number of characters in the comment, let me develop this a little in the next comment below. – PatrickT Dec 21 '11 at 16:34
  • my real-life file is just like the example I posted, except that it weighs 11,226,123 bytes. In openoffice, I get (over 1,700 pages, still counting, will update you with the final count), and most of them are filled with the offending line [HFloat(undefined),HFloat(undefined),HFloat(undefined)], with no space between multiple instances. This would be hilarious if it weren't totally sad. Is this a too-big-to-batch issue? Any way to bail me out? Thanks a bunch! – PatrickT Dec 21 '11 at 16:43
  • 11,226,123 bytes --> 1934 pages in openoffice. – PatrickT Dec 21 '11 at 16:53
  • dbenham, I have already put your code to good use on some other, smaller files that needed fixes, so your help has been very, very useful to me. A big thank you. – PatrickT Dec 21 '11 at 17:06
  • @annoporci - There is no inherent file size limit to the solution, but you might lose patience waiting for a huge file to process. I think you are running into the hard ~8k line size limit. I'm not aware of any solution in Windows batch that can get around this limitation. – dbenham Dec 21 '11 at 17:16
  • I see. So I did the deletion manually (in openoffice), the cleaned-up file weighs 15,742 bytes for 5 pages only. Might there be a two-step workaround as follows: first step, remove all but the first and last four pages (or, more or less, keep the first 150 characters and the last 16,000 characters, trash the rest)? and in a second step use your bat file to clean up thoroughly. – PatrickT Dec 21 '11 at 17:34
  • 8k per line? wait a minute, my file is 1934 pages but only 5 lines! – PatrickT Dec 21 '11 at 17:48
  • @annoporci - added a sed solution – dbenham Dec 21 '11 at 18:17
  • thanks so much! I was trying to adapt the following lines of code, but wasn't there yet, so thanks let me try your solution! #!/bin/bash OLD="[HFloat(undefined),HFloat(undefined),HFloat(undefined)]," NEW="" DPATH="C:/test/test.mpl" BPATH="C:/test/backup" TFILE="C:/test/out.tmp.$$" [ ! -d $BPATH ] && mkdir -p $BPATH || : for f in $DPATH do if [ -f $f -a -r $f ]; then /bin/cp -f $f $BPATH sed "s/$OLD/$NEW/g" "$f" > $TFILE && mv $TFILE "$f" else echo "Error: Cannot read $f" fi done /bin/rm $TFILE – PatrickT Dec 21 '11 at 19:09
  • On Win7, I copied the file sed.exe from unix tool into my working directory where the files reside. Is there anything else I need to do before I run your sed bat file? Do I need to install anything? (I'm asking because all I get is an identical copy of testCleaned.mpl with no changes to it, suggesting that the bat file runs correctly right up until the call to sed, but stops there, and I have waited 10 minutes so I don't think it's still running) – PatrickT Dec 21 '11 at 19:18
  • @annoporci - oops, I had a comma in the sed command that did not belong. All fixed now. – dbenham Dec 21 '11 at 19:39
  • okay, so I also downloaded sed-4.2.1-setup from the sourceforge repository --- and contrary to the sed.exe file from the unix utilities, this one is an installer --- so I now have sed v.4.2 installed. But still no luck. Did you mean for me to install sed v.1.4, which I understand is an earlier, sometimes faster version? At any rate, it's 4am in my part of the world, so I'll be off for a few hours. Thanks again for your help. I'll try again tomorrow. – PatrickT Dec 21 '11 at 19:46
  • Read your reply above about the extra comma. But you know what, there's a g missing too. After this correction, the script works. And it works very, very, very fast. dbenham from the bottom of my heart, I thank you and thank you again. sed -e"s/%OldString%/%NewString%/g" <%file% >%outfile% – PatrickT Dec 21 '11 at 19:56
  • jolly glad I have this great tool because when I manually cleaned the file I actually removed important data, the cleaned up file weighs 1,356,683 bytes and not 15,742 bytes. So SED does it, spread the word! – PatrickT Dec 21 '11 at 20:10
  • @annoporci - Don't forget to check the answer as accepted if this works for you. – dbenham Dec 21 '11 at 20:30
  • accepted it is! I had voted it up but hadn't seen the tick icon. Done now. Thanks! – PatrickT Dec 21 '11 at 21:33

The Batch file below has the same restrictions as the previous solutions on the characters that can be processed; these restrictions are inherent to all Batch programs. However, this program should run faster if the file is large and relatively few lines need replacement. Lines that do not contain the search string are not processed, but copied directly to the output file.

@echo off
setlocal EnableDelayedExpansion
set "oldString=[HFloat(undefined),HFloat(undefined),HFloat(undefined)],"
set "newString="
findstr /N ^^ inFile.mpl > numberedFile.tmp
find /C ":" < numberedFile.tmp > lastLine.tmp
set /P lastLine=<lastLine.tmp
del lastLine.tmp
call :ProcessLines < numberedFile.tmp > outFile.mpl
del numberedFile.tmp
goto :EOF

:ProcessLines
set lastProcessedLine=0
for /F "delims=:" %%a in ('findstr /N /C:"%oldString%" inFile.mpl') do (
    call :copyUpToLine %%a
    echo(!line:%oldString%=%newString%!
)
set /A linesToCopy=lastLine-lastProcessedLine
for /L %%i in (1,1,%linesToCopy%) do (
    set /P line=
    echo(!line:*:=!
)
exit /B

:copyUpToLine number
set /A linesToCopy=%1-lastProcessedLine-1
for /L %%i in (1,1,%linesToCopy%) do (
    set /P line=
    echo(!line:*:=!
)
set /P line=
set line=!line:*:=!
set lastProcessedLine=%1
exit /B

I would appreciate it if you could run a timing test on this and the other solutions and post the results.

EDIT: I changed the set /A lastProcessedLine+=linesToCopy+1 line for the equivalent, but faster set lastProcessedLine=%1.

Aacini
  • Any solution using SET /P to read a file should list the additional restrictions: 1) lines limited to 1021 chars long 2) trailing control characters will be stripped from each line 3) file must use Windows style line termination of CRLF. – dbenham Dec 29 '11 at 02:42
  • Simple SET /P solution that performs search/replace on every line has constant performance for a given input file, regardless of the number of replacements. This complex SET /P solution slows dramatically as the number of replacements increases. This complex solution is marginally faster than the simple solution if minimal number of replacements, but it is MUCH slower if there are a lot of replacements. I prefer the simple approach. – dbenham Dec 29 '11 at 02:50

I'm no expert on batch files, so I can't offer a direct solution to your problem.

However, to solve your problem, it might be simpler to use an alternative to batch files.

For example, I'd recommend http://www.csscript.net/ (if you know C#). This tool lets you run C# files like batch files, while giving you the power to write your script in C# instead of horrible batch file syntax :)

Another alternative would be python, if you know python.

But I guess the point is, that this kind of task may be easier in another programming language.
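
A minimal sketch of what the Python route could look like, assuming the file fits comfortably in memory (the asker's file is about 11 MB). The function name `clean_file` and the "Cleaned" output naming are illustrative choices that mirror the naming requested in the question; nothing here comes from an existing tool.

```python
from pathlib import Path

OLD = b"[HFloat(undefined),HFloat(undefined),HFloat(undefined)],"


def clean_file(path, old=OLD, new=b""):
    """Write a copy of `path` with every `old` replaced; return the new path."""
    src = Path(path)
    # Same naming idea as the batch answers: test.mpl -> testCleaned.mpl
    dst = src.with_name(src.stem + "Cleaned" + src.suffix)
    # Bytes in, bytes out: nothing is decoded, so line endings and any
    # non-ASCII bytes are left exactly as they were.
    dst.write_bytes(src.read_bytes().replace(old, new))
    return dst


if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python clean_mpl.py test.mpl other.mpl
    for name in sys.argv[1:]:
        print(clean_file(name))
```

Because the replacement works on the whole file at once rather than line by line, line length is not a concern with this approach.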

Hybrid
  • Thanks Hybrid. I can't comment on how horrible the batch file syntax is, that would depend on relative to what, I just had to dabble in postscript code, would you call that pretty? ;-) So I've looked at the CS-Script link, thanks, this will be useful to have. It seems a little more, how shall I put it, intuitive? Do you have python code at hand that would do the string replacement described? I'm learning python, so it'll be a good exercise for me. – PatrickT Dec 21 '11 at 16:01
  • I have python on linux and maple on windows -- it seems like a lot of work to have both on both OS, so since the broken file was produced on windows my first reflex was to look for a windows solution, but if the batch file script refuses to yield within the next few days I'll try with python. Thanks. – PatrickT Dec 21 '11 at 16:07
  • Sorry, I don't know Python, I just know it's a respected scripting language, and from the bits I've come across, it seems much more intuitive than batch file syntax. – Hybrid Dec 22 '11 at 10:21

You defined delims=<space>, that's a bad idea if you want to preserve your lines, as it splits after the first space.
You should change this to FOR /F "tokens=* delims=" ....

Your echo !str! >> testCleaned.mpl will always append one extra space to each line, better use echo(!str!>>testCleaned.mpl.

You will also lose all empty lines, and all exclamation marks in all lines.

You could also try the code of Improved BatchSubstitute.bat

jeb
  • Hi jeb, thanks a lot, I did as you say but it didn't help. I haven't tried the Improved BatchSubstitute.bat yet. Will asap. Thanks a lot. Please do bear with me. – PatrickT Dec 21 '11 at 14:56
  • Thanks jeb, so I'm having the issue described above, my file is very big, I guess I need a more powerful tool than a batch file... 11,226,123 bytes --> 1,934 pages in openoffice! most of these pages contain the offending string... – PatrickT Dec 21 '11 at 16:55