
I have a file that is sometimes badly formatted because of stray CR/LF sequences.

A good file looks like this:

R00023j Field1 Field2 .... CR/LF
R00024n Field1 Field2 .... CR/LF
R00025k Field1 Field2 .... CR/LF

But sometimes a CR/LF is inserted inside one of the fields, which produces a file like this:

R00023j Fiel CR/LF
d1 Field2 .... CR/LF
R00024n Field1 Field2 .... CR/LF
R00025k Field1 Field2 .... CR/LF

We can consider that there are "good" CR/LFs (at the end of a line) and "bad" CR/LFs (inside a field).

A good CR/LF is one that is immediately followed, at the start of the next line, by RxxxxxY. All other CR/LFs are bad and have to be replaced by a . (dot).

x: digit
Y: letter
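
To make the rule concrete, here is a minimal Python sketch (not part of the original question) that labels every CR/LF in the sample as good or bad. The names `RECORD_START` and `classify_breaks` are made up for illustration, and it assumes the letter Y may be upper or lower case.

```python
import re

# Assumed reading of the rule: a record starts with "R", five digits, one letter.
RECORD_START = re.compile(r"R[0-9]{5}[A-Za-z]")

def classify_breaks(text):
    """Return one 'good'/'bad' label per CR/LF found in text."""
    lines = text.split("\r\n")
    labels = []
    # The break after lines[i] is judged by how lines[i + 1] begins.
    for next_line in lines[1:]:
        labels.append("good" if RECORD_START.match(next_line) else "bad")
    return labels

sample = (
    "R00023j Fiel\r\n"
    "d1 Field2\r\n"
    "R00024n Field1 Field2\r\n"
    "R00025k Field1 Field2\r\n"
)
print(classify_breaks(sample))  # ['bad', 'good', 'good', 'bad']
```

Note that the last CR/LF is labeled bad under the literal rule, because nothing follows it; the rule as stated says nothing about the end of the file.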

How is it possible to clean up such a file with a Windows batch file and a regex?

aschipfl
JRMBAL

4 Answers

Your spec is incomplete - A CR/LF is also good if it is at the very end of the file.

I have a simple solution using JREPL.BAT - A regex find/replace utility. JREPL is pure script (hybrid batch/JScript) that runs natively on any Windows machine from XP onward. Full documentation is available from the command line via jrepl /?, or jrepl /?? for paged help.

All that is needed is a simple one-liner from the command line. If your source is bad.txt, and you want to create good.txt, then:

jrepl "\r?\n(?=.)(?!R\d{5}[a-z])" "." /i /m /f bad.txt /o good.txt

You can overwrite the original file via /o -:

jrepl "\r?\n(?=.)(?!R\d{5}[a-z])" "." /i /m /f file.txt /o -

Use CALL JREPL if you put the command within a batch script.
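
For readers who want to see what the pattern does piece by piece, here is a rough Python rendering of it. This is an adaptation for illustration, not JREPL itself: `BAD_BREAK` and `clean` are hypothetical names, `re.IGNORECASE` stands in for /i, `[0-9]` replaces `\d`, and `(?=[^\r\n])` replaces `(?=.)` because Python's `.` would also match a lone CR.

```python
import re

# Pieces of the pattern:
#   \r?\n               a line break (the CR is optional)
#   (?=[^\r\n])         some character follows, so this is not the end of the file
#   (?!R[0-9]{5}[a-z])  and what follows is not the start of a record
BAD_BREAK = re.compile(r"\r?\n(?=[^\r\n])(?!R[0-9]{5}[a-z])", re.IGNORECASE)

def clean(text, replacement="."):
    """Replace every 'bad' line break in text with the replacement string."""
    return BAD_BREAK.sub(replacement, text)

bad = "R00023j Fiel\r\nd1 Field2\r\nR00024n Field1 Field2\r\n"
print(repr(clean(bad)))
# 'R00023j Fiel.d1 Field2\r\nR00024n Field1 Field2\r\n'
```

The first lookahead is what keeps the CR/LF at the very end of the file; the second is what keeps the CR/LF in front of each record.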

Note that you must search across lines, so the /M option must be used, which loads the entire file into memory. This limits the file size that can be processed. I believe the limit is somewhere between 1 and 2 gigabytes.

dbenham

The following should work if there are no special characters in your file and the additional CR/LF doesn't occur inside RxxxxxY:

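@echo off
setlocal enabledelayedexpansion
rem Buffer for the logical line currently being assembled:
set "line="
for /f "delims=" %%a in (t.txt) do (
  rem A line starting with RxxxxxY begins a new record:
  echo %%a|findstr /b "R[0-9][0-9][0-9][0-9][0-9][a-z]">nul && (
    rem Flush the previous record (nothing to flush on the first one):
    if defined line echo(!line!
    set "line=%%a"
  ) || (
    rem Continuation line: the preceding CR/LF was bad, so join with a dot:
    set "line=!line!.%%a"
  )
)
rem Output the last record:
echo(!line!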

If you have to adapt it to your needs, please pay attention to some findstr limitations.
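
The buffering idea behind the batch script above (hold the current record, append continuation lines, flush when the next record starts) can be sketched in Python as follows. It reads line by line, so the whole file never has to be in memory. `join_broken_records` is a hypothetical name, empty lines are skipped on the assumption that this mirrors `for /f`, and the letter is assumed to be either case.

```python
import re

RECORD_START = re.compile(r"R[0-9]{5}[A-Za-z]")

def join_broken_records(lines, replacement="."):
    """Yield repaired records from an iterable of lines without line endings."""
    buffer = None
    for line in lines:
        if not line:
            continue  # skip empty lines, as the batch loop does
        if buffer is None or RECORD_START.match(line):
            # A new record starts: flush the previous one, if any.
            if buffer is not None:
                yield buffer
            buffer = line
        else:
            # Continuation of a broken record: join with the replacement.
            buffer = buffer + replacement + line
    if buffer is not None:
        yield buffer

lines = [
    "R00023j Fiel",
    "d1 Field2",
    "R00024n Field1 Field2",
    "R00025k Field1 Field2",
]
for record in join_broken_records(lines):
    print(record)
```

For a real file, the input could be something like `(l.rstrip("\r\n") for l in open(path, newline=""))`, with each yielded record written out followed by CR/LF.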

Community
Stephan

Although you did not show any effort of your own, I decided to provide a script, because the task at hand seems quite challenging to me; so here we go (the code contains a lot of explanatory remarks, so do not be scared):

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
rem /* Regular expression string for `findstr` command (to match `RxxxxxY`);
rem    do not state `[a-z]` expression due to a nasty flaw of `findstr`!: */
set "_SEARCH=R[0-9][0-9][0-9][0-9][0-9][abcdefghijklmnopqrstuvwxyz]"
set "_REPLAC=." & rem // (character which each bad CR+LF is to be replaced by)

rem // Enumerate all files provided by command line arguments:
for %%F in (%*) do (
    rem /* Store paths of input and output files; to overwrite input files,
    rem    set `FILENEW` to `%%~fF` also: */
    set "FILEOLD=%%~fF"
    set "FILENEW=%%~dpnF_NEW%%~xF"
    rem // Initialise buffer for concatenated line strings:
    set "LBUF="
    rem // Read currently iterated file line by line (ignoring empty lines):
    setlocal EnableDelayedExpansion
    for /F "delims=" %%L in ('type "!FILEOLD!" ^& ^> "!FILENEW!" rem/') do (
        endlocal
        rem // Store current line string:
        set "LINE=%%L"
        setlocal EnableDelayedExpansion
        rem/ Double " due to pipe:
        set "LINE=!LINE:"=""!"
        rem /* Loop iterating once only over the current line with quotation
        rem    marks doubled in order to avoid trouble with the pipe later;
        rem    this allows disabling delayed expansion which might cause
        rem    trouble with pipes too in case `!` or `^` characters appear: */
        for /F "delims=" %%K in (^""!LINE!"^") do (
            endlocal
            rem /* Feed line string into `findstr` command using a pipe:
            rem    for case-insensitivity, add switch `/I` to `findstr`: */
            echo("%%~K"| > nul findstr /X /R /C:\"%_SEARCH%.*\"
            rem // Test whether `findstr` encountered a match:
            if ErrorLevel 1 (
                rem /* No match encountered, so CR+LF was bad, hence
                rem    concatenate previous buffer with current line,
                rem    separated by the predefined character; due to a
                rem    preceding `endlocal` command, `LINE` no longer
                rem    contains the doubled quotation marks at this point;
                rem    the `for /F` loop transfers the resulting string over
                rem    the `endlocal` barrier safely: */
                setlocal EnableDelayedExpansion
                for /F "delims=" %%E in (^""!LBUF!%_REPLAC%!LINE!"^") do (
                    endlocal
                    set "LBUF=%%~E"
                )
            ) else (
                rem /* Match encountered, so CR+LF is good, hence return
                rem    the current buffer; the `if` query avoids outputting
                rem    an empty line initially: */
                if defined LBUF (
                    setlocal EnableDelayedExpansion
                    >> "!FILENEW!" echo(!LBUF!
                    endlocal
                )
                rem // Store the current line to the buffer:
                set "LBUF=%%L"
            )
        )
        setlocal EnableDelayedExpansion
    )
    rem // Return the remaining content of the buffer finally:
    >> "!FILENEW!" echo(!LBUF!
    endlocal
)

endlocal
exit /B

In case the search pattern (RxxxxxY) should be treated case-insensitively, simply add the /I switch to the findstr command.

Note that the overall length of each (concatenated) line is limited to about 8190 characters.

aschipfl

Thank you, everybody, for your contributions. dbenham, you are right that my spec is incomplete: a CR/LF at the end of the file is also good. Thank you for the JREPL link!

I solved the case with a regex in PowerShell:

$FileOut = $fileIn -creplace '\x0D\x0A(?![R][0-9]{5}[a-z])', '. '

with $fileIn read using the options -Encoding UTF8 -Raw.
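
One thing worth checking with this pattern is the CR/LF at the very end of the file, since it has no guard for that case. The Python sketch below applies the same pattern to a small sample and compares it with a guarded variant; the variable names are hypothetical, and `(?!\Z)` is Python's end-of-string anchor (in .NET regex the corresponding anchor is `\z`).

```python
import re

text = "R00023j Fiel\r\nd1 Field2\r\nR00024n Field1\r\n"

# Same pattern and replacement as the -creplace call
# (case-sensitive, lowercase letter only, replacement is dot plus space).
as_posted = re.sub(r"\x0D\x0A(?![R][0-9]{5}[a-z])", ". ", text)

# Variant with an extra guard so a CR/LF at the very end is left alone.
guarded = re.sub(r"\x0D\x0A(?!\Z)(?![R][0-9]{5}[a-z])", ". ", text)

print(repr(as_posted))  # 'R00023j Fiel. d1 Field2\r\nR00024n Field1. '
print(repr(guarded))    # 'R00023j Fiel. d1 Field2\r\nR00024n Field1\r\n'
```

If the trailing ". " is unwanted, adding such a guard (or trimming the input before the replace) would be one way to avoid it.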

JRMBAL
  • Instead of posting this "thank you" comment as an answer, please consider accepting the most helpful answer and leaving a comment. – Alex Shesterov Dec 04 '16 at 19:16