1

I have huge csv file with data, and some of lines are incorrect and contains enters. When file is imported into Excel then I need to correct hundreds lines manually. I have regex which is work in Notepad++ and remove enters from line which is not start with specific string in this case ";" However same regex is not working in PowerShell script.

Example of input

;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

How it should look:

;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Code:

$content = Get-Content -path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" 
$content -Replace '"\R(?!;)"', ' ' |  Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv" 
Santiago Squarzon
  • 41,465
  • 5
  • 14
  • 37
majan
  • 125
  • 12

2 Answers2

2

It has to do with line continuation \ in your ps script.

I would also suggest adding -Raw if you want to get content of file as single string, rather than an array of strings, for easier replacing.

I'm assuming it's a .csv file you are using.

$content = Get-Content -Path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" -Raw
$content -Replace '(?m)(^[^;].*)\r?\n(?!;)', '$1 ' |  Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"
Kreetchy
  • 688
  • 4
  • 14
  • It doesn't work, this is example of this issue, now I have two lines but it should be just one line ;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (ENETR is HERE) 15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; – majan Feb 18 '23 at 14:21
  • I've updated the code with what you want – Kreetchy Feb 18 '23 at 14:36
  • 1
    `-Raw` is good pointer, but "line continuation ``\`` in your ps script" is confusing here: there's no line _continuation_, only an effective line _break_ (`\r` == CR, presumably) that is in the _input file_, not in the script. Apart from changing ``\R`` to ``\r`` (or perhaps ``\r?\n``), the OP's simple `-replace` operation will do. Note that unless you add `-NoNewFile` to the `Out-File` call, an extra (CRLF) newline will be appended to the file. – mklement0 Feb 18 '23 at 17:07
2

Building on the helpful comments on the question:

  • In order to perform replacements across lines of a text file, you need to either read the file in full - with Get-Content -Raw - or perform stateful line-by-line processing, such as with the -File parameter of a switch statement.

    • Note: While you could also do stateful line-by-line processing by combining Get-Content (without -Raw) with a ForEach-Object call, such a solution would be much slower - see this answer.
  • Your regex, '"\R(?!;)"', has two problems:

    • It accidentally uses embedded " quoting. Use only '...' quoting. PowerShell has no special syntax for regex literals - it simply uses strings.
      To avoid confusion with PowerShell's own up-front string interpolation, it is better to use verbatim '...' strings rather than expandable (interpolating) "..." strings - see the conceptual about_Quoting_Rules help topic.

    • \R is an unsupported regex escape sequence; you presumably meant \r, i.e. a CR char. (CARRIAGE RETURN, U+000D)

      • If you instead want to match CRLF, a Windows-format newline sequence, use \r\n

      • If you want to match LF (LINE FEED, U+000A)) alone (a Unix-format newline), use \n

      • If you want to match both newline formats, use \r?\n

      • As an aside: While use of CR alone is rare in practice, PowerShell treats stand-alone CR characters as newlines as well, which is why Get-Content without -Raw, which reads line by line (as you've tried) wouldn't work.


Get-Content -Raw solution (easier and faster than switch -File, but requires the whole file to fit into memory twice):

# Adjust the '\r' part as needed (see above).
(Get-Content -Raw -LiteralPath $inFile) -replace '\r(?!;)' |
  Set-Content -NoNewLine -Encoding utf8 -LiteralPath $outFile

Note:

  • By not specifying a substitution operand to -replace, the command removes all newlines not followed by a ; ((?!;)), effectively joining the line that follows the CR directly to the previous line, which is the desired behavior based on your sample output.

  • For saving text, Set-Content is a bit faster than Out-File (it'll make no appreciable difference here, given that only a single, large string is written).

    • -NoNewLine prevents a(n additional) trailing newline from getting appended to the file.
    • -Encoding utf8 specifies the output character encoding. Note that PowerShell never preserves the input character encoding, so unless you use -Encoding on output, you'll get the respective cmdlet's default character encoding, which in Windows PowerShell varies from cmdlet to cmdlet; in PowerShell (Core) 7+, the consistent default is now BOM-less UTF-8. Note that in Windows PowerShell -Encoding utf8 always create a file with a BOM; see this answer for background information and workarounds.
mklement0
  • 382,024
  • 64
  • 607
  • 775