0

I am having a csv file, quotes around each field.

there are some fields, which can have multiple double quotes inside them. I want to escape each of them with additional double quote.

","ABC "XYZ" PQRS","
","ABC "XYZ"","
","ABC "A" "B" TEST","
","ABC 2.5" "C" Test","

I took help from the link and able to cover for scenarios with single double quote inside content, using regular expression [regex]$r='(","[^"]+"[^"]+?",")'. But, stuck up in the cases, where there are multiple double quotes inside content.

[regex]$r='(","[^"]+"[^"]+"",")' # Not working
get-content C:\Projects\MyProject\testRegexFordoublequotes.csv | foreach {

  #save each line to a variable to make it easier to track

  $line=$_

  #look for a regex match

  $find=$r.matches($line)
  
  if ($find[0].Success) { 

      foreach ($match in $find) {

        #the original string we matched on

        $found=$match.value

        #replace the substring

        $replace= '","'+  $found.Trim('","').Replace('""','"').Replace('"','""')+ '","'

        #replace the full string and write to the pipeline

        $line -replace $found,$replace

      } #foreach
       

  } #if

  else {

        #no match so write the line to pipeline

        $line

    }

 } | Set-Content C:\Projects\MyProject\modified.csv -Force

Can you please help me in defining regex which will be helpful for multiple double quotes inside field.

Venkataraman R
  • 12,181
  • 2
  • 31
  • 58
  • 1
    Use something like `Import-Csv` instead of trying to parse it out yourself. – Matt Clark Feb 08 '21 at 05:11
  • @MattClark, thanks for your suggestion. Importing is failing, due to double quotes inside content for some fields. Also, there are ~250k files like this, and there are many fields like this. So, wanted to do it using regex to do it faster – Venkataraman R Feb 08 '21 at 05:15
  • 1
    [Have fun with this one](https://stackoverflow.com/a/18147076/1790644). – Matt Clark Feb 08 '21 at 05:18
  • @MattClark, I am not very good in regex. Looks very complicated to implement. As you are suggestion, let me see, if I can handle it using csv parser. but, I need to figure out the files with these types of content – Venkataraman R Feb 08 '21 at 05:35
  • As others pointed out, it is better to use a CSV parser such as https://github.com/peterthoeny/parse-csv-js. However, your text does not conform to CSV, so a CSV parser would not help. Are these actual lines or shortened lines? A `","` at the beginning and end look strange. Also, are all fields enclosed in quotes, e.g. are adjacent fields always separated by `","`? – Peter Thoeny Feb 08 '21 at 07:56
  • @PeterThoeny, I am just putting part of the text which I want to match. We are having csv with quoted strings. – Venkataraman R Feb 08 '21 at 08:25

3 Answers3

2

It is probably easier to search for the valid delimiters (e.g. "\s*,\s*") and split your lines into fields, and than simply correct each (invalid) single double quote with 2 quotes in each field.
Than rebuild the fields to a record by surrounding the fields with double quotes and join them with the csv (comma) delimiter

Input

$Csv = @'
"Field","ABC "XYZ" PQRS","Field"
"Field","ABC "XYZ"","Field"
"Field","ABC "A" "B" TEST","Field"
"Field","ABC 2.5" "C" Test","Field"
'@ -Split '[\r\n]+'

Script

$Csv | # replace with: get-content .\testRegexFordoublequotes.csv |
Foreach-Object {
    $Line = $_ -Replace '^\s*"' -Replace '"\s*$' # Strip outer double quotes
    $Fields = $Line -Split '"\s*,\s*"'           # Split line into fields
    $Fields = $Fields -Replace '"', '""'         # Escape each " in each field
    '"' + ($Fields -Join '","') + '"'            # Rejoin the fields to line
} # append: | Set-Content .\modified.csv -Force

Output

"Field","ABC ""XYZ"" PQRS","Field"
"Field","ABC ""XYZ""","Field"
"Field","ABC ""A"" ""B"" TEST","Field"
"Field","ABC 2.5"" ""C"" Test","Field"
iRon
  • 20,463
  • 10
  • 53
  • 79
  • 1
    FYI, I just simplified the script in the answer a bit as you actually don't need to iterate through each field if you use the PowerShell [`-Replace`](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_comparison_operators) operator: [*When the `` to the -replace operator is a collection, PowerShell applies the replacement to every value in the collection*](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_comparison_operators) – iRon Feb 08 '21 at 10:58
1

Based on the conversation we had in the comments of the post, the files are non-conformant CSV files, hence a CSV parser is of no help.

Mind you, you have an undefined case if a single cell happens to have a some textext","more text. That cell will be treated as two cells because of the unescaped quotes.

Now to the regex. You could find a regex with lookahead and lookbehind, but I think it is easier to blindly double all quotes, then cleanup the unintended ones, e.g. at the start & end of line, and in between cells.

I am not familiar with powershell, but here is a JavaScript/pseudo code that you can easily convert to the powershell syntax. I am using a single line containing all your stated test cases; you would loop through the lines in your file:

/* assume $line is:
"Start","ABC "XYZ" PQRS","ABC "XYZ"","ABC "A" "B" TEST","ABC 2.5" "C" Test","End"
*/

$fixed = $line.replace(/"/g, '""')
              .replace(/"",""/g, '","')
              .replace(/^""/, '"')
              .replace(/""$/, '"')

/* $fixed is:
"Start","ABC ""XYZ"" PQRS","ABC ""XYZ""","ABC ""A"" ""B"" TEST","ABC 2.5"" ""C"" Test","End"
*/

Explanation:

  • .replace(/"/g, '""') - blindly double all quotes
  • .replace(/"",""/g, '","') - restore "","" back to ","
  • .replace(/^""/, '"') - restore "" at start of line to "
  • .replace(/""$/, '"') - restore "" at end of line to "
Peter Thoeny
  • 7,379
  • 1
  • 10
  • 20
  • thanks. different approach to solve the problem. very useful. I chose the other one, considering it was useful for somebody to refer to in future in the power shell syntax. Thanks for your help. – Venkataraman R Feb 08 '21 at 09:56
1

You may do the following to see what the changes will be:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""'

You can simply pipe to Set-Content to save the new contents:

(Get-Content file.csv) -replace '(?<!^|",)"(?!,"|$)','""' |
    Set-Content file.csv

Explanation:

(?<!^|",) is a negative lookbehind look for any previous position that is not the beginning of a line (^) or ",. (?!,"|$) is a negative lookahead for any next position that is not the end of a line ($) or ,". If those look around conditions are met, the " is replaced with "".

AdminOfThings
  • 23,946
  • 4
  • 17
  • 27