0

I have a huge csv file(around 100GB). My problem is that I need to replace commas(,) in the file with semi-colon(;) except for the ones within double-inverted commas("").

I tried several methods but none seem to be working. Also this modification needs to be on Windows, hence sed and awk are out of option.

Example:
Input : "A,B,C",D,E,"FG","H,J",K
Output : "A,B,C";D;R;"FG";H,J;K

Once this is done, I need to remove the ".

I am able to remove the " from the file, but semi-colon replacement is failing everytime.

Please let me know if this is achievable through Powershell.

vanandsh
  • 92
  • 10
  • 100 GB csv? I ain't gonna believe it – Amit Joki Oct 11 '14 at 11:46
  • Your example seems to be wrong. Please fix it. – nhahtdh Oct 11 '14 at 11:55
  • Why are `sed` and `awk` out of the question? http://unxutils.sourceforge.net/ – alroc Oct 11 '14 at 11:56
  • If sed and awk can be used, it would be wonderful. But the file is in a Windows box and not Linux. – vanandsh Oct 11 '14 at 12:10
  • You shouldn't do it this way. Use a programming language that has support for CSV files. Read in your file as a comma-separated file, then write it out as a ;-separated file without quotes (unless not necessary to protect a semicolon in the content). – alexis Oct 11 '14 at 12:33
  • By the way it's not hard to install sed, awk, or a real language like python on Windows. – alexis Oct 11 '14 at 12:34
  • gc file.csv -r 1000 |% {$_ -replace '","|",|,"',';' -replace '^"'} | sc newfile.csv – mjolinor Oct 11 '14 at 12:44
  • Last comment should read: gc file.csv -r 1000 |% {$_ -replace '","|",|,"',';' -replace '^"'} | ac newfile.csv This is a different regex than the linked solution becuase it's not exactly the same problem - he doesn't want to keep the double quotes. – mjolinor Oct 11 '14 at 12:54

1 Answers1

1

This should take care of both the delimiter replacement and removing the double quotes:

 Get-Content ./File.csv -ReadCount 1000 |
 foreach { $_ -replace ',(?=(?:[^"]|"[^"]*")*$)',';' -replace '"' } |
 Add-Content ./NewFile.csv 

and handle a large file efficiently without needing third party utilities.

mjolinor
  • 66,130
  • 7
  • 114
  • 135
  • The code does not work in cases when fields are not wrapped within "". For example: "A,B",C,D,"E" It returns A,B;C,D;E" – vanandsh Oct 11 '14 at 13:09
  • I can take care of that trailing " easily enough (see updated answer). If that's not what you want, then I'm misunderstanding what you mean by removing the ". Exactly what should that final output look like? – mjolinor Oct 11 '14 at 13:20
  • My apologies. The regex in the post origally flagged as a duplicate does work http://stackoverflow.com/questions/632475/regex-to-pick-commas-outside-of-quotes (you should go upvote that), but the powershell implementation for what you're wanting to do is different than using it with the unix utilities. I've updated the answer. – mjolinor Oct 11 '14 at 13:54
  • Thanks a lot! Works perfectly. Is there any way the performance could be improved? Altering the readcount value doesnt seem to make much of a difference. – vanandsh Oct 13 '14 at 03:28