7

I'm finding myself somewhat stumped on a simple problem. I'm trying to remove fancy (curly) quotes from a bunch of text files. I have the following script, in which I've tried a number of different replacement methods, but none of them has any effect.

Here's an example that downloads the data from GitHub and attempts to convert it.

    $srcUrl="https://raw.github.com/gist/1129778/d4d899088ce7da19c12d822a711ab24e457c023f/gistfile1.txt"
    $wc = New-Object net.WebClient
    $wc.DownloadFile($srcUrl,"foo.txt")
    $fancySingleQuotes = "[" + [string]::Join("",[char[]](0x2019, 0x2018)) + "]"

    $c = Get-Content "foo.txt"
    $c | % { `
            $_ = $_.Replace("’","'")
            $_ = $_.Replace("`“","`"")
            $_.Replace("`”","`"")
        } `
        |  Set-Content "foo2.txt"

What's the trick for this to work?

Peter Mortensen
Scott Weinstein

4 Answers

7

Here's a version that works:

    $srcUrl="https://raw.github.com/gist/1129778/d4d899088ce7da19c12d822a711ab24e457c023f/gistfile1.txt"
    $wc = New-Object net.WebClient
    $wc.DownloadFile($srcUrl,"C:\Users\hartez\SO6968270\foo.txt")

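    # Character classes for the curly single quotes (U+2018, U+2019) and double quotes (U+201C, U+201D)
    $fancySingleQuotes = "[\u2019\u2018]"
    $fancyDoubleQuotes = "[\u201C\u201D]"

    # Read the file as UTF-8 so the curly quotes are decoded correctly
    $c = Get-Content "foo.txt" -Encoding UTF8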

    $c | % { `
        $_ = [regex]::Replace($_, $fancySingleQuotes, "'")
        [regex]::Replace($_, $fancyDoubleQuotes, '"')
    } `
    |  Set-Content "foo2.txt"

The reason that manojlds' version wasn't working for you is an encoding mismatch: the file you're getting from GitHub is UTF-8, but `Get-Content` wasn't decoding it as UTF-8, so the curly quotes never reached the replacement code as the characters it was looking for. Reading the file with `-Encoding UTF8` fixes the problem.
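
To make the encoding point concrete, here is a minimal sketch rather than a definitive test. The file name `sample.txt` and the sample sentence are made up for illustration. It assumes Windows PowerShell, where `Get-Content` falls back to the system ANSI code page for files without a byte order mark. On PowerShell 6 and later the default is already UTF-8, so both reads should come out the same.

```powershell
# Hypothetical sample line containing curly single and double quotes.
# It is built from code points so the script file's own encoding does not matter.
$text = "It" + [char]0x2019 + "s " + [char]0x201C + "quoted" + [char]0x201D
$path = Join-Path $PWD "sample.txt"   # assumed file name

# Write the text as UTF-8 without a byte order mark.
[System.IO.File]::WriteAllText($path, $text, (New-Object System.Text.UTF8Encoding($false)))

$asDefault = Get-Content $path                  # decoded with the default encoding
$asUtf8    = Get-Content $path -Encoding UTF8   # decoded explicitly as UTF-8

# On Windows PowerShell the two reads usually differ; on PowerShell 6+ they match.
"Default read equals UTF-8 read: " + ($asDefault -eq $asUtf8)

# With the correct decoding, the replacements match as expected.
$clean = $asUtf8 -replace '[\u2018\u2019]', "'" -replace '[\u201C\u201D]', '"'
$clean   # It's "quoted"
```

If the output file may contain other non-ASCII text, passing `-Encoding UTF8` to `Set-Content` as well keeps those characters intact.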

Peter Mortensen
E.Z. Hart
  • the last `$_.Replace("`“","'")` does push the output line – Scott Weinstein Aug 06 '11 at 17:26
  • Your answer doesn't add anything imo. Why do you need to replace and assign it to $_ and then return $_? Just `[regex]::Replace($_,$fancySingleQuotes, "'")` already returns to the pipeline. And the OP is already doing it. – manojlds Aug 06 '11 at 18:22
  • Starting with PowerShell version 2, you can now use the `-replace` operator instead of `[regex]::Replace()`: `$line = $line -replace '[\u2019\u2018]', "'"` and `$line = $line -replace '[\u201C\u201D]', '"'` – Nathan Hartley Nov 16 '11 at 18:34
2

The following works on the input and output that you gave:

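    # Read the whole file first so it can be overwritten afterwards
    $c = Get-Content $file
    $c | % { `
        # Curly double quotes are escaped with a backtick because PowerShell treats them as string delimiters
        $_ = $_.Replace("’","'")
        $_ = $_.Replace("`“","`"")
        $_.Replace("`”","`"")
        } `
        |  Set-Content $file
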
Peter Mortensen
manojlds
0

Your last replace replaces a left fancy quote with a single quote. Is that what you want? It doesn't match your sample output. Try this:

    $_ = $_.Replace("`“","`"")
    $_.Replace("`”","`"")

Peter Mortensen
zdan
  • that is my answer, but what you have given as code is wrong, `"“”"` - replace will try to replace the entire string, not individual characters. and i believe they have to be escaped as well. – manojlds Aug 06 '11 at 18:57
  • @manojlds: right you are, the console was playing tricks on me. I've got to remember to use ISE for unicode. – zdan Aug 07 '11 at 00:53
-1

This Stack Overflow question is close to what I need. I was looking for something that would catch any non-ASCII character and found this question:

How do I remove all non-ASCII characters with regex and Notepad++?

The regex from that question seems to work fine in PowerShell as well:

    [^\x00-\x7F]+

It matches any run of non-ASCII characters (ASCII characters are valid UTF-8 too, so "UTF-8 characters" would be the wrong term). You can hone the regex if you need to be more specific.

My input only had the curly quote(s) as non-ASCII characters, so this simple substitution worked:

    # Replace each run of non-ASCII characters (here, curly quotes) with a standard single quote
    $cq = $cq -replace "[^\x00-\x7F]+", "'"

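
One caveat with the broad pattern is that it treats every non-ASCII character alike and collapses each run into a single replacement. The sketch below compares it with a pattern restricted to the four curly-quote code points. The sample sentence and the variable names `$sample`, `$broad`, and `$targeted` are made up for illustration.

```powershell
# Hypothetical sample line, built from code points to avoid script-encoding issues.
# It reads: He said <left double quote>cafe-with-accent<right double quote>, didn<right single quote>t he?
$sample = "He said " + [char]0x201C + "caf" + [char]0x00E9 + [char]0x201D +
          ", didn" + [char]0x2019 + "t he?"

# Broad pattern: every run of non-ASCII characters becomes one straight single quote.
# The accented letter and the closing double quote form one run, so both are lost.
$broad = $sample -replace "[^\x00-\x7F]+", "'"
$broad      # He said 'caf', didn't he?

# Targeted pattern: only the curly quotes are touched, each mapped to its straight form.
$targeted = $sample -replace '[\u2018\u2019]', "'" -replace '[\u201C\u201D]', '"'
$targeted   # He said "café", didn't he?
```

The broad version is fine when curly single quotes are the only non-ASCII characters in the input, as in the case described above. Otherwise the targeted version is the safer choice.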
Peter Mortensen