3

I am trying to run a regex to locate degree characters (\u00B0 or \u00BA), as well as the other form of the apostrophe (' --> \u00B4). I am reading latitude and longitude DMS coordinates like this one: 12º30'23.256547"S

The problem seems to lie in the way I am reading the file, because I can manually inject a string like the one below (format is latitude, longitude, description):

const string myTestString = @"12º30'23.256547""S, 12º30'23.256547""W, Somewhere";

and my regex matches as expected. I can also see the º values in that string, whereas when I use the StreamReader I see a � for every unrecognized character (the º symbol being one of them).

I've tried:

            var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
            var sr = new StreamReader(dlg.File.OpenRead(), Encoding.Unicode);
            var sr = new StreamReader(dlg.File.OpenRead(), Encoding.BigEndianUnicode);

in addition to the default ASCII.

Whichever way I read the file, I end up with these special characters. Any advice would be greatly appreciated!

Jordan
  • I tried this, but this didn't help: – Jordan Feb 11 '11 at 13:38
  • (+1) This was helpful for me, albeit my solution was in Powershell. Ironic that I had to specify the encoding as 'Default' or else it didn't work! `$Reader = New-Object System.IO.StreamReader($filepath,[System.Text.Encoding]::Default)` – Steve can help Jun 24 '21 at 09:02

3 Answers

3

You've tried various encodings... but presumably not the right one. You shouldn't just be guessing at encodings - find out what encoding it's really using, and use that. StreamReader itself is absolutely fine. It can deal with any encoding you give it, but it does have to match the encoding used when writing the file out.

Where does the file come from? What has written it out?

If it was written out with Notepad, it may well be using Encoding.Default, which is the system's default encoding (i.e. it will vary from machine to machine). If at all possible, change whatever is creating the file to use a single standard encoding - personally I'm a big fan of UTF-8.
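As a rough illustration of the point that the reader's encoding has to match the writer's, here is a minimal sketch. It assumes the file was saved as ISO-8859-1, which is only a guess about the real file; the sample string and the `ReadWith` helper are hypothetical, and a `MemoryStream` stands in for the file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sample line; \u00BA is the º character from the question.
string original = "12\u00BA30'23.256547\"S";

// Assumption: the file was written as ISO-8859-1 (one byte per character).
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
byte[] fileBytes = latin1.GetBytes(original);

// Hypothetical helper: decode the same bytes through a StreamReader.
string ReadWith(Encoding encoding)
{
    using var reader = new StreamReader(new MemoryStream(fileBytes), encoding);
    return reader.ReadToEnd();
}

string wrong = ReadWith(Encoding.UTF8); // mismatched: º decodes to U+FFFD
string right = ReadWith(latin1);        // matched: º survives

Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0); // True
Console.WriteLine(right == original);            // True
```

The bytes are identical in both reads; only the encoding handed to the StreamReader changes the result.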

Jon Skeet
1

You need to identify what encoding the file was saved in, and use that when you read it with your StreamReader.

If it was created using a regular text editor, I'm guessing the default encoding is either Windows-1252 or ISO-8859-1.

The º character in your sample is 0xBA in ISO-8859-1 (the true degree sign ° is 0xB0), so it falls outside the 7-bit ASCII table. I don't know how Encoding.ASCII interprets it.

Otherwise, it might be easier to just make sure to save the file as UTF-8 if you have that possibility.

The reason it works when you define the string in code is that .NET always holds strings in its internal encoding (UTF-16). What StreamReader does is convert the bytes it reads from the file into that internal encoding, using the encoding you specify when you create the StreamReader.
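To make the byte-level picture concrete, the sketch below decodes the single byte 0xBA with three encodings and then shows what º looks like once encoded as UTF-8. The variable names are illustrative only.

```csharp
using System;
using System.Text;

byte[] raw = { 0xBA }; // the byte ISO-8859-1 uses for º

string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(raw);
string asUtf8 = Encoding.UTF8.GetString(raw);   // a lone 0xBA is not valid UTF-8
string asAscii = Encoding.ASCII.GetString(raw); // outside the 7-bit range

Console.WriteLine((int)asLatin1[0]); // 186   (U+00BA, º)
Console.WriteLine((int)asUtf8[0]);   // 65533 (U+FFFD, shown as �)
Console.WriteLine(asAscii);          // ?

// Going the other way: in UTF-8, º takes two bytes.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00BA");
Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C2-BA
```

This is why a file saved in a single-byte encoding shows � when it is read as UTF-8: the byte 0xBA on its own is not a complete UTF-8 sequence.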

jishi
  • yes you're correct, the default is Unicode, my mistake...I was concerned saving in NotePad could be causing the issue...but I tried with other formats and I run into the same problem - I also tried what I will add to a new thread below... – Jordan Feb 11 '11 at 13:36
  • You can select the encoding to use via “File → Save As…” in Notepad. Use UTF-8 instead of the default ANSI and pass Encoding.UTF8 to the StreamReader. It should work. – ollb Feb 11 '11 at 13:50
  • Thank you! That did work. It will do the trick for now, but I will ultimately need to develop a work-around since this is to be used by the client. Thanks again! – Jordan Feb 11 '11 at 14:20
  • If you can't control the encoding of the document, you will have to try to identify the encoding firsthand. That would never be 100% accurate though. See this question: http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file – jishi Feb 11 '11 at 15:29
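One common heuristic for the "can't control the encoding" case, sketched here under assumptions rather than offered as a reliable detector: try a strict UTF-8 decode first and fall back to a single-byte encoding if the bytes are not valid UTF-8. `DecodeWithFallback` is a hypothetical helper name, and ISO-8859-1 as the fallback is a guess about what the client's files contain.

```csharp
using System;
using System.Text;

// Hypothetical helper: strict UTF-8 first, then the supplied fallback encoding.
string DecodeWithFallback(byte[] bytes, Encoding fallback)
{
    // throwOnInvalidBytes: true makes malformed UTF-8 throw instead of yielding U+FFFD.
    var strictUtf8 = new UTF8Encoding(
        encoderShouldEmitUTF8Identifier: false,
        throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(bytes);
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8, so assume the legacy single-byte encoding.
        return fallback.GetString(bytes);
    }
}

Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
byte[] legacyBytes = latin1.GetBytes("12\u00BA30'");
byte[] utf8Bytes = Encoding.UTF8.GetBytes("12\u00BA30'");

string fromLegacy = DecodeWithFallback(legacyBytes, latin1);
string fromUtf8 = DecodeWithFallback(utf8Bytes, latin1);
Console.WriteLine(fromLegacy == fromUtf8); // True
```

As the comment above says, this can never be 100% accurate: it only separates "valid UTF-8" from "everything else", and it cannot tell one single-byte code page from another. Note also that `GetString` does not strip a byte order mark if the file starts with one.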
0

You can open the file in an editor such as Notepad++ to see which encoding it uses, and change it to UTF-8. Then reading it the way you already do, with `var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);`, will work. I was able to read the degree symbol this way.