
I'm trying to read a text file full of Twitter screen names and store them in a database. Screen names can't be longer than 15 characters, so one of my checks ensures that the name isn't longer than that.

I've found something really strange going on when I try to upload AmericanExpress.

These are the contents of my text file:

americanexpress
AmericanExpress‎
AMERICANEXPRESS

And this is my code:

var names = new List<string>();
var badNames = new List<string>();

using (StreamReader reader = new StreamReader(file.InputStream, Encoding.UTF8))
{
    string line;
    while (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        var name = line.ToLower().Trim();

        Debug.WriteLine(line + " " + line.Length + " " + name + " " + name.Length);
        if (name.Length > 15 || string.IsNullOrWhiteSpace(name))
        {
            badNames.Add(name);
            continue;
        }

        if (names.Contains(name))
        {
            continue;
        }

        names.Add(name);
    }
}

The first americanexpress passes the length check (15 characters or fewer), the second fails, and the third passes. When I debug the code and hover over name during the second iteration, for AmericanExpress, this is what I get:

[Two screenshots of the debugger tooltip for name; images not available]

And this is Debug output:

americanexpress 15 americanexpress 15
AmericanExpress‎ 16 americanexpress‎ 16
AMERICANEXPRESS 15 americanexpress 15

I've counted the characters in AmericanExpress at least 10 times, and I'm pretty sure it's only 15 characters.

Does anyone have any idea why Visual Studio is telling me americanexpress.Length = 16?

SOLUTION

name = Regex.Replace(name, @"[^\u0000-\u007F]", string.Empty);
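To show where that line would sit in the loop above, here is a minimal sketch. The class and method names (ScreenNameCleaner, Clean, IsValid) are made up for illustration and are not part of the original code. It assumes every valid screen name is plain ASCII, so dropping everything else is acceptable.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class ScreenNameCleaner
{
    private const int MaxLength = 15;

    // Removes every character outside the ASCII range first, then trims and lower-cases.
    // Trim() alone does not help here: U+200E is a format character, not white space.
    public static string Clean(string raw)
    {
        string asciiOnly = Regex.Replace(raw, @"[^\u0000-\u007F]", string.Empty);
        return asciiOnly.Trim().ToLowerInvariant();
    }

    // Same rule as the check in the question: non-empty and at most 15 characters.
    public static bool IsValid(string cleaned)
    {
        return !string.IsNullOrWhiteSpace(cleaned) && cleaned.Length <= MaxLength;
    }
}
```

In the loop, var name = ScreenNameCleaner.Clean(line); would replace the ToLower().Trim() call, and the copy-pasted AmericanExpress line then comes out as 15 characters.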

Owen
  • Is it test data you created, or did you get it from somewhere? There can be a zero-width Unicode character inside the text: not visible, but certainly counted in the length of the word. Also, some letters can be Unicode characters that look like normal letters. I'm pretty sure a C# string will count each of them as one character, but it's worth checking. – Jarek Oct 10 '13 at 11:41
  • It's data I've been given. AmericanExpress I copy pasted to my text file and the others I added. When I deleted AmericanExpress and wrote it myself it was down to 15 characters. – Owen Oct 10 '13 at 13:14

1 Answer


After the final s there is a character that is not visible but still counts as a char. Look at index 15:

name[15]    8206 '‎'

Character 8206 is U+200E (LEFT-TO-RIGHT MARK); for details see http://www.fileformat.info/info/unicode/char/200e/index.htm
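If the debugger isn't handy, the same check can be done in code. This is only a sketch: CharInspector and Describe are hypothetical names. It lists each char of a string with its index, code point and Unicode category, so an invisible character shows up as an extra entry.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not from the original post.
public static class CharInspector
{
    // Returns one entry per UTF-16 code unit: index, code point as U+XXXX, Unicode category.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            lines.Add($"{i}: U+{(int)c:X4} {category}");
        }
        return lines;
    }
}
```

For the copy-pasted name, the entry at index 15 should read "15: U+200E Format", while the 15 letters before it are reported as LowercaseLetter.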

Possible solution: read only the ASCII values:

var name = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(line.ToLower().Trim()));
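One caveat: Encoding.ASCII replaces any character it cannot encode with '?' by default, so the round trip above leaves a question mark where the hidden character was instead of removing it. A sketch of a variant that drops such characters, assuming an ASCII encoding created with an empty replacement fallback (AsciiFilter and KeepAscii are hypothetical names):

```csharp
using System.Text;

// Hypothetical helper, not from the original post.
public static class AsciiFilter
{
    // ASCII encoding whose fallbacks substitute an empty string,
    // so unencodable characters are dropped rather than turned into '?'.
    private static readonly Encoding AsciiDropping = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    // Round-trips the text through ASCII bytes, keeping only ASCII characters.
    public static string KeepAscii(string text)
    {
        byte[] bytes = AsciiDropping.GetBytes(text);
        return AsciiDropping.GetString(bytes);
    }
}
```

The filter has to run before the length check, e.g. var name = AsciiFilter.KeepAscii(line).ToLower().Trim();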
lordkain
  • Yep that's the one! But changing it to ASCII I get this: "AmericanExpress???". I just found the solution to removing it here: http://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c – Owen Oct 10 '13 at 13:08
  • Good to hear you fixed your problem – lordkain Oct 10 '13 at 13:13
  • Yeh thought I might be going a little crazy for a while. Thanks for pointing me in the right direction. – Owen Oct 10 '13 at 13:21