
I am writing a program that uses an existing library written by someone else. The library calls TheMovieDatabase.com and retrieves information about a movie, including the YouTube trailer name, such as 'sErD7Y00R_8'.

When I debug and inspect the string variable that holds the trailer name, it appears as 'sErD7Y00R_8'. However, when it is inserted into my database or printed to the console, a ? (question mark) is appended and it appears as 'sErD7Y00R_8?'.

This is obviously causing me some problems. I cannot figure out why this happens or how to fix it. My only guess is that the string contains some character that is not regular text.

Here is the link to the wrapper library: https://github.com/LordMike/TMDbLib/

This is the method I call in the wrapper library, passing in the ID 143049:

TMDbLib.Objects.Movies.Movie tmdbMovie = client.GetMovie(id, MovieMethods.Credits | MovieMethods.Keywords | MovieMethods.Images | MovieMethods.Trailers | MovieMethods.Reviews | MovieMethods.Releases);

and here is the print to console immediately after:

Console.WriteLine("'" + tmdbMovie.Trailers.Youtube[i].Source + "'");

The .Length property returns 12, so there appears to be one extra character that the debugger does not show but that prints as a ? in the console.

Per a comment, I printed out the Encoding.GetBytes details:

Encoding the entire string:
System.Text.UTF7Encoding       : 20  38  :73 45 72 44 37 59 30 30 52 2B 41 46 38 2D 38 2B 49 41 34 2D 
System.Text.UTF8Encoding       : 14  39  :73 45 72 44 37 59 30 30 52 5F 38 E2 80 8E 
System.Text.UnicodeEncoding    : 24  26  :73 00 45 00 72 00 44 00 37 00 59 00 30 00 30 00 52 00 5F 00 38 00 0E 20 
System.Text.UnicodeEncoding    : 24  26  :00 73 00 45 00 72 00 44 00 37 00 59 00 30 00 30 00 52 00 5F 00 38 20 0E 
System.Text.UTF32Encoding      : 48  52  :73 00 00 00 45 00 00 00 72 00 00 00 44 00 00 00 37 00 00 00 59 00 00 00 30 00 00 00 30 00 00 00 52 00 00 00 5F 00 00 00 38 00 00 00 0E 20 00 00 

Debug screenshot

Kairan
  • In the debugger, use `string.Length` to see how long the string is, and `string[i]` for each character to see what the character is. – John Saunders Mar 31 '15 at 23:27
  • You don't think that posting your code and a reference to the library you're using would be useful? – Enigmativity Mar 31 '15 at 23:28
  • @Enigmativity I will update my post – Kairan Mar 31 '15 at 23:30
  • Does this mysterious question mark always appear at the end of the string? – Beatles1692 Mar 31 '15 at 23:30
  • @Beatles1692 no it seems random, this particular movie has 2 trailer objects attached to it and the other one does not append a ? – Kairan Mar 31 '15 at 23:33
  • @JohnSaunders The .Length property returns 12, so there appears to be one extra character that the debugger does not show but that prints as a ? in the console. – Kairan Mar 31 '15 at 23:37
  • I think @JohnSaunders' suggestion can help you find the mysterious character that is displayed as a question mark. It is probably a control character. – Beatles1692 Mar 31 '15 at 23:39
  • Then try `string[11]` and see what the value is. That may help you track down where it comes from. Try `tmdbMovie.Trailers.Youtube[i].Source[11]` right after it comes in from the wrapper method. If it happens all the time for that property, then you can try `tmdbMovie.Trailers.Youtube[i].Source.Substring(0,11)` to drop the last character. – John Saunders Apr 01 '15 at 00:20
  • I recommend that you not just "remove non-ASCII characters". Instead, find out what that character is and why it's there, and only remove characters that are present for the same reason. You might find that there are other "non-ASCII" characters that you should not remove, such as accented characters. – John Saunders Apr 01 '15 at 00:25
  • @JohnSaunders That is a good suggestion, but I am not sure how to do that. So far I only have the information I have added to my post. How can I determine what the extra character is so that I can remove just that one? I would want to keep accented characters. – Kairan Apr 01 '15 at 00:54
  • I told you how. Look in the debugger. – John Saunders Apr 01 '15 at 01:21
  • Debugger shows only 'sErD7Y00R_8' viewing the value, and it is immediately after being returned by the wrapper. Image here of debugger https://dl.dropboxusercontent.com/u/108669500/debug_image.jpg – Kairan Apr 01 '15 at 01:42
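One way to follow the `string[i]` suggestion when the debugger's value view hides the character is to print the numeric code of every character instead of the character itself. The sketch below is only an illustration: `CodePointDump` and `Describe` are hypothetical names, and the sample string is reconstructed from the byte dump in the question, where the trailing UTF-16 bytes `0E 20` (UTF-8 `E2 80 8E`) correspond to U+200E, the LEFT-TO-RIGHT MARK.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CodePointDump
{
    // Returns one line per UTF-16 code unit: index, code in hex, Unicode category.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            sb.AppendFormat(CultureInfo.InvariantCulture,
                "[{0}] U+{1:X4} {2}",
                i, (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
            sb.Append('\n');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumption: stands in for tmdbMovie.Trailers.Youtube[i].Source,
        // rebuilt from the bytes shown in the question.
        string source = "sErD7Y00R_8\u200E";
        Console.Write(Describe(source));
    }
}
```

For the sample string, the last line reads `[11] U+200E Format`. U+200E is an invisible formatting character, which would explain why the debugger shows nothing there while a console that cannot render it prints a ? instead.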

2 Answers

5

It seems that the question mark appears because of an encoding mismatch. Since the string should contain only ASCII characters, we can remove the non-ASCII characters to resolve the mismatch.

To do so, we can use Regex to find non-ASCII characters (`[^\u0000-\u007F]`) and replace them with an empty string:

str=Regex.Replace(str, @"[^\u0000-\u007F]", string.Empty);
Beatles1692
  • Thanks. Do you have any thoughts on John's suggestion to figure out which Unicode character is in the string? That way I could google it and find out what it is, mostly out of curiosity. – Kairan Apr 01 '15 at 16:27
0

You are probably correct that it's an encoding mismatch producing the ?. The bottom line though, is what can you do? Unless you intend to change TMDbLib, your only real option is to clean the return value of tmdbMovie.Trailers.Youtube[i].Source heuristically.

Jim W
  • Should I just replace all ? with an empty string, or is there some secret string.someMethod() that will remove the odd characters that cause this issue? I looked at the TMDbLib code and it is a bit more advanced than I can figure out. – Kairan Mar 31 '15 at 23:40
  • A character that can't be displayed is replaced with ? on output, so you won't find any ? in the original string. You need to know exactly which character it is in order to remove it. – Beatles1692 Mar 31 '15 at 23:41
  • You can analyze the bytes of the string using the example here https://msdn.microsoft.com/en-us/library/ds4kkd55%28v=vs.110%29.aspx – Jim W Mar 31 '15 at 23:42
  • @Beatles1692 lol, that makes sense. Hoping someone might have the magic solution then as this is my first encounter with such strangeness – Kairan Mar 31 '15 at 23:43
  • You can use a regex to remove all control characters from your string – Beatles1692 Mar 31 '15 at 23:46
  • Take a look at http://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c – Beatles1692 Mar 31 '15 at 23:49
  • Is TMDbLib properly handling Unicode (or other character sets)? If so, it would be better not to strip out "bad characters" and, instead, to handle them properly. http://www.joelonsoftware.com/articles/Unicode.html – JasCav Mar 31 '15 at 23:50
  • @Beatles1692 I used the suggestion at the link you provided for stripping non-ascii characters from a string, and that worked, it is now not printing out a ? to console. – Kairan Apr 01 '15 at 00:00
  • @JasCav I could not say if it is handling it correctly. I am not really familiar with Unicode and others and also tried looking at their code, it was a bit advanced for me. But you are probably right, if they were handling it correctly I should not be receiving this passed back as a string. Since it is a youtube trailer video name, I am not sure why TMDB would even send strange unicode stuff – Kairan Apr 01 '15 at 00:02
  • @Beatles1692 If you want to add an answer, I can accept it, or Jim W can add this solution to his answer. Either way I will upvote you both for your help. Thank you very much, this was driving me crazy for several hours. – Kairan Apr 01 '15 at 00:07
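
As a narrower alternative to stripping every non-ASCII character, along the lines John Saunders and JasCav suggest, the cleanup can target only invisible formatting characters. This is a minimal sketch, assuming the stray character is U+200E (which is what the byte dump in the question indicates). `InvisibleCharCleaner` and `RemoveFormatChars` are hypothetical names, not part of TMDbLib.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class InvisibleCharCleaner
{
    // Drops characters in the Unicode "Format" category (Cf), such as U+200E,
    // and keeps everything else, including accented letters.
    public static string RemoveFormatChars(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumption: stands in for tmdbMovie.Trailers.Youtube[i].Source.
        string cleaned = RemoveFormatChars("sErD7Y00R_8\u200E");
        Console.WriteLine("'" + cleaned + "' length " + cleaned.Length);
    }
}
```

For a YouTube trailer name this should be safe, since such names are plain ASCII. For general text, note that the Format category also includes characters such as the zero-width joiner, which some scripts and emoji sequences rely on, so it may be better to remove only the specific characters you have actually observed.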