
I have a strange issue here and I'm hoping that someone can shed some light on it for me. I'd like to understand why this behavior occurs and how I can work around it in PHP.

I have a directory with names of people, and some of those names use non-ASCII characters. The one I'm having an issue with is the e-acute é. I am using scandir to get the contents of the directory. One of the names containing the e-acute fails, while others using the same character do not. I've copied the characters into my IDE, and it reports the character that scandir fails on as a regular e, although that's not what I'm visually looking at; the two characters look identical to me.

These are the characters that succeed and fail:

é This one shows as a regular 'e' and fails with scandir
é This one shows as e-acute and succeeds with scandir

Can someone tell me why this is? Also, is there a way to do a conversion on these types of characters so that I can be sure that scandir will not fail?

I should mention that I am already using a UTF-8 header at the top of the script.
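
One possible explanation, offered as a sketch rather than a diagnosis: the name that fails may store é as a plain e followed by a combining acute accent (U+0301), while the names that work use the single precomposed character U+00E9. That would match the IDE reporting a regular e. The PHP below shows the byte difference between the two forms and one way to convert to the precomposed form (NFC). The helper names `hasCombiningMarks` and `toNfc` are hypothetical; `toNfc` assumes the intl extension's `Normalizer` class and returns its input unchanged without it, and the `\u{...}` escapes need PHP 7.0 or later.

```php
<?php
// Two ways to write "é" in UTF-8 (escapes used so the bytes are unambiguous).
$precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
$decomposed  = "e\u{0301}";  // U+0065 "e" followed by U+0301 COMBINING ACUTE ACCENT

echo bin2hex($precomposed), "\n"; // c3a9
echo bin2hex($decomposed), "\n";  // 65cc81

// Hypothetical helper: true if the string contains any Unicode combining mark.
function hasCombiningMarks(string $s): bool
{
    return preg_match('/\p{M}/u', $s) === 1;
}

// Hypothetical helper: convert to NFC (precomposed form) when intl is available.
function toNfc(string $s): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        return $normalized === false ? $s : $normalized;
    }
    return $s; // intl not loaded: leave the string untouched
}
```

With intl available, passing names through `toNfc()` before comparing them makes the two spellings compare equal. Whether that also changes what scandir returns on Windows depends on the PHP version and on how the file system API is called, which this sketch does not address.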

NaN
  • It's different. I'm not using a database here, and my question is *why* this behavior is happening. If I understand the **why** then I can fix it. – NaN Nov 20 '17 at 22:26
  • Which OS? Which version? It sounds like a file system/OS issue, not a code/PHP one. – nogad Nov 20 '17 at 22:28
  • @nogad, I'm using this on Windows 7. – NaN Nov 20 '17 at 22:31
  • Assuming NTFS, filenames are stored as UTF16; but generally speaking Windows fopen uses ANSI (locale-specific) – Mark Baker Nov 20 '17 at 22:43
  • @MarkBaker, Mark, it is NTFS. Let me ask you this... When development is complete, I will be using a Linux box to host. Will the issue be resolved by using Linux or is there anything else I need to do? – NaN Nov 20 '17 at 22:44
  • Not a good idea to try and use filesystem for this, especially if you want it to work cross-platform.... [what charset encoding is used for filenames and paths on linux](https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux) – Mark Baker Nov 20 '17 at 22:46
  • Thanks for the link, Mark. I'll have a look at it now. – NaN Nov 20 '17 at 22:50
  • As a recommendation.... generate a simple unique alphanumeric filename/directoryname to use for each file on the filesystem, and use a table on a db to reference a UTF-8 name with the filesystem name – Mark Baker Nov 20 '17 at 22:52
  • I have to say, this whole UTF-8 thing is a nightmare. I don't understand why it is that there is no native PHP function that can detect a bad byte stream, or in my case, invalid characters. I like your suggestion, and as a final solution, I will use it, but there is something in me that thinks that there has to be some other way of either converting bad characters or at the very least, having a way to test for these characters so I can convert them programmatically. – NaN Nov 20 '17 at 23:05
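
On testing for a bad byte stream: PHP can check whether a string is well-formed UTF-8. A minimal sketch, assuming only the bundled PCRE functions; the helper name `isValidUtf8` is hypothetical. `mb_check_encoding($s, 'UTF-8')` from the mbstring extension is an alternative for the same check.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
function isValidUtf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false for an invalid UTF-8
    // subject; the empty pattern matches any valid subject and returns 1.
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8("\u{00E9}"));  // precomposed é: valid UTF-8
var_dump(isValidUtf8("e\u{0301}")); // e + combining acute: also valid UTF-8
var_dump(isValidUtf8("\xE9"));      // lone Latin-1 byte for é: not valid UTF-8
```

Note that a name written as e plus a combining accent passes this check. It is a different but valid spelling rather than an invalid character, so a validity test alone would not flag it.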

0 Answers