Combining answers from here and here, I created a function that checks whether the character I'm looking at is an EOL. I need it for strings with mixed line endings and possibly mixed encodings. I might even sanitize the string by replacing all line endings with \n.
// Check whether a (possibly multibyte) character is an EOL sequence.
// Assumes $char is UTF-8: "\u{...}" (PHP 7+) yields UTF-8 bytes, "\x.." a single byte.
protected function _is_eol($char) {
    static $eols = array(
        "\u{000D}\u{000A}", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\u{000A}", // [UNICODE] LF: Line Feed, U+000A
        "\u{000B}", // [UNICODE] VT: Vertical Tab, U+000B
        "\u{000C}", // [UNICODE] FF: Form Feed, U+000C
        "\u{000D}", // [UNICODE] CR: Carriage Return, U+000D
        "\u{0085}", // [UNICODE] NEL: Next Line, U+0085
        "\u{2028}", // [UNICODE] LS: Line Separator, U+2028
        "\u{2029}", // [UNICODE] PS: Paragraph Separator, U+2029
        "\x0D\x0A", // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\x0A\x0D", // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\x0A", // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\x0D", // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\x1E", // [ASCII] RS: QNX (pre-POSIX)
        "\x15" // [EBCDIC] NEL: OS/390, OS/400
    );
    // Strict comparison, same as looping with ===
    return in_array($char, $eols, true);
}
I might need to peek at the next character when the current character is CR or LF, so that I don't mistake CRLF or LFCR for two line endings, but otherwise this looks good to me. The problem is that I have no knowledge of encodings and no data to test it with yet.
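For the sanitizing idea, here is a minimal sketch that assumes the input has already been converted to valid UTF-8. The helper name `normalize_eol` is made up for illustration. It relies on PCRE's `\R` escape, which matches CR+LF as a single unit, so no manual look-ahead is needed. It does not treat LF+CR as one line ending, and it does not cover RS or the EBCDIC NEL byte.

```php
<?php
// Hypothetical helper (not part of the code above): replace every line
// break in a UTF-8 string with "\n".
function normalize_eol(string $text): string
{
    // (*BSR_UNICODE) makes \R match CR+LF (as one unit), LF, VT, FF, CR,
    // NEL (U+0085), LS (U+2028) and PS (U+2029); /u enables UTF-8 mode.
    $result = preg_replace('/(*BSR_UNICODE)\R/u', "\n", $text);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

echo bin2hex(normalize_eol("a\r\nb\rc")), PHP_EOL; // hex dump of the normalized string
```

Because CR+LF is consumed in one match, a Windows line ending becomes a single "\n" rather than two.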
Are there any fatal mistakes in my approach?
Am I missing line separators from other popular encodings?
The code says [UNICODE], but isn't there a difference between UTF-8, UTF-16 and UTF-32?
I found this snippet on GitHub:
// Comparisons use === here; a single = would assign instead of compare.
if ($this->file_encoding === 'UTF-16LE') {
    $this->line_separator = "\x0A\x00";
}
elseif ($this->file_encoding === 'UTF-16BE') {
    $this->line_separator = "\x00\x0A";
}
elseif ($this->file_encoding === 'UTF-32LE') {
    $this->line_separator = "\x0A\x00\x00\x00";
}
elseif ($this->file_encoding === 'UTF-32BE') {
    $this->line_separator = "\x00\x00\x00\x0A";
}
It made me think that I might be missing some. If I'm not mistaken, the last one, "\x00\x00\x00\x0A", would be "0x0000000A"?
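One way to check such byte sequences is to let PHP encode LF (U+000A) itself and dump the result as hex. This is only a sketch: it assumes the mbstring extension is available, and `lf_bytes` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the raw bytes that encode LF (U+000A)
// in the given target encoding. Requires the mbstring extension.
function lf_bytes(string $encoding): string
{
    // "\n" is the single byte 0x0A, which is LF in UTF-8.
    return mb_convert_encoding("\n", $encoding, 'UTF-8');
}

// Print the hex bytes for each encoding mentioned in the snippet.
foreach (['UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE'] as $enc) {
    echo $enc, ': ', bin2hex(lf_bytes($enc)), PHP_EOL;
}
```

The same code point ends up as one, two or four bytes depending on the encoding, with the byte order reversed between the LE and BE variants, so a byte-level comparison only works once the encoding is known.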