13

Reference: This is a self-answered question. It was meant to share the knowledge, Q&A style.

How do I detect the type of end of line character in PHP?

PS: I've been writing this code from scratch for too long now, so I decided to share it on SO, plus, I'm sure someone will find ways for improvement.

Christian

7 Answers

9
/**
 * Detects the end-of-line character of a string.
 * @param string $str The string to check.
 * @param string $default Default EOL (if not detected).
 * @return string The detected EOL, or default one.
 */
function detectEol($str, $default=''){
    // PHP hex escapes are written "\x0D"; "\0x0D" would be a NUL byte followed by the text "x0D".
    // The [UNICODE] entries use "\u{...}" (PHP 7+), which emits UTF-8, so below U+0080
    // they are byte-identical to the matching [ASCII] entries.
    static $eols = array(
        "\u{000D}\u{000A}", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\u{000A}",         // [UNICODE] LF: Line Feed, U+000A
        "\u{000B}",         // [UNICODE] VT: Vertical Tab, U+000B
        "\u{000C}",         // [UNICODE] FF: Form Feed, U+000C
        "\u{000D}",         // [UNICODE] CR: Carriage Return, U+000D
        "\u{0085}",         // [UNICODE] NEL: Next Line, U+0085
        "\u{2028}",         // [UNICODE] LS: Line Separator, U+2028
        "\u{2029}",         // [UNICODE] PS: Paragraph Separator, U+2029
        "\x0D\x0A",         // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\x0A\x0D",         // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\x0A",             // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\x0D",             // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\x1E",             // [ASCII] RS: QNX (pre-POSIX)
        //"\x76",           // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
        "\x15",             // [EBCDIC] NEL: OS/390, OS/400
    );
    $cur_cnt = 0;
    $cur_eol = $default;
    // Keep the candidate with the highest count; on a tie the earlier entry wins
    foreach($eols as $eol){
        if(($count = substr_count($str, $eol)) > $cur_cnt){
            $cur_cnt = $count;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}

Notes:

  • Needs to check encoding type
  • Needs to somehow know that we may be on an exotic system like ZX8x (since ASCII x76 is a regular letter). @radu raised a good point; in my case, it's not worth the effort to handle ZX8x systems nicely.
  • Should I split the function into two? mb_detect_eol() (multibyte) and detect_eol()
Christian
  • Are you sure about mixing encodings? At least `0A` appears twice. @Alexander The source is linked in the question. Christian just wanted to ask a question that he wants to answer himself. – KingCrunch Jun 16 '12 at 20:42
  • So which one is "\r\n"? Doesn't the server have a way of taking care of whichever environment it is operating on? – Chibueze Opata Jun 16 '12 at 20:44
  • **Alexander**, I've answered my own question. See my note in the main question. **KingCrunch** To be honest, I didn't think about that. **Chibueze Opata** `\r\n` is ASCII CR+LF (Windows). If it wasn't obvious, my code aims to find the EOL of any string, even if it came from another server, client or a remote database. PHP is completely oblivious to what your client browser is using as EOL. – Christian Jun 16 '12 at 21:25
  • What about "mixed line endings"? To me it does not seem unusual for a vertical tab and a regular line feed to appear in the same file as a paragraph separator. And this code snippet silently assumes that every file is well formed. – KingCrunch Jun 16 '12 at 21:37
  • Hmm, that's a good point. It should cater for cases where different EOL types might exist. Then again, I'll have to check which of them make sense to co-exist. – Christian Jun 16 '12 at 21:38
  • @Christian, also, are you sure `0x1E`, `0x76` and `0x15` can't be part of a multibyte character? Maybe it would be a good idea to leave these out, if you're not convinced that they're going to be useful (the OSs mentioned look pretty old). – rid Jun 16 '12 at 21:43
  • @Radu Wikipedia seems to claim so. I don't have an IBM mainframe nor a Sinclair ZX8x at hand to check. **:D** – Christian Jun 16 '12 at 21:46
  • @Christian, what I mean is, even if they are indeed EOL on these platforms, they might also be part of a UTF-8 character for example. So if the document contains that character, you would erroneously find that it contains an EOL, when in fact it doesn't. For example, there is the Unicode character "latin capital letter sharp s" which has the code `U+1E9E`. If the document would contain this character, your code would conclude that it contains an EOL instead of the "sharp s" character, because you're looking for `0x1E`, which is part of the "sharp s" character. – rid Jun 16 '12 at 21:48
  • @Christian, if these (very) old systems are not a primary concern, better safe than sorry, I think. Otherwise, maybe try to determine the document's encoding before applying this method. – rid Jun 16 '12 at 21:52
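Whether a byte such as `0x1E` can hide inside a multibyte character depends on the encoding, which is why the note about checking the encoding matters. The sketch below compares two encodings of U+1E9E. It assumes PHP 7+ for the `\u{...}` escape; the UTF-16BE bytes are written out by hand and `containsRs()` is a hypothetical helper name. In UTF-8 every byte of a multibyte sequence has its high bit set, so `0x1E` cannot occur there, whereas in UTF-16 it can.

```php
<?php
// U+1E9E (LATIN CAPITAL LETTER SHARP S) in two encodings.
$utf8    = "\u{1E9E}";   // PHP 7+ emits UTF-8: bytes E1 BA 9E
$utf16be = "\x1E\x9E";   // UTF-16BE, written out by hand: bytes 1E 9E

// Hypothetical helper: does the byte string contain the RS control byte 0x1E?
function containsRs(string $bytes): bool
{
    return strpos($bytes, "\x1E") !== false;
}

var_dump(bin2hex($utf8));        // string(6) "e1ba9e"
var_dump(containsRs($utf8));     // bool(false)
var_dump(containsRs($utf16be));  // bool(true)
```

So a byte-wise search for `0x1E`, `0x15` or similar is only at risk of false positives when the input is UTF-16 or another encoding that reuses those byte values.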
7

Wouldn't it be easier to just replace everything except new lines using regex?

The dot matches a single character, without caring what that character is. The only exception are newline characters.

With that in mind, we do some magic:

$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);

Not sure if we can trust regex to do all this, but I don't have anything to test with.
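What the dot refuses to match depends on the newline convention PCRE is using. The sketch below makes that convention explicit with the pattern-start verbs `(*LF)` and `(*ANYCRLF)` instead of relying on the build default; `stripDotMatches()` is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper: delete everything the dot matches under a given PCRE newline convention.
function stripDotMatches(string $text, string $newlineVerb): string
{
    return preg_replace('/' . $newlineVerb . '.*/', '', $text);
}

$text = "a\r\nb\nc";

// With (*LF) only "\n" counts as a newline, so the dot also swallows "\r".
var_dump(stripDotMatches($text, '(*LF)') === "\n\n");         // bool(true)

// With (*ANYCRLF) neither "\r" nor "\n" is matched by the dot, so both survive.
var_dump(stripDotMatches($text, '(*ANYCRLF)') === "\r\n\n");  // bool(true)
```

Under the LF-only convention a CR+LF document is therefore reduced to bare LFs, so the carriage returns are lost before they can be inspected.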


flagg19
ohaal
  • What if you have mixed content? For example first few lines end in CR+LF and the rest in LF? I need something that tells me which line ending is used primarily. – transilvlad Nov 14 '13 at 09:45
  • Interesting question. I'm not even sure if my theory works, but if it does, this might work for you, returning the most used newline: `$arr = array_count_values(str_split($newlines));arsort($arr);return key($arr);` – ohaal Nov 14 '13 at 10:36
  • Sorry, it does not work: if the entire document has CR+LF, it returns LF. – transilvlad Nov 14 '13 at 11:33
  • By default regex considers 'newline' to be only \n. (This can be changed with build options). However I did find a regex that will work above instead of the '/.*/' and it is '/(*ANYCRLF)./'. There is a very good article about regex and line endings here: https://nikic.github.io/2011/12/10/PCRE-and-newlines.html – Richard - Rogue Wave Limited Sep 19 '16 at 17:06
  • @transilvlad just include both then. Something like `[^\n\r]` to match everything but newline and carriage return. Then count both. If they're equal, assume it's a Windows-style file. If unequal, a mixed file. If only `\n`, a Unix-style file. Of course, this won't always work either. It won't tell you if there are Windows-style newlines inside of quoted text while the overall file is Unix-style. But I think it will work for your question. – Buttle Butkus Apr 20 '23 at 21:50
4

The answers already given provide enough information. The following code (based on those answers) might help even more:

  • It returns the EOL sequence that was found.
  • The detection also sets a key which an application can use to refer to that EOL.
  • It shows how to use this in a utility class.
  • It shows how to detect the EOL of a file, returning the key name of the EOL found.
  • I hope this is of use to all of you.
    /**
    Newline characters in different Operating Systems
    The names given to the different sequences are:
    ============================================================================================
    NewL  Chars       Name     Description
    ----- ----------- -------- ------------------------------------------------------------------
    LF    0x0A        UNIX     Apple OSX, UNIX, Linux
    CR    0x0D        TRS80    Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc
    LFCR  0x0A 0x0D   ACORN    Acorn BBC and RISC OS spooled text output.
    CRLF  0x0D 0x0A   WINDOWS  Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix
                              and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), OS/2,
    ----- ----------- -------- ------------------------------------------------------------------
    */
    const EOL_UNIX    = 'lf';        // Code: \n
    const EOL_TRS80   = 'cr';        // Code: \r
    const EOL_ACORN   = 'lfcr';      // Code: \n \r
    const EOL_WINDOWS = 'crlf';      // Code: \r \n
    

    Then use the following code in a static utility class to detect the EOL:

    /**
    Detects the end-of-line character of a string.
    @param string $str      The string to check.
    @param string $key      [io] Name of the detected eol key.
    @return string The detected EOL, or an empty string if none was found.
    */
    public static function detectEOL($str, &$key) {
       // Two-character sequences come first so that they win a tie against their single characters
       static $eols = array(
          self::EOL_ACORN   => "\n\r",  // 0x0A - 0x0D - acorn BBC
          self::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
          self::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
          self::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
       );
    
       $key = "";
       $curCount = 0;
       $curEol = '';
       foreach($eols as $k => $eol) {
          if( ($count = substr_count($str, $eol)) > $curCount) {
             $curCount = $count;
             $curEol = $eol;
             $key = $k;
          }
       }
       return $curEol;
    }  // detectEOL
    

    and then for a file:

    /**
    Detects the EOL of a file by checking its first line.
    @param string  $fileName    File to be tested (full pathname).
    @return string|false  false on failure, otherwise the key of the detected EOL
                          ('lf', 'cr', 'lfcr' or 'crlf'; '' if none was found).
    @uses detectEOL
    */
    public static function detectFileEOL($fileName) {
       if (!file_exists($fileName)) {
         return false;
       }
    
       // Read the first line, including its line terminator
       $handle = @fopen($fileName, "rb");
       if ($handle === false) {
          return false;
       }
       $line = fgets($handle);
       fclose($handle);
       if ($line === false) {
          return false;   // empty or unreadable file
       }
       $key = "";
       self::detectEOL($line, $key);
    
       return $key;
    }  // detectFileEOL
    

    Both methods are static members of the same class, so self:: resolves to whatever name you give your implementation class.

    Harm
    4

    Since I could make neither ohaal's nor transilvlad's answer work, here is mine:

    function detect_newline_type($content) {
        $arr = array_count_values(
                   explode(
                       ' ',
                       preg_replace(
                           '/[^\r\n]*(\r\n|\n|\r)/',
                           '\1 ',
                           $content
                       )
                   )
               );
        arsort($arr);
        return key($arr);
    }
    

    Explanation:

    The general idea in both proposed solutions is good, but implementation details hinder the usefulness of those answers.

    Indeed, the point of this function is to return the kind of newline used in a file, and that newline can be either one or two characters long.

    This alone makes str_split() the wrong tool, since it cuts at fixed lengths. The only way to cut the tokens correctly is to use a function that splits a string into variable-length pieces based on a delimiter. That is where explode() comes into play.

    But to give useful markers to explode, it is necessary to replace the right characters, in the right amount, by the right match. And most of the magic happens in the regular expression.

    3 points have to be considered:

    1. Using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character, or part of one, . will match it incorrectly (reminder: we are detecting newlines because they could be different from the ones on our system; otherwise there is no point).
    2. Replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but becomes an issue as soon as we want a separator (since we remove all characters but the newlines, any character that isn't a newline is a valid separator). Hence the idea to capture the newline in a group and use a backreference to that group in the replacement.
    3. It is possible that in the content, multiple newlines will be in a row. However we do not want to group them in that case, since they will be seen by the rest of the code as different types of newlines. That is why the list of newlines is explicitly stated in the match for the backreference.
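The same tokenizing idea can also be sketched without the replace-and-explode round trip, by letting `preg_match_all()` collect the newline sequences directly. This is a minimal sketch, not the answer's original method: `dominantNewline()` is a hypothetical name, and `array_key_first()` assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helper: return the most frequent newline sequence, or null if there is none.
function dominantNewline(string $content): ?string
{
    // Alternation order matters: CR+LF is tried before a lone LF or CR,
    // so a Windows line ending is counted once, as a two-character token.
    if (!preg_match_all('/\r\n|\n|\r/', $content, $matches)) {
        return null;
    }
    $counts = array_count_values($matches[0]);
    arsort($counts);                  // highest count first, keys preserved
    return array_key_first($counts);
}

var_dump(dominantNewline("a\r\nb\r\nc\n") === "\r\n");  // bool(true)
var_dump(dominantNewline("no newline here"));           // NULL
```

Unlike the explode() version, text after the last newline never ends up among the counted tokens, and input without any newline yields null rather than a word from the content.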
    Community
    7heo.tk
    • This worked for me. To test whether a script has been saved with Windows or Unix line encodings you just need to call strlen() on the string sent back by this function (2 = Windows CR+LF, 1 = Unix LF). – Noel Whitemore Feb 23 '18 at 15:58
    1

    Based on ohaal's answer.

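    This can return one or two characters for the EOL, such as LF or CR+LF.

      $newlines = preg_replace("/[^\r\n]/", "", $string);   // keep only CR and LF
      $eols = array_count_values(str_split($newlines));
      // Most frequent character wins; on a tie (e.g. CR+LF) both are kept, in order of first appearance
      $eola = $eols ? array_keys($eols, max($eols)) : array();
      $eol = implode("", $eola);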
    
    transilvlad
    • Interesting subject and interesting discussion. Curious though if we could have a case where the real EOL is two characters (CR+LF for example) but a lone CR or LF is found elsewhere in the document. Then this lone character will have a higher occurrence count than the real EOL. Should we not, in this case, have a way to give priority to the two-character solution even though the single character has a higher count? Shoot me down if I'm way off base; I have thick skin. :-) – Kiser Feb 03 '19 at 17:12
    • Check my solution, why do you care about what is inside a line? – Sorin Trimbitas Oct 22 '20 at 06:14
    0

    If you care just about LF/CRs here is a method I wrote. No need to treat all possible cases of files you'll never ever see.

    /**
     * @param  string  $path
     * @param  string  $format  real or human_readable
     * @return false|string
     * @author Sorin-Iulian Trimbitas
     */
    public static function getLineBreak(string $path, $format = 'real')
    {
        // Hopefully my idea is ok, the rest of the stuff from the internet doesn't seem to work ok in some cases
        // 1. Take the first line of the CSV
        $file = new \SplFileObject($path);
        $line = $file->getCurrentLine();
        // Do we have an empty line?
        if (mb_strlen($line) == 1) {
            // Try the next line
            $file->next();
            $line = $file->getCurrentLine();
            if (mb_strlen($line) == 1) {
                // Give up
                return false;
            }
        }
        // What do we have at its end?
        $last_char = mb_substr($line, -1);
        $penultimate_char = mb_substr($line, -2, 1);
        if ($last_char == "\n" || $last_char == "\r") {
            $real_format = $last_char;
            if ($penultimate_char == "\n" || $penultimate_char == "\r") {
                $real_format = $penultimate_char.$real_format;
            }
            if ($format == 'real') {
                return $real_format;
            }
            return str_replace(["\n", "\r"], ['LF', 'CR'], $real_format);
        }
        return false;
    }
    
    Sorin Trimbitas
    0

    PHP is not my main language, but I tried to keep this simple and memory-aware. Corrections, comments, and edits are welcome.

    <?php
    function eol_detect(&$str, $buffSize=1024) {
        $buff = substr($str, 0, $buffSize);
        $eol = null;
    
        if (strpos($buff, "\r\n") !== false)
            $eol = "\r\n";
        elseif (strpos($buff, "\n") !== false)
            $eol = "\n";
        elseif (strpos($buff, "\r") !== false)
            $eol = "\r";
        
        return $eol;
    }
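When the input is a file rather than a string already in memory, the same first-chunk idea can be applied while reading. The following is only a sketch under stated assumptions: `eol_detect_file()` is a hypothetical name, and the extra one-byte read is an addition not present in the function above, meant to avoid mistaking a CR+LF pair that straddles the buffer boundary for a lone CR.

```php
<?php
// Hypothetical helper: detect the EOL of a file from its first $buffSize bytes.
function eol_detect_file(string $path, int $buffSize = 1024): ?string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $buff = (string) fread($handle, $buffSize);
    // If the chunk stops right after a CR, peek one more byte so that a
    // CR+LF pair split by the buffer boundary is still seen as CR+LF.
    if (substr($buff, -1) === "\r") {
        $buff .= (string) fread($handle, 1);
    }
    fclose($handle);

    // Same priority as above: CR+LF first, then LF, then CR.
    foreach (["\r\n", "\n", "\r"] as $eol) {
        if (strpos($buff, $eol) !== false) {
            return $eol;
        }
    }
    return null;
}
```

Only the first chunk is ever read, so memory use stays bounded regardless of file size; the trade-off is that line endings appearing later in the file are ignored.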
    
    Felipe Buccioni