164

I have a form that allows the user to either upload a text file or copy/paste the contents of the file into a textarea. I can easily differentiate between the two and put whichever one they entered into a string variable, but where do I go from there?

I need to iterate over each line of the string (preferably not worrying about newlines on different machines), make sure that it has exactly one token (no spaces, tabs, commas, etc.), sanitize the data, then generate an SQL query based off of all of the lines.

I'm a fairly good programmer, so I know the general idea about how to do it, but it's been so long since I worked with PHP that I feel I am searching for the wrong things and thus coming up with useless information. The key problem I'm having is that I want to read the contents of the string line-by-line. If it were a file, it would be easy.

I'm mostly looking for useful PHP functions, not an algorithm for how to do it. Any suggestions?

Jon Seigel
Topher Fangio
  • You may want to normalize the newlines first. The method `s($myString)->normalizeLineEndings()` is available with https://github.com/delight-im/PHP-Str (library under MIT License) which has lots of other useful string helpers. You may want to take a look at the source code. – caw Jul 25 '16 at 04:03

8 Answers

225

Use preg_split on the variable containing the text, and iterate over the returned array:

foreach(preg_split("/((\r?\n)|(\r\n?))/", $subject) as $line){
    // do stuff with $line
} 
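
As a rough sketch of how the split could feed the one-token-per-line check from the question: the pattern `/\r\n|\r|\n/` is the simpler alternative suggested in the comments, and `splitLines` and `isSingleToken` are hypothetical helper names, not built-in PHP functions. What counts as a "token" is an assumption here (no whitespace, no commas).

```php
<?php
// Hypothetical helper: split text on CRLF, lone CR, or lone LF.
function splitLines(string $text): array
{
    // "\r\n" is listed first so a Windows line break counts as one break.
    return preg_split('/\r\n|\r|\n/', $text);
}

// Hypothetical helper: true when the line is exactly one token,
// assumed here to mean "no whitespace and no commas".
function isSingleToken(string $line): bool
{
    return preg_match('/^[^\s,]+$/', $line) === 1;
}

$subject = "alpha\r\nbeta\ngamma delta\rone,two\r";

$valid = [];
foreach (splitLines($subject) as $line) {
    if (isSingleToken($line)) {
        $valid[] = $line;
    }
}
// $valid holds "alpha" and "beta"; the other lines (and the trailing
// empty line) are rejected.
```

Rejected lines could just as well be collected and reported back to the user rather than silently dropped.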
lorem monkey
Kyril
  • Will this handle ^M in addition to \n\r ? – Topher Fangio Sep 22 '09 at 21:37
  • I'm not sure if the ASCII carriage return gets converted to \r once it's placed inside a variable. If not you can always use split()/explode() with the ASCII value instead -- chr(13) – Kyril Sep 22 '09 at 21:52
  • It will handle CR+LF (Windows and various Internet protocols) and LF (Unix) forms. It won't handle plain CR (Mac) or LF+CR (don't know of anything that uses this). But the brackets in the regexp are unnecessary. – Stewart Jan 04 '11 at 00:11
  • 12
    A better regexp is `/((\r?\n)|(\r\n?))/`. – Félix Saparelli Nov 12 '11 at 05:02
  • 3
    To match Unix LF (\n), MacOS<9 CR (\r), Windows CR+LF (\r\n) and rare LF+CR (\n\r) it should be: `/((\r?\n)|(\n?\r))/` – Waiting for Dev... Mar 07 '12 at 16:09
  • what about the simple `'/\r\n|\r|\n/'`? –  Dec 24 '12 at 02:37
  • 2
    This is likely to bomb catastrophically for multi-byte data. – pguardiario Jul 12 '13 at 10:42
  • Regular expression is complete overkill here, slow and memory intensive. Also, have you set a limit to the size of the text file? – Erwin Wessels Jan 23 '14 at 07:03
  • @CrisG `/(\r?\n?)/` can match `''` (the empty string), which will essentially split the string into characters. – Félix Saparelli Mar 02 '14 at 03:19
  • If you want something short, [CodeAngry shows](http://stackoverflow.com/a/17755934/231788) that `/[\r\n]+/` works. I have to add the caveat that _this will match single and multiple newlines indiscriminately_, which might be unwanted. Of course, proper newline-splitting shouldn't be done using regexp (for performance reasons). – Félix Saparelli Mar 02 '14 at 03:31
189

I would like to propose a significantly faster (and memory efficient) alternative: strtok rather than preg_split.

$separator = "\r\n";
$line = strtok($subject, $separator);

while ($line !== false) {
    # do something with $line
    $line = strtok( $separator );
}

Testing the performance, I iterated 100 times over a test file with 17 thousand lines: preg_split took 27.7 seconds, whereas strtok took 1.4 seconds.

Note that although $separator is defined as "\r\n", strtok splits on either character and, as of PHP 4.1.0, skips empty lines/tokens.

See the strtok manual entry: http://php.net/strtok
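
A small sketch of that behaviour, using a made-up sample string with mixed line endings and one blank line:

```php
<?php
// Sample input mixing Windows (\r\n), Unix (\n) and old Mac (\r) endings,
// plus one blank line between "second" and "third".
$subject = "first\r\nsecond\n\nthird\rfourth";

$lines = [];
$line = strtok($subject, "\r\n");
while ($line !== false) {
    $lines[] = $line;
    $line = strtok("\r\n");
}
// $lines is ["first", "second", "third", "fourth"]; the blank line is gone.
```

Because runs of separator characters collapse into one, this approach cannot report empty lines; that matters if blank lines are meaningful in your input.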

Outspaced
Erwin Wessels
  • Thanks! I'll give this a try to see if we can speed it up. Although, we're usually only dealing with about 200 lines, so speed generally isn't an issue :-) – Topher Fangio Feb 11 '13 at 21:19
  • 28
    **+1 for performance considerations** when dealing with large line sets. – CodeAngry Jul 19 '13 at 21:32
  • 5
    Although this function's API is a total mess (call with different parameters), this is the best solution. Neither `preg_split` nor `explode` should be used for yielding structured string fragments. It's like **aiming at a fly with a bazooka**. – Maciej Sz Mar 25 '14 at 14:53
  • 2
    If you check the memory usage while the app is running, then you'll see the magic. It actually pulls the file you're reading into memory in the event you loop through each of the lines, *and* it keeps your token location. You'll want to flush that to be truly memory efficient. http://php.net/strtok#103051 – AbsoluteƵERØ Aug 30 '17 at 06:43
  • 2
    quick note, using `strtok()` on something else inside that `while` loop will break things. I was also using it to grab everything in a string up to the first space (https://stackoverflow.com/a/2477411/1767412) and took me a minute to realize why things weren't going as planned – But those new buttons though.. Jun 13 '18 at 19:55
  • 2
    should be the accepted answer, probably the fastest solution from all options. – John Nov 25 '18 at 02:33
  • I'm getting ridiculous memory usage using this method for some reason. Just to run about 20 URLs, it was exhausting 134M of memory. `explode` worked just fine. – zen Apr 24 '19 at 18:25
  • Maybe, @zen, you're not tokenizing until the end, or have the subject string stay in memory? – Erwin Wessels Apr 25 '19 at 12:53
  • Thanks, `strtok()` is pure gold! My script ran considerably faster compared to the explode newlines method I was using earlier. – Prahlad Yeri Jun 04 '23 at 19:06
107

If you need to handle newlines on different systems, you can use the PHP predefined constant PHP_EOL (http://php.net/manual/en/reserved.constants.php) together with explode, which avoids the overhead of the regular expression engine.

$lines = explode(PHP_EOL, $subject);
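
If the text may have been written on a different platform than the one the script runs on, one option is to normalize the line endings first and then explode on "\n". This is a sketch of that variation, not part of the original one-liner, and `explodeLines` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: convert CRLF and lone CR to LF, then split on LF.
function explodeLines(string $text): array
{
    // "\r\n" is replaced before "\r" so a Windows break is not doubled.
    $normalized = str_replace(["\r\n", "\r"], "\n", $text);
    return explode("\n", $normalized);
}

$lines = explodeLines("one\r\ntwo\rthree\nfour\n");
// $lines is ["one", "two", "three", "four", ""]: a trailing line
// terminator produces a final empty string.
```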
FerCa
  • 35
    Beware: It will work *on different systems* but it won't work well with strings *from different systems*. The [PHP Manual](http://php.net/manual/en/reserved.constants.php) states that `PHP_EOL (string)` is _The correct 'End Of Line' symbol for **this** platform._ – wadim Dec 11 '13 at 14:57
  • 1
    @wadim is right! If you are processing a Windows text file on a Unix server, it will fail. – javsmo Jan 09 '14 at 17:47
  • 1
    Beware that depending on the length of your lines, this can eat very large amounts of memory for big strings. – Synchro Mar 03 '15 at 14:30
  • 1
    Note that if the last line contains a line terminator, then this will also return another empty string after that. –  May 19 '18 at 10:57
25

It's overly complicated and ugly, but in my opinion this is the way to go:

$fp = fopen("php://memory", 'r+');
fputs($fp, $data);
rewind($fp);
while(($line = fgets($fp)) !== false){ // explicit check so a final line of "0" is not treated as EOF
  // deal with $line
}
fclose($fp);
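
A runnable sketch of the same idea, with a few assumptions: it uses `php://temp` (which spills to a temporary file for larger data) instead of `php://memory`, and it strips the line ending that fgets leaves on each line. Note that fgets breaks on "\n", so input using lone "\r" line endings would need to be normalized first.

```php
<?php
// Sample input; note the blank line and the line containing only "0".
$data = "one\r\n\r\n0\ntwo";

$fp = fopen('php://temp', 'r+');
fwrite($fp, $data);
rewind($fp);

$lines = [];
// Compare against false explicitly so a final "0" is not mistaken for EOF.
while (($line = fgets($fp)) !== false) {
    // fgets keeps the line ending, so strip "\r" and "\n" from the right.
    $lines[] = rtrim($line, "\r\n");
}
fclose($fp);
// $lines is ["one", "", "0", "two"]: empty lines are preserved.
```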
pguardiario
  • 2
    +1 and you can also use `php://temp` for storing larger data to temporary disk file. – CodeAngry Jul 19 '13 at 21:34
  • 6
    It should be noted that this allows you to detect empty lines, unlike the strtok() solution. The documentation is at http://php.net/manual/en/wrappers.php.php#refsect2-wrappers.php-unknown-unknown-unknown-unknown-unknown-descriptios – Josip Rodin Jul 28 '15 at 14:58
9

Potential memory issues with strtok:

One of the suggested solutions uses strtok but does not point out a potential memory issue (even though it claims to be memory efficient). According to the manual:

Note that only the first call to strtok uses the string argument. Every subsequent call to strtok only needs the token to use, as it keeps track of where it is in the current string.

It does this by keeping the string in memory. If you're working with large inputs, you need to flush it once you're done looping.

<?php
function process($str) {
    // the first call takes the string; later calls only take the separator
    $line = strtok($str, PHP_EOL);

    while ($line !== FALSE) {
        /*do something with the current line here...*/

        // get the next line (FALSE once the string is exhausted)
        $line = strtok(PHP_EOL);
    }
    //the bit that frees up memory
    strtok('', '');
}

If you're only concerned with physical files (e.g. data mining):

According to the manual, for the file upload part you can use the file() function:

 //Create the array
 $lines = file( $some_file );

 foreach ( $lines as $line ) {
   //do something here.
 }
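
file() also accepts flags that take care of line endings and blank lines. A minimal sketch, assuming a throwaway temp file stands in for the uploaded file:

```php
<?php
// Create a throwaway file so the sketch is runnable; in the question's
// scenario the path would come from the upload instead.
$path = tempnam(sys_get_temp_dir(), 'lines');
file_put_contents($path, "one\ntwo\n\nthree\n");

// FILE_IGNORE_NEW_LINES strips the line ending from each element;
// FILE_SKIP_EMPTY_LINES drops lines that are empty.
$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

unlink($path);
// $lines is ["one", "two", "three"]
```

Like the plain call, this reads the whole file into an array at once, so it suits modest file sizes.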
AbsoluteƵERØ
6
foreach(preg_split('~[\r\n]+~', $text) as $line){
    if($line === '' or ctype_space($line)) continue; // skip empty and whitespace-only lines (a line of "0" is kept)
    // if(!strlen($line = trim($line))) continue; // or trim by force and skip empty
    // use $line here (it is trimmed only if the alternative above is used)
}

^ this is how you break lines properly, cross-platform compatible with Regexp :)

CodeAngry
5

Kyril's answer is best considering you need to be able to handle newlines on different machines.

"I'm mostly looking for useful PHP functions, not an algorithm for how to do it. Any suggestions?"

I use these a lot:

  • explode() can be used to split a string into an array, given a single delimiter.
  • implode() is explode's counterpart, to go from array back to string.
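
A short sketch of the two together, loosely tied to the question's goal of building a query from the lines. The table and column names (`tokens`, `value`) are made up for illustration, and the placeholders assume the values would be bound later through a prepared statement:

```php
<?php
// explode(): split on one literal delimiter (here "\n" only).
$lines = explode("\n", "alpha\nbeta\ngamma");

// implode(): join pieces back into a string, here one "(?)" per line.
$placeholders = implode(', ', array_fill(0, count($lines), '(?)'));

// Hypothetical table/column names; the values themselves are not
// interpolated into the SQL text.
$sql = "INSERT INTO tokens (value) VALUES $placeholders";
// $sql is "INSERT INTO tokens (value) VALUES (?), (?), (?)"
```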
Vega
Joe Kiley
3

Similar to @pguardiario's answer, but using a more "modern" (OOP) interface:

$fileObject = new \SplFileObject('php://memory', 'r+');
$fileObject->fwrite($content);
$fileObject->rewind();

while ($fileObject->valid()) {
    $line = $fileObject->current();
    $fileObject->next();
}
Fabien Sa