0

I have read that "preg_match_all" is not meant for parsing large files, but I need to do exactly that. I have increased:

pcre.backtrack_limit=1000000000
pcre.recursion_limit=1000000000

My PHP memory_limit is set to 5000M, and the script still ends without any error or exception within 0.2 seconds.

Is the only solution to split the 100M file into 100 small 1M files?

Thanks for your help.

Marek Javůrek
  • What's your code? It should work fine on large files, but it'll be a huge memory hog. A hugeeee one. – Nathanael Jul 03 '12 at 17:16
  • https://gist.github.com/3041137 It's ugly, I know... FOR, FOR, FOR ... – Marek Javůrek Jul 03 '12 at 17:18
  • PS: I have 8GB RAM (5GB free) – Marek Javůrek Jul 03 '12 at 17:21
  • If it's failing really quickly, you probably just have a `parse error`. Turn on PHP error reporting. Also, when sharing your code, you should post it in the question above, and format it nicely. – Nathanael Jul 03 '12 at 17:21
  • It's not a parse error... it works on a small chunk of data. – Marek Javůrek Jul 03 '12 at 17:27
  • Show us part of your code. What should a match look like? – Ωmega Jul 03 '12 at 17:30
  • 72 MB, and code is here: gist.github.com/3041137 – Marek Javůrek Jul 03 '12 at 17:37
  • Note that you can't just increase `pcre.recursion_limit`; you also need to increase the stack size of the running executable (i.e. `php.exe` or `httpd.exe` on Win32 machines). See my related answer to [RegExp in preg_match function returning browser error](http://stackoverflow.com/a/7627962/433790), which explains why really bad things can happen with PHP/PCRE and "large" target strings (and how you can avoid them). – ridgerunner Jul 03 '12 at 18:40
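
One way to tell a PCRE limit failure apart from other problems is to check the return value of `preg_match_all()`, which is `false` on failure, and then ask `preg_last_error()` for the reason. The sketch below assumes a hypothetical wrapper name, `matchAllOrFail`, that does not appear in the thread. It will not catch a hard crash such as the stack overflow described in the comment above, because in that case PHP never returns from the call.

```php
<?php
// Hypothetical wrapper (not from the thread): run preg_match_all() and turn a
// silent PCRE failure into an exception with a readable reason.
function matchAllOrFail(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

    if ($count === false) {
        // Map the most common PCRE error codes to a short description.
        $reasons = [
            PREG_INTERNAL_ERROR        => 'internal PCRE error',
            PREG_BACKTRACK_LIMIT_ERROR => 'pcre.backtrack_limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'pcre.recursion_limit exhausted',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 in subject',
        ];
        $code = preg_last_error();
        $reason = $reasons[$code] ?? "error code $code";
        throw new RuntimeException("preg_match_all failed: $reason");
    }

    return $matches;
}
```

If the exception is never thrown and the script still stops silently, the process itself is probably crashing, which points back to the stack-size issue rather than to the PCRE limits.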

2 Answers

4

Consider using command line tools which are much better suited to deal with large amounts of data.

grep, sed, awk, or some combination thereof.

Andy Jones
3

Based on your code, I suggest doing it this way:

  1. Set variable $data to an empty string

  2. Set variable $work to an empty string; read a block of data and append it to $data

  3. Use the regex #^(.*?)(<tr>\n(?!.*<tr>\n).*)$#s to split $data into $work (everything before the last <tr>\n) and $data (the rest, starting at the last <tr>\n)

  4. Find all matches in $work

  5. Go back to point #2 while data is available

  6. Find all matches in what remains in $data
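
The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the function name `matchAllInChunks`, the default block size, and the `<td>` pattern in the usage note are assumptions. It also swaps the splitting regex from step 3 for `strrpos()`, which finds the last `<tr>\n` without running a regex over the whole buffer. It assumes that no single match spans a `<tr>\n` boundary.

```php
<?php
// Hypothetical helper: read $path block by block and collect every match of
// $pattern, processing only complete "<tr>\n"-delimited records each round.
function matchAllInChunks(string $path, string $pattern, int $blockSize = 1048576): array
{
    $delimiter = "<tr>\n";
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $results = [];
    $data = '';                                  // step 1: carry-over buffer

    while (!feof($handle)) {
        $block = fread($handle, $blockSize);     // step 2: read and append
        if ($block === false) {
            break;
        }
        $data .= $block;

        // Step 3 (strrpos instead of the regex): split at the last delimiter.
        $pos = strrpos($data, $delimiter);
        if ($pos === false || $pos === 0) {
            continue;                            // no complete record yet
        }
        $work = substr($data, 0, $pos);          // complete records
        $data = substr($data, $pos);             // possibly incomplete tail

        // Step 4: match only inside the complete part.
        if (preg_match_all($pattern, $work, $matches, PREG_SET_ORDER) === false) {
            throw new RuntimeException('PCRE error code ' . preg_last_error());
        }
        array_push($results, ...$matches);
    }                                            // step 5: loop while data remains
    fclose($handle);

    // Step 6: whatever is left in the buffer is the final record.
    if ($data !== '') {
        if (preg_match_all($pattern, $data, $matches, PREG_SET_ORDER) === false) {
            throw new RuntimeException('PCRE error code ' . preg_last_error());
        }
        array_push($results, ...$matches);
    }

    return $results;
}
```

A call such as `matchAllInChunks('big.html', '#<td>(\d+)</td>#')` would then keep memory use close to one block plus one record, instead of holding the whole file in a single subject string. The file name and pattern here are placeholders.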

Ωmega
  • Regarding point 2: how big a block of data? What if I read "datada"... The second part, "da", will not be processed. – Marek Javůrek Jul 03 '12 at 17:52
  • @MarekJavůrek - you can do this with **any block size** and it will work. As for your example, it will be split into 2 parts, and the second part will be processed together with the next block of data (point #2 says **append**, which means adding to the end of the existing string: `$data = $data . $new` or `$data .= $new`). Just code it and test it. – Ωmega Jul 03 '12 at 17:58