2

I’m working with the Royal Mail PAF database in CSV format (approx. 29 million lines) and need to split the data into files of approx. 1000 lines each.

I found this solution for writing the files, but I do not know how to (a) open the file and (b) tell the script to delete the lines from the original file after copying them.

Can anyone advise?

Sofia Rose

2 Answers

4

Does it need to be in PHP? If you're on a Unix/Linux system, you can use the split command.

split --lines=1000 mybigfile.csv

http://en.wikipedia.org/wiki/Split_%28Unix%29

Andy Lester
  • Unfortunately yes, as it's going to be hooked into a WordPress mu-plugin to run from wp-cron – Sofia Rose Jan 06 '14 at 04:02
  • Is your WordPress install running on a Unix/Linux system? Chances are you can run external programs from your plugin, unless the sysadmin has prevented that for security reasons. See http://us2.php.net/function.exec – Andy Lester Jan 06 '14 at 04:06
  • It's a Linux box, but yeah, they have locked this down :( – Sofia Rose Jan 06 '14 at 04:09
3

I don't know the Royal Mail PAF database, but you open files with fopen(), read a line with fgets(), and delete files with unlink().

The solution you found shows the idea of splitting every 1000 lines, but in your case there is no need to call any CSV function at all. It's just a simple "copy each block of 1000 lines into a new file".

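$bigFile = fopen("paf.csv", "r");
if ($bigFile === false) {
    die("Cannot open paf.csv");
}
$j = 0;            // index of the next small file
$i = 0;            // number of lines copied so far
$smallFile = null;

// fgets() returns false at end of file, so no empty trailing file is created.
while (($line = fgets($bigFile)) !== false) {
    // Start a new small file every 1000 lines.
    if ($i % 1000 === 0) {
        if ($smallFile !== null) {
            fclose($smallFile);
        }
        $smallFile = fopen("small$j.csv", "w");
        $j++;
    }
    fwrite($smallFile, $line);
    $i++;
}
if ($smallFile !== null) {
    fclose($smallFile);
}
fclose($bigFile);
unlink("paf.csv"); // remove the original once everything has been copied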
Markus Malkusch
  • Thanks for the advice, do you know of a function to just delete the first part of a file instead of the whole thing? – Sofia Rose Jan 06 '14 at 03:51
  • You can't delete part of a file with a single atomic function. You have to overwrite the old file with the new content. In your case I suggest simply deleting the file when you have finished splitting. – Markus Malkusch Jan 06 '14 at 03:53
  • Okay thanks, I was trying to do it this way so PHP wouldn’t time out, and if it did, it could pick up where it left off. – Sofia Rose Jan 06 '14 at 03:55
  • You can [increase the time limit](http://www.php.net/manual/en/function.set-time-limit.php). But being able to resume the work is a good idea. Use the existing split files as a marker of your progress. – Markus Malkusch Jan 06 '14 at 03:58
  • This will be a monthly task run on cron. I will check if wpengine allows this. – Sofia Rose Jan 06 '14 at 03:59
  • So you're saying that if it has written 10 files so far, I should tell PHP to start reading at line 10000? I never thought of doing it that way. Thanks – Sofia Rose Jan 06 '14 at 04:00
  • If your line lengths are fixed, you can do that. If not, you can [start](http://www.php.net/fseek) at the byte offset given by the combined [size](http://www.php.net/manual/en/function.filesize.php) of your existing files. – Markus Malkusch Jan 06 '14 at 04:06
  • I shall use the combined size method. Thanks for your help. If you're interested, here are the first 30 lines from the sample data file :) https://gist.github.com/anonymous/8278066 – Sofia Rose Jan 06 '14 at 04:11
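
The combined-size idea from the comments above could look roughly like the sketch below. It is only an illustration under some assumptions: the function name resumeSplit() and its parameters are hypothetical, the chunk files follow the small0.csv, small1.csv, ... naming from the answer and were written in order, and the source file has not changed between runs. Because a timeout may have cut the last chunk short, the sketch throws that chunk away and rewrites it, so only the sizes of the chunks before it are added up for the fseek() offset.

```php
<?php
// Hypothetical helper (not from the original answer): continue splitting
// $source into {$prefix}0.csv, {$prefix}1.csv, ... after an interrupted run.
function resumeSplit(string $source, string $prefix, int $linesPerFile = 1000): int
{
    clearstatcache(); // make sure is_file()/filesize() see the current state

    // Count the chunk files that already exist.
    $count = 0;
    while (is_file("{$prefix}{$count}.csv")) {
        $count++;
    }

    // Assumption: only the last chunk can be incomplete, so it is redone.
    $next = max(0, $count - 1);

    // The combined size of the finished chunks is the byte offset to resume at.
    $offset = 0;
    for ($k = 0; $k < $next; $k++) {
        $offset += filesize("{$prefix}{$k}.csv");
    }

    $in = fopen($source, 'r');
    if ($in === false) {
        throw new RuntimeException("Cannot open $source");
    }
    fseek($in, $offset);

    $out = null;
    $lines = 0;
    while (($line = fgets($in)) !== false) {
        // Open a new chunk every $linesPerFile lines.
        if ($lines % $linesPerFile === 0) {
            if ($out !== null) {
                fclose($out);
            }
            $out = fopen("{$prefix}{$next}.csv", 'w');
            $next++;
        }
        fwrite($out, $line);
        $lines++;
    }
    if ($out !== null) {
        fclose($out);
    }
    fclose($in);

    return $next; // number of chunk files now on disk
}
```

A cron run could call resumeSplit("paf.csv", "small") and unlink() the source only after the call returns without timing out. Redoing the last chunk costs at most one chunk of repeated work per run, which avoids having to detect a half-written line.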