
I have a file which is over 400 MB.

It is a timetable database which is only distributed in this way.

In this text file there is a string which marks the start of a data record.

This string always begins with "BSN"; likewise, there is a string that marks the end of the data record, which always starts with "LT".

What I'm trying to fathom is how to chop the data file into chunks containing 1000 data records each. Then, when that cycle is complete, I can import those files sequentially.

The created files must be numbered sequentially in a new folder...

[edit] the record set varies greatly in length [/edit]

Below is a sample of one of the groups:

BSNC031551112111206240000001   << DATA RECORD START >> 
BX         EMYEM129000                                                           
LOSHEFFLD 2235 2235                                                
LIDORESNJ                                              
LISPDN                                       
LTDRBY    2326 23266           << DATA RECORD END >>                                        
BSNC033501112111205130000001   << NEXT RECORD >>
BX         EMYEM118600    

*The << >> tags are added for your understanding; they do not exist in the file.

I currently read in the file using the PHP fopen / fgets method.
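Roughly, that read loop looks like this (a minimal sketch, with a placeholder file name):

$fp = fopen("timetable.dat", "r");   // placeholder file name
while (($line = fgets($fp)) !== false) {
    // ... one line of the timetable file at a time, so the whole
    // 400 MB file is never held in memory ...
}
fclose($fp);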


2 Answers


Something like this should work for you:

$fp = fopen($bigfile, "r");

$file_num = 1;
$prefix = "FILE_";
$suffix = ".DAT";
$buff = "";
$recNo = 0;

while (($rec = fgets($fp)) !== false) {
    // a new data record starts on every line beginning with "BSN"
    if (substr($rec, 0, 3) == 'BSN') {
        $recNo++;
        // once 1000 records are buffered, flush them before this
        // record starts the next file
        if ($recNo > 1000) {
            // write out the current chunk
            file_put_contents($prefix.$file_num.$suffix, $buff);
            // clear the buffer and move on to the next file
            $buff = "";
            $file_num++;
            // the current BSN line becomes record 1 of the new file
            $recNo = 1;
        }
    }
    // add the line to the buffer
    $buff .= $rec;
}
fclose($fp);

// flush the remainder
if ($buff) file_put_contents($prefix.$file_num.$suffix, $buff);
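If the chunks should end up in a new folder, $prefix can simply include a directory path created beforehand with mkdir(). The numbered files can then be picked up in order for the import step; a minimal sketch, where importChunk() is a placeholder for whatever does the actual database work:

// walk the generated chunks in numeric order and import each one
$file_num = 1;
while (file_exists($prefix.$file_num.$suffix)) {
    importChunk($prefix.$file_num.$suffix);   // importChunk() is hypothetical
    $file_num++;
}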

If you have a fixed, predefined data structure, you can use the Unix split command:

 split -l 6000 your_big_file.txt data_

This command divides the big file into smaller files of 6000 lines each (1000 data records, assuming each record is exactly 6 lines).

Or, if the data structure is nonuniform, you can use a Perl one-liner:

perl -n -e '/^BSNC/ and open FH, ">output_".$n++; print FH;' your_big_file

Perl can parse large files line by line instead of slurping the whole file into memory.

A new file will be created for each data record. Don't worry: the ext4 file system has a theoretical limit of 4 billion files per directory.

After this, it's possible to import all of the data into the database using a PHP script.
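A minimal sketch of that import, assuming the output_N files produced by the one-liner above, a PDO connection, and a schedule table with a single raw_record column (the DSN, credentials, and table are placeholders):

// import each output_N file produced above, in numeric order
$pdo = new PDO('mysql:host=localhost;dbname=timetable', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO schedule (raw_record) VALUES (:rec)');

$files = glob('output_*');
// glob() returns names in lexicographic order, so re-sort numerically by suffix
usort($files, function ($a, $b) {
    return (int) substr($a, 7) - (int) substr($b, 7);
});

foreach ($files as $file) {
    $stmt->execute([':rec' => file_get_contents($file)]);
}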
