
This is my first question here. I have a file (EPF data exported from iTunes), like this example EPF dataset.

The columns are separated by SOH (ASCII character 1) and the rows by STX (ASCII character 2) followed by "\n". That works well, except that the app descriptions are multiline and contain end-of-line characters themselves. The problem appears when I try to read the file line by line:

  $fn = fopen("application_stripped","r");

  while(! feof($fn))  {
    $result = fgets($fn);
    print_r($result);
  }

  fclose($fn);

fgets() stops at the first end-of-line character (the one inside the description) rather than at the actual end-of-row marker. The input files are very large (up to 4-5 GB). Any ideas how to handle this?

PS: sorry for my English! :-)

Alexander S
  • You probably can just use https://www.php.net/manual/en/function.stream-get-line.php instead - that allows you to _specify_ the ending delimiter: _“This function is nearly identical to fgets() except in that it allows end of line delimiters other than the standard \n, \r, and \r\n, and does not return the delimiter itself.”_ – CBroe Mar 25 '20 at 10:15
  • It looks like I somehow need to skip the first 13 field separators (because each line must contain 13 columns) and assume that the end-of-line symbol comes after that. – Alexander S Mar 25 '20 at 10:23
  • Just read the line using the STX as delimiter (or STX + \n), then explode that line at the SOH characters afterwards …? – CBroe Mar 25 '20 at 10:32
  • Does https://stackoverflow.com/questions/4541749/fgetcsv-fails-to-read-line-ending-in-mac-formatted-csv-file-any-better-solution help in detecting the EOL? – Nigel Ren Mar 25 '20 at 10:49
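
The suggestion in the comments above (read up to the row delimiter with stream_get_line(), then split the row at SOH) could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for real EPF exports: the function name readEpfRows and the 1 MiB limit per row are assumptions made for illustration, and a row longer than that limit would be returned in pieces.

```php
<?php
// Hypothetical helper (name and default limit are assumptions, not from the question):
// yields every row of the file as an array of column values.
function readEpfRows(string $path, int $maxRowBytes = 1048576): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Read up to the row delimiter STX + "\n". The delimiter itself is not
        // returned, and newlines inside a description do not end the read.
        while (($row = stream_get_line($handle, $maxRowBytes, "\x02\n")) !== false) {
            if ($row === '') {
                continue; // skip empty reads
            }
            // Split the row into columns at SOH.
            yield explode("\x01", $row);
        }
    } finally {
        fclose($handle);
    }
}
```

Because the function is a generator, only one row is held in memory at a time, which matters for files of several gigabytes. It would be used by iterating over readEpfRows("application_stripped") with foreach and handling each array of columns in the loop body.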

1 Answer


Thanks, everyone, for your help. It looks like I was able to handle the issue. Here is the code I wrote:

$columnBreakpoint = 17;
$handle = @fopen("inputfile.txt", "r");
if ($handle) {

  #export_date
  #application_id
  #title
  #recommended_age
  #artist_name
  #seller_name
  #company_url
  #support_url
  #view_url
  #artwork_url_large
  #artwork_url_small
  #itunes_release_date
  #copyright
  #description
  #version
  #itunes_version
  #download_size

  $fileSeekPointer = 0;
  while (!feof($handle)) {
    // Read the next chunk (at most 10000 bytes); a full row is assumed to fit into it
    $result = stream_get_line($handle, 10000);
    if ($result === false) {
      break;
    }

    $positions = array();
    $pos = -1;
    // Find the positions of all column separators (SOH) in the chunk
    while (($pos = strpos($result, "\x01", $pos + 1)) !== false) {
      $positions[] = $pos;
    }
    // Each row has 17 columns, so the separator at index 17 already belongs to the next row.
    // If it is missing, this is the last row and the whole chunk is used.
    $breakpointPos = isset($positions[$columnBreakpoint]) ? $positions[$columnBreakpoint] : strlen($result);
    // Cut the chunk at this position
    $resultS = substr($result, 0, $breakpointPos);
    // The last "\n" before the cut is the actual end of the row ("\n" instead of the platform-dependent PHP_EOL)
    $eolPos = strrpos($resultS, "\n");
    if ($eolPos === false) {
      break; // no complete row in this chunk
    }
    $resultS = substr($resultS, 0, $eolPos);
    // Move the file pointer to the first position after the actual end of the row
    $fileSeekPointer += $eolPos + 1;
    fseek($handle, $fileSeekPointer);

    print '------------------'."\n";
    var_dump($resultS);
  }

  fclose($handle);
}
Alexander S