
I have a function that writes ~120KB–150KB of HTML and meta data to ~8000 .md files with fixed names every few minutes:

a-agilent-technologies-healthcare-nyse-us-39d4
aa-alcoa-basic-materials-nyse-us-159a
aaau-perth-mint-physical-gold--nyse-us-8ed9
aaba-altaba-financial-services-nasdaq-us-26f5
aac-healthcare-nyse-us-e92a
aadr-advisorshares-dorsey-wright-adr--nyse-us-d842
aal-airlines-industrials-nasdaq-us-29eb
  • If the file does not exist, it generates/writes quite fast.
  • If however the file exists, it does the same much more slowly, since the existing file already carries ~150KB of data.

How do I solve this problem?

Do I generate a new file with a new name in the same directory, and unlink the older file in the for loop?

Or do I generate a new folder, write all the files, and then unlink the previous directory? The problem with this method is that sometimes 90% of the files are rewritten while some remain the same.
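By the first option I mean something roughly like the following sketch (the .tmp suffix is just an example):

$tmp_filename = $new_filename . '.tmp';       // new name in the same directory
file_put_contents($tmp_filename, $md_file_content);
rename($tmp_filename, $new_filename);         // rename() replaces the older file, so a separate unlink() is not strictly needed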


Code

This function is being called in a for loop, which you can see in this link

public static function writeFinalStringOnDatabase($equity_symbol, $md_file_content, $no_extension_filename)
{
    /**
     * $md_file_content is the MD file content with the meta data and the entire HTML
     */
    $md_file_content = $md_file_content . ConfigConstants::NEW_LINE . ConfigConstants::NEW_LINE;
    $dir = __DIR__ . ConfigConstants::DIR_FRONT_SYMBOLS_MD_FILES; // symbols front directory
    $new_filename = EQ::generateFileNameFromLeadingURL($no_extension_filename, $dir);

    if (file_exists($new_filename)) {
        if (is_writable($new_filename)) {
            file_put_contents($new_filename, $md_file_content);
            if (EQ::isLocalServer()) {
                echo $equity_symbol . "  " . ConfigConstants::NEW_LINE;
            }

        } else {
            if (EQ::isLocalServer()) {
                echo $equity_symbol . " symbol MD file is not writable in " . __METHOD__ . "  Maybe, check permissions!" . ConfigConstants::NEW_LINE;
            }
        }
    } else {
        $fh = fopen($new_filename, 'wb');
        fwrite($fh, $md_file_content);
        fclose($fh);
        if (EQ::isLocalServer()) {
            echo $equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now " . ConfigConstants::NEW_LINE;
        }

    }

}
  • Why do you use `file_put_contents` and `fwrite` if they act the same? Or let me put it this way: Why do you have the `if (file_exists($new_filename))` at all? – Dharman May 09 '19 at 19:09
  • If you're generating the (mostly) same 8000 files every few minutes, it seems to me that the better solution would be to generate them on the fly as they're requested. (Or are they not requested via the web?) – Alex Howansky May 09 '19 at 19:10
  • Sounds like a bad idea in total, as disk partitions can slow down when many files exist in one directory; see https://stackoverflow.com/questions/2994544/how-many-files-in-a-directory-is-too-many-on-windows-and-linux – Raymond Nijland May 09 '19 at 19:25

1 Answer


I haven't programmed in PHP for years, but this question has drawn my interest today. :D

Suggestion

How do I solve this problem? Do I generate a new file with a new name in the same directory, and unlink the older file in the for loop?

Simply use the 3 amigos fopen(), fwrite() & fclose() again: opening an existing file with mode 'wb' truncates it, so the write replaces its entire content.

if (file_exists($new_filename)) {
    if (is_writable($new_filename)) {
        $fh = fopen($new_filename,'wb');
        fwrite($fh, $md_file_content);
        fclose($fh);

        if (EQ::isLocalServer()) {
            echo $equity_symbol . "  " . ConfigConstants::NEW_LINE;
        }
    } else {
        if (EQ::isLocalServer()) {
            echo $equity_symbol . " symbol MD file is not writable in " . __METHOD__ . "  Maybe, check permissions!" . ConfigConstants::NEW_LINE;
        }
    }
} else {
    $fh = fopen($new_filename, 'wb');
    fwrite($fh, $md_file_content);
    fclose($fh);
    if (EQ::isLocalServer()) {
        echo $equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now " . ConfigConstants::NEW_LINE;
    }
}

For the sake of the DRY principle:

// It's smart to put the logging and similar tasks into a separate function,
// once you end up writing the same thing over and over again.
public static function log($content)
{
    if (EQ::isLocalServer()) {
        echo $content;
    }
}

public static function writeFinalStringOnDatabase($equity_symbol, $md_file_content, $no_extension_filename)
{
    $md_file_content = $md_file_content . ConfigConstants::NEW_LINE . ConfigConstants::NEW_LINE;
    $dir = __DIR__ . ConfigConstants::DIR_FRONT_SYMBOLS_MD_FILES; // symbols front directory
    $new_filename = EQ::generateFileNameFromLeadingURL($no_extension_filename, $dir);
    $file_already_exists = file_exists($new_filename);

    if ($file_already_exists && !is_writable($new_filename)) {
        EQ::log($equity_symbol . " symbol MD file is not writable in " . __METHOD__ . "  Maybe, check permissions!" . ConfigConstants::NEW_LINE);
    } else {
        $fh = fopen($new_filename,'wb'); // you should also check whether fopen succeeded
        fwrite($fh, $md_file_content); // you should also check whether fwrite succeeded

        if ($file_already_exists) {
            EQ::log($equity_symbol . "  " . ConfigConstants::NEW_LINE);
        } else {
            EQ::log($equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now " . ConfigConstants::NEW_LINE);
        }

        fclose($fh);
    }
}
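
If you want to act on the two inline hints above (checking the return values), a minimal sketch could look like this (the log messages are just examples):

$fh = fopen($new_filename, 'wb');
if ($fh === false) {
    EQ::log("Could not open " . $new_filename . " in " . __METHOD__ . ConfigConstants::NEW_LINE);
    return;
}

if (fwrite($fh, $md_file_content) === false) {
    EQ::log("Could not write to " . $new_filename . " in " . __METHOD__ . ConfigConstants::NEW_LINE);
}

fclose($fh);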

Possible cause

tl;dr Too much overhead due to the Zend string API being used.

The official PHP manual says:

file_put_contents() is identical to calling fopen(), fwrite() and fclose() successively to write data to a file.

However, if you look at the source code of PHP on GitHub, you can see that the "writing data" part is done slightly differently in file_put_contents() and fwrite().

  • In the fwrite function the raw input data (= $md_file_content) is directly accessed in order to write the buffer data to the stream:

    Line 1171:

ret = php_stream_write(stream, input, num_bytes);
  • In the file_put_contents function, on the other hand, the Zend string API is used (which I had never heard of before). Here the input data and its length are encapsulated for some reason.

    Line 662:

numbytes = php_stream_write(stream, Z_STRVAL_P(data), Z_STRLEN_P(data));

(The Z_STR.... macros are defined here, if you are interested).

So, my suspicion is that the Zend string API is possibly causing the overhead when using file_put_contents.
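
If you want to check whether that actually matters for your ~150KB payloads, a quick micro-benchmark along these lines should show it (file names, payload size and iteration count are just placeholders):

$payload    = str_repeat('a', 150 * 1024); // ~150KB of dummy content
$iterations = 1000;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    file_put_contents('bench_fpc.md', $payload);
}
echo 'file_put_contents:   ' . (microtime(true) - $start) . 's' . PHP_EOL;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $fh = fopen('bench_fwrite.md', 'wb');
    fwrite($fh, $payload);
    fclose($fh);
}
echo 'fopen/fwrite/fclose: ' . (microtime(true) - $start) . 's' . PHP_EOL;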


Side note

At first I thought that every file_put_contents() call creates a new stream context, since the lines related to creating the context are also slightly different:

PHP_NAMED_FUNCTION(php_if_fopen) (Reference):

context = php_stream_context_from_zval(zcontext, 0);

PHP_FUNCTION(file_put_contents) (Reference):

context = php_stream_context_from_zval(zcontext, flags & PHP_FILE_NO_DEFAULT_CONTEXT);

However, on closer inspection, the php_stream_context_from_zval call is effectively made with the same params: the first param zcontext is null, and since you don't pass any flags to file_put_contents, flags & PHP_FILE_NO_DEFAULT_CONTEXT also becomes 0 and is passed as the second param.

So, I guess the default stream context is re-used here on every call. Since it's apparently a stream of the persistent type, it is not disposed of after the php_stream_close() call. So the Fazit, as the Germans say, is that there is apparently either no additional overhead, or the same overhead, regarding the creation or reuse of a stream context in both cases.

Thank you for reading.