
I have a CIFS share from Windows Server 2012 R2 mounted on Ubuntu 14.04.2 LTS (kernel 3.13.0-61-generic) like this:

/etc/fstab

//10.1.2.3/Share /Share cifs credentials=/root/.smbcredentials/share_user,user=share_user,dirmode=0770,filemode=0660,uid=4000,gid=5000,forceuid,forcegid,noserverino,cache=none 0 0
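
For reference, the credentials file above follows the usual mount.cifs credentials format - something like this, with placeholder values (the domain line is optional):

username=share_user
password=not-the-real-password
domain=WORKGROUP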

The gid=5000 corresponds to group www-data which runs a PHP process.

The files are mounted correctly when I check via the console logged in as the www-data user - they are readable and removable (the operations that are used by the PHP script).

The PHP script processes about 50,000-70,000 files per day. The files are created on the Windows host machine and some time later the PHP script running on the Linux machine is notified about a new file, checks whether the file exists (file_exists), reads it and deletes it. Usually everything works fine, but sometimes (a few hundred to 1,000-2,000 times per day) the PHP script raises an error that the file does not exist. That should never happen, since the script is notified only of files that actually exist.

When I manually check those files reported as not existing, they are correctly accessible on the Ubuntu machine and have a creation date from before the PHP script checked their existence.

Then I trigger the PHP script manually to pick up that file and it is picked up without problems.

What I already tried

There are multiple similar questions, but I seem to have exhausted all the advice:

  • I added clearstatcache() before checking file_exists($f)
  • The file and directory permissions are OK (exactly the same file is picked up correctly later on)
  • The path used for checking file_exists($f) is an absolute path with no special characters - the file paths are always of format /Share/11/222/333.zip (with various digits)
  • I used noserverino share mount parameter
  • I used cache=none share mount parameter

/proc/fs/cifs/Stats displays the following, but I don't know if there is anything suspicious here. The share in question is 2) \\10.1.2.3\Share

Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

6 session 2 share reconnects
Total vfs operations: 133925492 maximum at one time: 11

1) \\10.1.2.3\Share_Archive
SMBs: 53824700 Oplocks breaks: 12
Reads:  699 Bytes: 42507881
Writes: 49175075 Bytes: 801182924574
Flushes: 0
Locks: 12 HardLinks: 0 Symlinks: 0
Opens: 539845 Closes: 539844 Deletes: 156848
Posix Opens: 0 Posix Mkdirs: 0
Mkdirs: 133 Rmdirs: 0
Renames: 0 T2 Renames 0
FindFirst: 21 FNext 28 FClose 0
2) \\10.1.2.3\Share
SMBs: 50466376 Oplocks breaks: 1082284
Reads:  39430299 Bytes: 2255596161939
Writes: 2602 Bytes: 42507782
Flushes: 0
Locks: 1082284 HardLinks: 0 Symlinks: 0
Opens: 2705841 Closes: 2705841 Deletes: 539832
Posix Opens: 0 Posix Mkdirs: 0
Mkdirs: 0 Rmdirs: 0
Renames: 0 T2 Renames 0
FindFirst: 227401 FNext 1422 FClose 0

One pattern I think I see is that the error is raised only if the file in question has already been processed (read and deleted) earlier by the PHP script. There are many files that have been correctly processed and then processed again later, but I have never seen this error for a file that is processed for the first time. The time between re-processing varies from 1 to about 20 days. For re-processing, the file is simply recreated under the same path on the Windows host with updated content.

What can the problem be? How can I investigate this better? How can I determine whether the problem lies on the PHP or the OS side?


Update

I have moved the software that produces the files to an Ubuntu VM that mounts the same shares in the same way. This component is coded in Java. I am not seeing any issues when reading/writing the files.


Update - PHP details

The exact PHP code is:

$strFile = zipPath($intApplicationNumber);

clearstatcache();

if(!file_exists($strFile)){
    return responseInternalError('ZIP file does not exist', $strFile);
}

$intApplicationNumber is a request parameter (e.g. 12345678) which is simply transformed into a path by the zipPath() function (e.g. /Share/12/345/678.zip - always a full path).
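
For illustration, a minimal sketch of what zipPath() presumably looks like (the real implementation is not shown here; the zero-padding and the 2/3/3 digit split are assumptions based on the example above):

function zipPath($intApplicationNumber) {
    // Assumed: left-pad to 8 digits, then split into 2/3/3 digit segments.
    $strNumber = str_pad((string)$intApplicationNumber, 8, '0', STR_PAD_LEFT);
    return sprintf('/Share/%s/%s/%s.zip',
        substr($strNumber, 0, 2),
        substr($strNumber, 2, 3),
        substr($strNumber, 5, 3));
}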

The script may be invoked concurrently with different application numbers, but will not be invoked concurrently with the same application number.

If the script fails (returns the 'ZIP file does not exist' error), it is called again a minute later. If that fails too, the file is permanently marked as failed. Then, usually more than an hour later, I can call the script manually with the same invocation (GET request) that is used on production and it works fine - the file is found and sent in the response:

public static function ResponseRaw($strFile){
    ob_end_clean();
    self::ReadFileChunked($strFile, false);
    exit;
}

protected static function ReadFileChunked($strFile, $blnReturnBytes=true) {
    $intChunkSize = 1048576; // 1M
    $strBuffer = '';
    $intCount = 0;
    $fh = fopen($strFile, 'rb');

    if($fh === false){
        return false;
    }

    while(!feof($fh)){
        $strBuffer = fread($fh, $intChunkSize);
        echo $strBuffer;
        if($blnReturnBytes){
            $intCount += strlen($strBuffer);
        }
    }

    $blnStatus = fclose($fh);

    if($blnReturnBytes && $blnStatus){
        return $intCount;
    }

    return $blnStatus;
}

After the client receives the file, it notifies the PHP server that the file can be moved to an archive location (by means of copy() and unlink()). That part works fine.
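
That step is essentially the following (a simplified sketch - the archive path mapping shown here is illustrative, not the exact production code):

// Copy the processed file to the archive share, then remove the original.
$strArchiveFile = str_replace('/Share/', '/Share_Archive/', $strFile);
if (copy($strFile, $strArchiveFile)) {
    unlink($strFile);
}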


STRACE result

After several days without errors, the error reappeared. I ran strace and it reports

access("/Share/11/222/333.zip", F_OK) = -1 ENOENT (No such file or directory)

for some files that do exist when I run ls /Share/11/222/333.zip from the command line. Therefore the problem is at the OS level; PHP is not to blame.

The errors started appearing when the load on the disk on the host increased (due to other processes), so @risyasin's suggestion below seems most likely - it's a matter of busy resources/timeouts.

I'll try @miguel-svq's advice of skipping the existence test and going straight for fopen(), handling the error then. I'll see if it changes anything.
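
Something along these lines - a simplified sketch, not the final code (the single retry and the pause length are arbitrary):

$strFile = zipPath($intApplicationNumber);

// Skip file_exists() entirely; let fopen() report the failure and retry once
// after a short pause in case the ENOENT is a transient CIFS hiccup.
$fh = @fopen($strFile, 'rb');
if ($fh === false) {
    sleep(5);
    $fh = @fopen($strFile, 'rb');
}

if ($fh === false) {
    return responseInternalError('ZIP file does not exist', $strFile);
}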

Adam Michalik
  • Good question. It's not the first time I've heard of something like this being unreliable. A workaround which might help you a little is to retry the file_exists and not stop the script right away. – Daniel W. Feb 18 '16 at 16:07
  • Thanks @DanFromGermany - yeah, it's one of the dirty ideas I had - retry (even after a pause of N seconds) in case it's some sort of temporary hiccup. But I would really like to understand why that happens and fix it at the root. – Adam Michalik Feb 18 '16 at 20:57
  • 3
    I don't really think this is about php but nfs. there can be timeouts or busy resources since that rely on networking. `strace` and `tcpdump` on both sides to see what's actually happening can give you clues. also try with user of php/webserver while testing it. – risyasin Mar 17 '16 at 22:02
  • 1
    Absolutelly agree with @risyasin, probably nfs and not php. Please, If you solve it or find why that happends make us know. I had a similar issue years ago and handled it skipping the file_exists check and directly try/catch open and read the file... – miguel-svq Mar 21 '16 at 21:58
  • I second the others, it is most likely a NFS or SMB2 issue. It may have to do with caching on the Windows server. Have a look for DirectoryCacheLifetime and the comments at https://technet.microsoft.com/en-us/library/ff686200(WS.10).aspx – John P Mar 22 '16 at 08:19
  • @John P - yes, I read that article already, but as far as I understand, those settings apply to caching if windows is the client, right? Anyhow, setting all to zero brought no improvement. I'm out of office this week, will come back with more info (strace logs) next week. Thanks all for the hints. – Adam Michalik Mar 22 '16 at 11:30
  • So how is the PHP script actually notified of a file waiting, and how does it proceed? Will it leave the file in place until it's done processing? Might multiple instances of the PHP script be notified for the same file if processing takes too long? – Gralgrathor Mar 27 '16 at 14:08
  • This may be helpful: https://bugs.php.net/bug.php?id=48062 – Axalix Mar 27 '16 at 23:16
  • Also this: http://www.58shiji.com/upload/jc/php5/php_enhanced_zh/res/function.stat.html#116050 – Axalix Mar 27 '16 at 23:31
  • @Axalix - thanks, I'll try adding `file_exists(realpath($f))`, which is the suggested workaround in the first link you provided. The second post is about 64-bit inode numbers on 32-bit systems - I already ran into this post and I've mounted the share with the `noserverino` option. Also, my Ubuntu is 64-bit, so it shouldn't be a problem... – Adam Michalik Mar 29 '16 at 07:39
  • @Gralgrathor - I added the details of the PHP code. – Adam Michalik Mar 29 '16 at 08:10
  • 1
    You could try `fopen`ing to see if it exists, maybe it has a better success rate? – toster-cx Mar 31 '16 at 09:33
  • I agree with DanFromGermany. Besides dealing with a network resource, error handling should assume that the network can fail. So you should have some sort of per-file error counter and mark files to rerun later in a fallback loop (it's better than live-looping on error because it is more informative). – quazardous Apr 09 '16 at 14:38
  • How about scheduling this at a specific time and date when resources are available? Just a suggestion, but I think quazardous has the right idea. – yardpenalty.com May 06 '16 at 16:08
  • Did it work to remove the exists check and just use fopen right away??? I'm having the same issue. – ADJenks Mar 25 '21 at 18:26

2 Answers


You can try using the directio option to avoid inode data caching on files opened on this mount:

//10.1.2.3/Share /Share cifs credentials=/root/.smbcredentials/share_user,user=share_user,dirmode=0770,filemode=0660,uid=4000,gid=5000,forceuid,forcegid,noserverino,cache=none,directio 0 0
javierfdezg
  • The [man page](http://linux.die.net/man/8/mount.cifs) says "This option will be deprecated in 3.7. Users should use cache=none instead on more recent kernels". My kernel is 3.13 and I already have `cache=none`. Does using `directio` make sense then? – Adam Michalik Apr 01 '16 at 06:57

This is hardly a definitive answer to my problem, rather a summary of what I found out and what I settled on.

At the bottom of the problem lies the fact that it is the OS that reports that the file does not exist. Running strace occasionally shows

access("/Share/11/222/333.zip", F_OK) = -1 ENOENT (No such file or directory)

for the files that do exist (and show up when listed with ls).

The Windows share host was sometimes under heavy disk load. What I did was move one of the shares to a different host, so the load is now spread between the two. Also, the general load on the system has been a bit lighter lately. Whenever I get the error about a file not existing, I retry the request some time later and the error is gone.

Adam Michalik