10

I have a large number of audio files that I run through a processing algorithm to extract certain bits of data from them (e.g., the average volume of the entire clip). I have a number of build scripts that previously pulled the input data from a Samba network share, to which I had created a network drive mapping via net use (i.e., M: ==> \\server\share0).

Now that I have a new massive 1TB SSD, I can store the files locally and process them very quickly. To avoid having to do a massive re-write of my processing scripts, I removed my network drive mapping, and re-created it using the localhost host name. ie: M: ==> \\localhost\mydata.

When I make use of such a mapping, do I risk incurring significant overhead, such as from the data having to travel through part of Windows' network stack, or does the OS use shortcuts so that it equates more or less to direct disk access (i.e., does the machine know it's just pulling files from its own hard drive)? Increased latency isn't much of a concern for me, but maximum sustained average throughput is critical.

I ask this because I'm deciding whether or not I should modify all of my processing scripts to work with a different style for network paths.

Extra Question: Does the same apply to Linux hosts: are they smart enough to know they are pulling from a local disk?

pnuts
Cloud
  • 1
    Throughput will be affected to some extent. On a spinning drive, the increased per-file overhead accounts for most of the performance loss, so if you are dealing with a small number of large files it probably won't be noticeable. On an SSD I don't know. Try it and see! – Harry Johnston Nov 18 '15 at 21:21
  • 2
    ... but the *best* solution in this case is probably to use `subst` to assign a drive letter to the folder. The overhead on that is negligible, and the network stack is not involved. – Harry Johnston Nov 18 '15 at 21:22
  • 2
    Question seems to assume two possible answers: **yes** the OS optimizes access to drive mapped to local share and **no** it doesn't. But it's not that simple. There are many layers in the stack between your app and its data. On Linux and Windows there will be some optimization at network layer for a local network connection (at the very least the MAC layer and below are avoided). However it's certainly the case that the code path to a local drive vs a mapped network drive won't be the same. To Harry's point, the app's behavior can create a significant delta. Bottom line: benchmark to know for sure – Χpẘ Nov 19 '15 at 02:54
  • Doesn't this depend on what methods you're using to read from? Can you post the method where you're fetching the data? – Christopher Bales Dec 02 '15 at 21:25
  • @ChristopherBales I would be copying files via shell scripts using the XCOPY command, and passing a network mapped drive created either via `subst` or `net use`, if one is faster than the other. – Cloud Dec 03 '15 at 01:32

2 Answers

5

When I make use of such a mapping, do I risk incurring significant overhead,

Yes. By using a UNC path (\\hostname\sharename\filename) as opposed to a local path ([\\?\]driveletter:\directoryname\filename), you route all file access through the Server Message Block (SMB) protocol, the protocol that Samba implements. This adds significant overhead to disk access and to access times in general.

The flow over a network is like this:

Application -> SMB Client -> Network -> SMB Server -> Target file system

Now by moving your files to your local machine, but still using UNC to access them, the flow is like this:

Application -> SMB Client -> localhost -> SMB Server -> Target file system

The only thing you minimized (not eliminated, SMB traffic to localhost still involves the network layers and all computations and traffic associated) is network traffic.

Also, given SMB is specifically tailored for network traffic, its reads may not optimally use your disk's and OS's caches. It may for example perform its reads in blocks of a certain size, while your disk performs better when reading blocks of another size.

If you want optimal throughput and minimal access times, use as few layers in between as possible, in this case by directly accessing the filesystem:

Application -> Target file system
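
Since sustained throughput is what matters here, the two flows can be compared by timing a sequential read of the same file through each path. The following is a minimal sketch, not a rigorous benchmark: the function name `measure_read_throughput` and the 1 MiB block size are assumptions chosen for illustration, and a second read of the same file may be served from the OS cache, so use files larger than RAM or different files per run.

```python
import time


def measure_read_throughput(path, block_size=1024 * 1024):
    """Read `path` sequentially and return (bytes_read, seconds, mib_per_s).

    `block_size` is an arbitrary choice for this sketch; try several values.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffering so each read() maps to
    # one OS-level read request; the OS file cache is still in play.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total / (1024 * 1024)
    rate = mib / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

Calling it once with a path on the mapped drive and once with the corresponding local path (both paths depend on your setup) gives two rates that can be compared directly.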
CodeCaster
  • Does using a drive mapping via `subst` or `net use` remove some of that overhead compared to a UNC path? Also, is `subst` different from or better than `net use` in any way? Thank you. – Cloud Dec 03 '15 at 19:24
  • 1
    No, the transfer will still happen over SMB over TCP through localhost, where most of the additional bottleneck lies. Subst and net use are only used to create different representations of the same resource, they don't fundamentally change how that resource is accessed. AFAIK though, can't look that up ATM. – CodeCaster Dec 03 '15 at 19:33
  • 1
    Thank you. This is exactly what I was looking for. – Cloud Dec 03 '15 at 19:42
  • Could you please provide an example for the second case, `[\\.\]driveletter:\directoryname\filename`. I tried ` \\.\c:\ ` to access my "C" drive, but the syntax appears to be invalid. – Cloud Jan 04 '16 at 17:17
  • @Dogbert my bad, see http://stackoverflow.com/questions/21194530/what-does-mean-when-prepended-to-a-file-path. In your case it's `\\?\C:`. – CodeCaster Jan 04 '16 at 17:24
  • Thank you. One last question: the example above works when I use the "Start ==> Run..." dialog box, but when I try to map a network drive via the Windows Explorer "Map Network Drive" or CLI "net use" command, it fails. Is the notation different in those cases? – Cloud Jan 04 '16 at 17:27
  • Ah, in that case, I guess I just prefix my local paths with `\\?\C:\` instead in my scripts. Thanks! – Cloud Jan 04 '16 at 17:29
  • You generally don't need that syntax anyway. I just mentioned it for completeness's sake. – CodeCaster Jan 04 '16 at 17:30
4

Using TCP instead of direct file access certainly has overhead, even over loopback: routing, memory allocations and so on, on both Linux and Windows. The loopback device is a non-physical kernel device and is faster than other network devices, but it is not faster than direct file access. As far as I know, Windows has additional loopback optimizations such as NetDMA and "Fast TCP Loopback".

I assume the bottleneck with the loopback device will be memory copy operations. So accessing a file directly rather than over the loopback device will always be faster (and consume fewer resources) on both Linux and Windows.
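
To see the extra copying for yourself, a rough sketch like the one below reads a file directly and then reads the same file after pushing it through a TCP socket on 127.0.0.1. This is plain TCP, not SMB, so it only illustrates the loopback part of the path; the function names and the 1 MiB block size are assumptions made for this example.

```python
import socket
import threading
import time


def read_direct(path, block_size=1 << 20):
    """Read the file straight from the file system; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    return total


def read_via_loopback(path, block_size=1 << 20):
    """Send the file over a TCP socket on 127.0.0.1 and count bytes received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        # Accept one client, stream the file to it, then close the connection.
        conn, _ = server.accept()
        with conn, open(path, "rb") as f:
            while chunk := f.read(block_size):
                conn.sendall(chunk)

    worker = threading.Thread(target=serve)
    worker.start()
    total = 0
    with socket.create_connection(("127.0.0.1", port)) as client:
        # recv() returns b"" once the server side closes the connection.
        while data := client.recv(block_size):
            total += len(data)
    worker.join()
    server.close()
    return total


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Running `timed(read_direct, path)` and `timed(read_via_loopback, path)` on the same large file shows how much the socket hop costs on a given machine; the numbers will vary with OS, file size and caching.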

Additionally, both operating systems avoid protocol overhead for IPC through "named pipes" on Windows and "Unix domain sockets" on Linux; where applicable, using these will also be faster than using the loopback device.

mow