
I wrote a short batch script that iterates through the files of a directory and its subdirectories. In total there are more than a million files. My batch works as intended if I use it for smaller numbers of files and directories, but if I try to use it for all of them, it just seems to never stop working. My impression is that the script needs to "check" every file before I get any output. So my question is: is there a way to get this done faster, or at least to test whether the batch is working at all?

Here is my example code:

FOR /F "delims=*" %%i IN ('dir /s /b *.txt') do echo "test"

Thank you in advance!

Largo

5 Answers


EDITED to include information discussed in comments

The original answer to this question was

for /r "c:\startingPoint" %%a in (*.txt) do echo %%~fa

which works as intended by the OP: it will recursively process the files as they are located on disk, with no wait or pause, or at least with no unnecessary pause (of course, the first file has to be found first).
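As a quick check that the loop is actually running (the second part of the question), it can echo each file the moment it is found; c:\startingPoint is just the example root from the line above:

@echo off
rem Prints every .txt file as soon as for /r locates it; no full list is built first.
for /r "c:\startingPoint" %%a in (*.txt) do echo found: %%~fa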

What is the difference between this answer and the original code

FOR /F "delims=*" %%i IN ('dir /s /b *.txt') do echo "test"

in the question?

In general, for /f is used to iterate over a set of lines instead of a set of files, executing the code in the body of the for command once for each line. The in clause of the command defines "where" the set of lines is retrieved from.

This "where" can be a file on disk to be read or a command or set of commands to execute and whose output will be processed. In both cases, all the data is fully retrieved before start processing it. Until all the data is in a memory buffer, the code in the body of the for command is not executed.

And this is where a difference appears.

When a file on disk is read, for /f gets the size of the file, allocates a memory buffer big enough to accommodate the full file, reads the file into the buffer, and starts to process the buffer (and of course, you cannot use for /f to process a file bigger than the available memory).

But when for /f processes a command, it allocates a starting buffer and appends data to it from the stdout stream of the executed command. When the buffer is full, a new, larger buffer is allocated, the data from the old buffer is copied into the new one, and the old buffer is discarded; new data is then retrieved into the appropriate place in the new buffer. This process is repeated each time the buffer fills, and it is exacerbated by the fact that the buffer is increased in small amounts.

So, when the data generated by the command is very large, a lot of memory is allocated, copied, and freed. And this needs time. For large data, a lot of time.

Summarizing: if for /f is used to process the output of a command and the data to process is large, the time needed to do it grows much faster than linearly with the amount of data (roughly quadratically, given the small increments of the copy-and-grow cycle described above).

How to avoid it? The problem (in these cases) is retrieving the data from the command, not processing it. So, when the volume of data is really big, instead of the usual for /f %%a in ('command') .... syntax, it is better to execute the command redirecting its output to a temporary file, and then use for /f to process that file. Generating the data will take the same amount of time, but the processing delay can go from hours down to seconds or minutes.
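A minimal sketch of that workaround (the temporary file name and the %time% echoes are illustrative additions, not part of the original command):

@echo off
rem Generate the file list once; this takes as long as the dir itself needs.
dir /s /b *.txt>"%temp%\filelist.tmp"

rem Parsing the file is the fast part: for /f allocates one buffer up front.
echo %time% - parsing starts
for /f "usebackq delims=" %%a in ("%temp%\filelist.tmp") do >nul echo %%a
echo %time% - parsing done

del "%temp%\filelist.tmp"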

MC ND
  • Thank you so much, this is exactly what I was looking for. One additional question: Why do you use %%~fa instead of %%a? – Largo Apr 30 '14 at 06:15
  • @Daniel, just to be sure of what I get: the `~f` added to the `%%a` replaceable parameter returns the full path to the file. Depending on the exact syntax/options used in the `for` command, it sometimes returns full paths, sometimes just filenames. Explicitly saying what I want reduces errors if I later change the options in the `for`. – MC ND Apr 30 '14 at 06:37
  • It's actually a bug in `for /f` as it parses large lists. With, say, 500,000 files, if the filenames are long enough, it sits and does nothing for far longer than parsing a list of filenames should take. The time delay with more filenames is also exponential once the flaw is triggered and can be measured in periods exceeding, say, half an hour. – foxidrive May 04 '14 at 12:12
  • @foxidrive, the problem happens only when `for /f` processes the output of a command, not when a file is read. Having to retrieve all the data before processing starts, without knowing how much data there will be, `for /f` creates a buffer; heap memory reallocation is used every time the buffer fills. In each reallocation the buffer is increased, but the increment is very small, so there is a lot of reallocation. When `for /f` reads a file, it creates a buffer big enough to hold all the data and then reads the file into it. This is a lot faster for the same data size/number of lines. – MC ND May 05 '14 at 12:22
  • I don't know how cmd processes it, but it's still a bug in `for /f` when it can take half an hour or over an hour to parse a list. People need to be aware that the issue exists for large sets of text, and of the workaround for it. – foxidrive May 05 '14 at 14:15
  • @foxidrive, I have to agree with you. I've updated the answer to include the information in the comments. Hope this helps someone. – MC ND May 05 '14 at 17:53
  • Oooh, nice logical explanation. +1 – dbenham May 28 '14 at 17:22

Performance: Iterating over all files in a directory and all its subdirectories is not fast. I don't know it for sure, but I think the batch has to check each directory directly on your hard drive, and accessing the hard drive is always slow. If you want to speed it up, you could use one batch that splits the directories into smaller sets and passes these sets to other scripts that do the real work, as sketched below.
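A minimal sketch of that idea, assuming a hypothetical worker script worker.cmd and the example root C:\startingPoint: the dispatcher hands each top-level subdirectory to its own worker process.

@echo off
rem dispatcher.cmd: launch one worker per top-level subdirectory
for /d %%d in ("C:\startingPoint\*") do start "" worker.cmd "%%~fd"

where worker.cmd does the real work on the portion it receives:

@echo off
rem worker.cmd: process all *.txt below the directory passed as %1
for /r "%~1" %%a in (*.txt) do echo treating: %%~fa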

Progress: I don't know the exact answer in terms of syntax, but if you use the echo command to display the current file you are iterating over, you can see whether the batch is running.

Andre
  • Problem is, before processing starts at all, `dir /s /b` has to finish - which can eat **much** time - without any chance to output anything. – Stephan Apr 30 '14 at 07:23

Try it like this:

@echo off
FOR /F "delims=*" %%i IN ('dir /s /b *.txt') do (
    rem clear the screen and show the file currently being treated
    cls
    echo treating : [%%i]
)
echo Done.....
SachaDee

With no clear idea of what you are actually trying to do and one nominal line of code which does nothing of consequence, it's very difficult to make anything other than a general comment.

If you were to change your batchette to

FOR /F "delims=*" %%i IN ('dir /s /b %1*.txt') do echo "test"

and invoke it with

for %%a in (a b c d...x y z 0 1..9) do start yourlittlabatch %%a

(I'll assume you'd have the sense to realise that d..x means all of the characters d to x - I'll not list them for you - and that you'd also need to include in the list any non-alphanumeric initial characters in use)

then you'd get 36 processes running in parallel, each dealing with a portion of the target structure. This should be quicker if you've got a multiprocessor machine - and obviously, I'm also assuming a reasonably even distribution of your filenames' initial characters.
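For illustration, spelled out in full (yourlittlabatch is the answer's placeholder name for your modified batch), that driver line would become:

for %%a in (a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9) do start yourlittlabatch %%a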

The more information you provide, the fewer assumptions we have to make...

Magoo

There is a bug in for /f when it has to process many files, and it is made worse by long filenames (the total amount of data being parsed as filenames is the overriding factor). It can sit and do nothing for over an hour, simply parsing the list.

The solution is to redirect the output of the dir command into a file, and then use that file in the for /f command:

dir /s /b /a-d *.txt >file.tmp

FOR /F "delims=*" %%i IN (file.tmp) do echo "test"
foxidrive