
I need to accomplish the following things with bash scripting in FreeBSD:

  • Create a directory.
  • Generate 1000 unique files whose names are taken from other random files in the system.
  • Each file must contain information about the original file whose name it has taken - name and size without the original contents of the file.
  • The script must show information about the speed of its execution in ms.

What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't imagine how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it but somehow I can't get it to work and I don't know how to do the other tasks as well...

Mateusz Piotrowski
Boyko Arsov
  • `basename` should have worked, you need to show your code. – Barmar May 23 '13 at 19:36
  • In my long answer below I forgot to add "information about the speed of execution". You can just run the script using ```time``` like this: ```time ./1000_files.sh``` It will give you output like this: ```4.007u 16.984s 0:15.18 138.2% 67+2117k 0+2000io 0pf+0w``` – G. Cito May 30 '13 at 15:13

2 Answers


[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you never can know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and - because perl is pretty portable - make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]

Creating the file listing

You can do a lot with GNU find(1). The following would create a single listing file with three tab-separated columns of the data you want (name of file, location, size in kilobytes).

find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'

I'm assuming that you want to be random across all filenames (i.e. no links), so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has ~300K files and not much memory, but creating the complete listing still only took a couple of minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
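
For example, here is a minimal sketch of how certain directories might be pruned from the search (GNU find again; the excluded paths /proc and /tmp are only illustrative and should be adjusted for your own system):

find / \( -path /proc -o -path /tmp \) -prune -o -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'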

A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq, you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example, from the shell prompt:

~/# touch `cut -f1 tmp.txt`
~/# for i in `cut -f1 tmp.txt`; do grep "$i" tmp.txt > "$i.dat" ; done

I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them. You don't have to do that: just leave off the extension ($i > $i).
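
If the listing contains file names with spaces, the for loop above will still split them apart. A rough, space-safe variant (a sketch only, assuming the same three tab-separated columns produced by the -fprintf command above) reads the listing line by line instead:

while IFS="$(printf '\t')" read -r name path size
do
    size=${size% }    # drop the trailing space left by the '%k ' format string
    printf 'Location: %s\nSize: %sK\n' "$path" "$size" > "$name.dat"
done < tmp.txt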

The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag, so it won't be available on OS X or BSD find(1) (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight-up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool; coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing, as Florin Stingaciu shows in his answer.

#!/bin/sh

# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum


for file in `cat tmp.txt`
do
    name=`basename "$file"`
    size=`wc -c "$file" | awk '{print $1}'`

# Uncomment the next line to see the values on STDOUT
#    printf "Location: $file \nSize: $size \n"

# Uncomment the next line to put data into the respective .dat files
#    printf "Location: $file \nSize: $size \n" > "$name.dat"

done

# vim: ft=sh

If you've been following this far you'll realize that this will create a lot of files - on my workstation this would create 800k .dat files, which is not what we want! So, how do we randomly select 1000 files from our listing of 800k for processing? There are several ways to go about it.

Randomly selecting from the file listing

We have a listing of all the files on the system (!). Now in order to select 1000 files we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit for the line number to select by generating a random number using the cool od technique you saw above - it's so cool and cross-platform that I have this aliased in my shell ;-) - then performing modulo division (%) on it using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow!

While looking for a way to speed up awk and sed I came across an excellent trick using dd from Alex Lines that searches the file by bytes (instead of lines) and translates the result into a line using sed or awk. See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain) - perhaps because my locale is LC_ALL=en_US.UTF-8 - dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
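
For comparison, here is a bare-bones sketch of that slow line-oriented approach (random number via the od trick, modulo the line count, then sed to print that line). It assumes the tmp.txt listing from above and is portable, just painfully slow on a big listing:

lines=`wc -l < tmp.txt`
rm -f randlist.txt
i=1
while [ $i -le 1000 ]
do
   r=`od -An -N 4 -D < /dev/urandom`
   n=`echo "$r % $lines + 1" | bc`     # a line number between 1 and $lines
   sed -n "${n}p" tmp.txt >> randlist.txt
   true $((i=i+1))
done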

So after the above caveats and hoping it works on more than two platforms, here's my attempt at solving the problem:

#!/bin/sh
IFS='
'                                                                                
# We create tmp.txt with                                                        
# find / -type f > tmp.txt  # tweak as needed.                                  
#                                                                               
files="tmp.txt"                                                           

# Get the size of the file in bytes and the maximum line length for later
bytesize=`wc -c < $files`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length; y = $0} }END{print x*10}' $files`

# A function to generate a random number modulo the                             
# number of bytes in the file. We'll use this to find a                         
# random location in our file where we can grab a line                          
# using dd and sed. 

genrand () {                                                                    
  echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc                     
}                                                                               

rm -f randlist.txt                                                             

i=1                                                                             
while [ $i -le 1000 ]                                                          
do                             
 # This probably works but is way too slow: sed -n `genrand`p $files                
 # Instead, use Alex Lines' dd seek method:
 dd if=$files skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null |awk 'NR==2 {print;exit}'>> randlist.txt

 true $((i=i+1))    # Bourne shell equivalent of $i++ iteration    
done  

for file in `cat randlist.txt`
  do
   name=`basename "$file"`
   size=`wc -c <"$file"`
   echo -e "Location: $file \n\n Size: $size" > "$name.dat"
  done

# vim: ft=sh 
G. Cito
  • Note that the huge file listing we create with "find / -type f" includes **all** the files on the system, some of which are temporary files that will no longer exist when we do our random selection. This would especially be the case on a busy server or system, or where users' caches (from ```chrome``` or ```firefox```) are being frequently updated. In this case the ```*.dat``` file still gets created but with missing file size information. The final script could add a check for this error. – G. Cito May 30 '13 at 14:58
  • Some references used for this answer: http://stackoverflow.com/questions/701505/best-way-to-choose-a-random-file-from-a-directory-in-a-shell-script http://stackoverflow.com/questions/414164/how-can-i-select-random-files-from-a-directory-in-bash http://mywiki.wooledge.org/ParsingLs http://tumblr.machinetext.com/post/4997828856/selecting-a-random-line-from-a-file Also see http://www.commandlinefu.com; don't keep your oneliners hidden on your hard drive, put them on the interwebs! OTOH I've learned a lot by remaking the same errors because I didn't have my ```oneliners.txt``` with me ;-) – G. Cito May 30 '13 at 15:06
  • Could replace ```size=`wc -c "$file" |awk '{print $1}'` ``` with ```size=`wc -c <"$file"` ``` as glenn jackman notes above. – G. Cito May 30 '13 at 16:39
  • Thanks Radix :-) Another tweak would be to remove ```printf```, which seems like it might be sensitive to Unicode characters appearing in the ```$name``` variable. If all we're doing is embedding newlines in the string being "```echo```ed", then ```echo -e``` is a better choice. In fact it's so obvious that I'm going to have to make a change to the script :-) – G. Cito May 31 '13 at 18:45
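
A hedged sketch of the existence check mentioned in the first comment above, dropped into the final loop over randlist.txt (one way it could look, not necessarily the author's intended fix):

for file in `cat randlist.txt`
  do
   [ -f "$file" ] || continue    # skip entries that vanished since the listing was made
   name=`basename "$file"`
   size=`wc -c <"$file"`
   echo -e "Location: $file \n\n Size: $size" > "$name.dat"
  done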

What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list

I'm going to assume that there is a file that holds, on each line, a full path to each file (FULL_PATH_TO_LIST_FILE). Since there isn't much in the way of statistics associated with this process, I omitted that; you can add your own, however.

cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
for file_path in `cat FULL_PATH_TO_LIST_FILE`
do
     ## This extracts only the file name from the path
     file_name=`basename "$file_path"`

     ## This grabs the file's size in bytes
     file_size=`wc -c < "$file_path"`

     ## Create the file and place info regarding original file within new file
     echo -e "$file_name \nThis file is $file_size bytes " > "$file_name"

done
Florin Stingaciu
  • Isn't that a complicated way to do what `basename` does? – Barmar May 23 '13 at 19:37
  • @Barmar Extracting the file name could've been done using basename. I didn't know about it until now. – Florin Stingaciu May 23 '13 at 19:39
  • He mentioned it in the question. – Barmar May 23 '13 at 19:39
  • There are many bad practices in your sample code that you are using to teach others. First, this doesn't work if a file contains a space. It attempts to parse the output of `ls`, which is unreliable. It opens the same file twice (which can't have a space in the name), just to write 2 lines. – jordanm May 23 '13 at 19:45
  • `wc -c "$file"` gives you the name of the file in the output. Use either `wc -c <"$file"` or `stat -c %s "$file"` – glenn jackman May 23 '13 at 21:30
  • ```wc -c <"$file"``` is more portable than anything you can do with ```stat```, which has to be the most incompatible of all system utility commands to be based on a POSIX standard system call, to wit: ```stat()``` and ```fstat()```. Comparing the ```stat(2)``` and ```stat(1)``` manual pages across different platforms tells the tale. – G. Cito May 30 '13 at 16:29
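
For the record, a quick illustration of that incompatibility (these are the commonly documented flag forms; check your local manual pages):

stat -c %s "$file"    # GNU coreutils stat (most Linux systems)
stat -f %z "$file"    # BSD and OS X stat
wc -c < "$file"       # portable everywhere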