
I have one file called uniq.txt (20,000 lines).

head uniq.txt 
1 
103 
10357 
1124 
1126 

I have another file called all.txt (106,371,111 lines).

head all.txt
cg0001  ?   1   -0.394991215660192 
cg0001  AB  103 -0.502535661820095 
cg0002  A   10357   -0.563632386999913 
cg0003  ?   1   -0.394991215660444 
cg0004  ?   1   -0.502535661820095 
cg0004  A   10357   -0.563632386999913 
cg0003  AB  103 -0.64926706504459 

I would like to create 20,000 new files from all.txt, one for each value in uniq.txt, matching on the third column. For example,

head 1.newfile.txt 
cg0001  ?   1   -0.394991215660192 
cg0003  ?   1   -0.394991215660444 
cg0004  ?   1   -0.502535661820095 

head 103.newfile.txt 
cg0001  AB  103 -0.502535661820095 
cg0003  AB  103 -0.64926706504459 

head 10357.newfile.txt 
cg0002  A   10357   -0.563632386999913 
cg0004  A   10357   -0.563632386999913 

Is there any way I can create these 20,000 files really fast? My current script takes about 1 minute to create each new file; I suspect it scans the whole of all.txt every time it creates a new file.

sony

4 Answers


You can try it with awk. Ideally you wouldn't need >> in awk, but since you have stated there will be 20,000 files, we don't want to exhaust the system's resources by keeping too many files open.

awk '
    NR==FNR { names[$0]++; next }                                            # first file: remember every id from uniq.txt
    ($3 in names) { file=$3".newfile.txt"; print $0 >> (file); close(file) } # second file: append the line, then close the file
' uniq.txt all.txt

This first reads the uniq.txt file into memory, creating a lookup table of sorts. It then reads through the all.txt file and appends each matching line to the corresponding output file.
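
If you happen to be running GNU awk (gawk), a variant of the same idea can skip the per-line close(), because gawk multiplexes its open files when it runs out of file descriptors. This is only a sketch under that assumption, not part of the original answer:

gawk '
    NR==FNR { names[$0]; next }                   # load the ids from uniq.txt
    $3 in names { print > ($3 ".newfile.txt") }   # let gawk manage the many open files
' uniq.txt all.txt

Avoiding an open/append/close cycle for every line is where most of the extra speed would come from.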

jaypal singh
    way to go Jaypal. @sony this will be a very efficient solution to your problem. Good luck to all ;-)! – shellter Apr 25 '16 at 02:23

This uses a while loop. It may or may not be the quickest way, but give it a try:

lines_to_files.sh

#!/bin/bash

while IFS='' read -r line || [[ -n "$line" ]]; do
    num=$(echo "$line" | awk '{print $3}') 
    echo "$line" >> /path/to/save/${num}_newfile.txt
done < "$1"

usage:

$ ./lines_to_files.sh all.txt

This should create a new file for each distinct value in the third column of your all.txt file. As it reads each line, it appends it to the appropriate file. Keep in mind that if you run the script again, it will append to the data that is already there for each file.
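
Because of that, you may want to clear out any previous output before a fresh run. A minimal sketch, assuming the output files live in /path/to/save and follow the naming scheme above:

rm -f /path/to/save/*_newfile.txt
./lines_to_files.sh all.txt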

An explanation of the while loop used above for reading the file can be found here:

https://stackoverflow.com/a/10929511/499581

l'L'l

You can read each line into a Bash array, then append to the file named after the number in column three (array index 2):

#!/bin/bash

while read -ra arr; do
    echo "${arr[@]}" >> "${arr[2]}".newfile.txt
done < all.txt

This creates space-separated output. If you prefer tab-separated output, it depends a bit on your input data: if the input is tab separated as well, you can just set IFS to a tab to get tab-separated output:

IFS=$'\t'
while read -ra arr; do
    echo "${arr[*]}" >> "${arr[2]}".newfile.txt
done < all.txt

Notice the change in how the array is printed: the * is now actually required, because "${arr[*]}" joins the elements with the first character of IFS, whereas "${arr[@]}" expands them as separate words.
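
To see the difference, here is a tiny demo with made-up values (not taken from the real data):

arr=(cg0001 AB 103 -0.5)
IFS=$'\t'
echo "${arr[*]}"   # one word, elements joined by tabs
echo "${arr[@]}"   # separate words, which echo then prints space separated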

Or, if the input data is not tab separated (or we don't know), we can set IFS in a subshell in each loop:

while read -ra arr; do
    ( IFS=$'\t'; echo "${arr[*]}" >> "${arr[2]}".newfile.txt )
done < all.txt

I'm not sure which is more expensive, spawning a subshell or a few parameter assignments, but I suspect it's the subshell. To avoid spawning it, we can set and reset IFS in each loop iteration instead:

while read -ra arr; do
    old_ifs="$IFS"
    IFS=$'\t'
    echo "${arr[*]}" >> "${arr[2]}".newfile.txt
    IFS="$old_ifs"
done < all.txt
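
If juggling IFS feels fiddly, and assuming every line really has exactly four fields as in the sample data, printf can produce tab-separated output directly. This is a sketch, not part of the original answer:

while read -ra arr; do
    printf '%s\t%s\t%s\t%s\n' "${arr[@]}" >> "${arr[2]}".newfile.txt
done < all.txt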
Benjamin W.
  • Thank you for your suggestion. Your way certainly works, but awk turned out to be the fastest @jaypal singh – sony Apr 28 '16 at 02:02

OP asked for fast ways. This is the fastest I've found.

sort -S 4G -k3,3 all.txt |
  awk '{if(last!=$3){close(file); file=$3".newfile.txt"; last=$3} print $0 > file}'

Total time was 2m4.910s vs 10m4.058s for the runner-up. Note that it uses 4 GB of memory for the sort (possibly faster with more, definitely slower with less) and that it ignores uniq.txt.
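
If the uniq.txt filter does matter, one option is to pre-filter with awk before sorting, reusing the lookup idea from the awk answer above. This is a sketch only, and not benchmarked:

awk 'NR==FNR { names[$0]; next } $3 in names' uniq.txt all.txt |
  sort -S 4G -k3,3 |
  awk '{if(last!=$3){close(file); file=$3".newfile.txt"; last=$3} print $0 > file}'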

Results for full-sized input files (100,000,000-line all.txt, 20,000-line uniq.txt):

sort awk write            me              ~800,000 input lines/second
awk append                @jaypal-singh   ~200,000 input lines/second
bash append               @benjamin-w      ~15,000 input lines/second
bash append + extra awk   @lll               ~2000 input lines/second

Here's how I created the test files:

seq 1 20000 | sort -R | sed 's/.*/cg0001\tAB\t&\t-0.502535661820095/' > tmp.txt
seq 1 5000 | while read i; do cat tmp.txt; done > all.txt
seq 1 20000 | sort -R > uniq.txt
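
A quick sanity check on the generated test files, using nothing beyond wc and head:

wc -l all.txt uniq.txt    # expect 100,000,000 and 20,000 lines (plus a total)
head -3 all.txt           # spot-check the four-column, tab-separated layout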

PS: Apologies for the flaw in my original test setup.

webb
  • Thank you for suggesting @webb. However, I cannot make it to work using your way. I tried the following, while read -ra arr; do > old_ifs="$IFS" > IFS=$'\t' > grep "${arr[*]}" all.txt > "${arr[2]}".newfile.txt > done < all.txt This way took so long.. Did I understand your suggestion correctly? – sony Apr 28 '16 at 01:59
  • @sony, sorry, there was a typo. fixed. also updated to work instead with tab-delimited `all.txt`. it works on my test files just as it is, no `arr`s. – webb Apr 28 '16 at 03:38
  • I'm actually surprised that awk is just 30% faster than a Bash `while` loop. Your solution processes the 1,000,000 lines long `all.txt` 20,000 times - is grep *that* fast? Or is closing the file handles in awk so expensive? – Benjamin W. Apr 28 '16 at 04:21
  • thanks for leading me to find an error in how i setup my first tests! – webb Apr 28 '16 at 21:46