
I have a file, users.txt, containing words like:

user1
user2
user3

I want to find these words in another file, data.txt, and add a prefix to each of them. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1, and so on. I have written a simple shell script like this:

for user in `cat users.txt`
do
    sed -i 's/'${user}'/New_&/' data.txt
done

For ~1000 words, this program takes minutes to finish, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions in Optimize shell script for multiple sed replacements, but did not see much improvement.

Is there any other way to make this process faster?

user3150037

3 Answers

Sed is known to be very fast (probably second only to a hand-written C program).

Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter skips the substitution entirely on lines that do not match /X/, which can make it faster when most lines are unaffected.
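
A minimal, self-contained check (sample file and paths are made up for illustration) that both forms produce the same result:

```shell
# Create a small sample file (hypothetical data, just for illustration).
printf 'hello xxx world\nno match here\nxxx again\n' > /tmp/sample.txt

# Plain substitution: sed attempts s/// on every line.
sed 's/xxx/yyy/g' /tmp/sample.txt > /tmp/out1.txt

# Address-prefixed form: lines not matching /xxx/ skip the substitution.
sed '/xxx/ s/xxx/yyy/g' /tmp/sample.txt > /tmp/out2.txt

# Both produce identical output; the second form can be faster on files
# where most lines do not contain the pattern.
diff /tmp/out1.txt /tmp/out2.txt && echo "outputs identical"
```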

Since sed operates with line-at-a-time semantics here, you can also run it with parallel (on multi-core CPUs) like this:

cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'

If you are working with plain ASCII files, you can speed things up by using the "C" locale:

LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
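
A rough way to compare the two locales yourself (the test file and its size are illustrative; LC_ALL=C disables multibyte character handling, which speeds up regex matching on plain-ASCII input):

```shell
# Build a large ASCII test file (200k lines, hypothetical content).
seq 200000 | sed 's/$/ xxx line/' > /tmp/big.txt

# Same command, default locale vs. C locale; compare the timings.
time sed '/xxx/ s/xxx/yyy/g' /tmp/big.txt > /dev/null
time LC_ALL=C sed '/xxx/ s/xxx/yyy/g' /tmp/big.txt > /dev/null
```

The substitution result is identical in both locales for ASCII input; only the speed differs.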
blackpen

You can turn your users.txt into sed commands like this:

$ sed 's|.*|s/&/New_&/|' users.txt 
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/

And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:

sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt

Your approach scans all of data.txt once for every single line in users.txt, which is what makes it slow.

If you can't use process substitution, you can use

sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt

instead.

Benjamin W.
  • Thanks for quick answer Benjamin :). I have tried this approach but still it takes nearly 1 min to finish for ~1000 entries in users.txt – user3150037 Oct 04 '16 at 17:18
  • 1
    @user3150037 I don't think you can get much faster with sed - it still has to go through all of `data.txt` and try all substitutions. A faster approach would be to find a pattern that describes all of the words in `users.txt`, then you could work with just one substitution. We'd have to see more of `users.txt` for that, though, with the real data. – Benjamin W. Oct 04 '16 at 17:19
  • users.txt is real data but with lot of entries and data.txt has also similar data but users range is very high (~500K). – user3150037 Oct 04 '16 at 17:28
  • @user3150037 Then I don't think sed can get you anything much faster. Awk or Perl are often faster. – Benjamin W. Oct 04 '16 at 17:30
  • How about conflicting names like jami, ben and benjamin? – Walter A Oct 04 '16 at 21:37
  • @WalterA Depends on the actual contents of `users.txt` and `data.txt` - could use word boundary anchors or the like if I knew how they appear in each. – Benjamin W. Oct 04 '16 at 21:56
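
Following the suggestion in the comments that awk is often faster, here is a hedged sketch of the single-pass idea: load all names from users.txt into an array, then prefix exact whole-word matches in one pass over data.txt. The file contents below are made up; real data may have punctuation attached to names, which this field-based match would miss.

```shell
# Sample inputs (hypothetical contents, for illustration only).
cat > /tmp/users.txt <<'EOF'
user1
user2
user3
EOF

cat > /tmp/data.txt <<'EOF'
login by user1 ok
user2 and user3 present
user10 must stay untouched
EOF

awk '
  NR == FNR { names[$0]; next }   # first file: remember each name
  {
    for (i = 1; i <= NF; i++)     # second file: prefix exact field matches
      if ($i in names)
        $i = "New_" $i
    print
  }
' /tmp/users.txt /tmp/data.txt
```

Matching whole fields avoids the user1/user10 conflict mentioned above, but note that awk rebuilds a modified line with single spaces, so unusual whitespace is normalized; adapt the match to the real data.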
Alternatively, we can do it in one go. Say we have a data file with 500k lines:

$> wc -l data.txt
500001 data.txt

$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct  5 00:25 data.txt

$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe

Say our users.txt has a few keywords that are to be prefixed with "ab_" in data.txt:

$> cat users.txt
file
maybe
test

So we want to read users.txt and, for every word in it, change that word to a prefixed one in data.txt, e.g. "file" to "ab_file" and "maybe" to "ab_maybe".

We can run a while loop that reads the words to be prefixed one by one, and run a perl command over the file for each word. In the example below, the current word is passed to the perl command as ${word}.

I timed this task and it completes fairly quickly. I ran it on a CentOS 7 VM hosted on my Windows 10 machine.

time cat users.txt |while read word; do  perl -pi -e "s/${word}/ab_${word}/g" data.txt; done        
real    0m1.973s
user    0m1.846s
sys     0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe

In the code above, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.

cheers, Gaurav

User9102d82