Another approach to apply RIPEMD in CSV file

Question

I am looking for another approach to apply RIPEMD-160 to the second column of a CSV file.

Here is my code:

awk -F "," -v env_var="$key" '{
    tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
    if ( (tmp | getline cksum) > 0 ) {
        $3 = toupper(cksum)
    }
    close(tmp)
    print
}' /test/source.csv > /ziel.csv

I ran it on a big CSV file (1 GB). After 2 days I have only 100 MB of output, which means I would need to wait a month to get the whole new CSV.

Can you help me with another idea or approach to get my data faster?

Thanks in advance

For starters, you should benchmark your current code with `time` (e.g. `time awk -F "," -v..`) and the theoretical limit of your CPU with `openssl speed ripemd160`. This will tell you how far you are from what is achievable. And if you are close to the maximum speed, there’s not much you can do to optimize. – vdavid Apr 06 '17 at 12:39
Most of the time is lost not in openssl itself but in the forking and I/O to the shell inside awk, and in the limited resources used (no parallelism, one fork at a time). Optimising this will normally give a much better time than a linear reduction (even if that is certainly already a good gain). – NeronLeVelu Apr 06 '17 at 13:34
It would be helpful to have a few lines from the CSV file and the expected output you want. – Ole Tange Apr 09 '17 at 09:20

score 1 · Answer 1 · answered Apr 06 '17 at 12:01

You can use GNU Parallel to speed up the output by running the awk command in parallel. For an explanation, check here.

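# -q quotes the command so the awk program reaches awk intact instead of being re-parsed by the shell
cat /test/source.csv | parallel --pipe -q awk -F "," -v env_var="$key" '{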
    tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
    if ( (tmp | getline cksum) > 0 ) {
        $3 = toupper(cksum)
    }
    close(tmp)
    print
}' > /ziel.csv

VIPIN KUMAR


I am working with Cygwin and I am getting "parallel not found". Can I install it and use it within Cygwin? – Houssem Apr 06 '17 at 12:41
Check this link: http://stackoverflow.com/questions/37212894/which-cygwin-package-to-get-parallel-command – VIPIN KUMAR Apr 06 '17 at 12:48
Not all versions of parallel support the `--pipe` option. Make sure to install a parallel that supports it. – vdavid Apr 06 '17 at 12:53
@Houssem - Let's ask the developer of the parallel command for help. – VIPIN KUMAR Apr 06 '17 at 13:03
@OleTange - Could you please share your suggestion for solving this question? – VIPIN KUMAR Apr 06 '17 at 13:03

NeronLeVelu · Answer 2 · 2017-04-07T08:13:48.097

0

# prepare a batch (to avoid forking from awk)
awk -F "," -v env_var="$key" '
    BEGIN {
       print "if [ -r /tmp/MD160.Result ];then rm /tmp/MD160.Result;fi"
       }
    {
    print "echo \"\$( echo -n \047" $2 env_var "\047 | openssl ripemd160 )\" >> /tmp/MD160.Result"
    } ' /test/source.csv > /tmp/MD160.eval

# evaluate the digest for each line by sourcing the batch (should be faster)
. /tmp/MD160.eval

# take result and adapt for output
awk '
   # load MD160
   FNR == NR { m[NR] = toupper($2); next }
   # set FS to ","
   FNR == 1 { FS = ","; $0 = $0 "" }
   # adapt original line
   { $3 = m[FNR]; print}
   ' /tmp/MD160.Result /test/source.csv   > /ziel.csv

Note:

not tested (so the print may need some tuning of the escaping)
no error handling (it assumes everything is OK). I advise running some tests (for example, including a line reference in each reply and checking it in the second awk).
forking at batch level will be a lot faster than forking from awk, which involves a piped fork and catching the reply
I am not a specialist in openssl ripemd160, but there may be another way to process the elements in bulk without opening a new fork every time for the same file/source

edited Apr 07 '17 at 08:13

answered Apr 06 '17 at 11:39

NeronLeVelu


Sorry, but it doesn't work. For each line of the eval file I am getting C:/hanatest/MD160.eval: line 1: (stdin)=: command not found – Houssem Apr 06 '17 at 12:39
A first echo was missing before `$(...) >>`; adapted. – NeronLeVelu Apr 06 '17 at 13:28
Can you please write it again? I don't really understand where I should write the echo. Thanks in advance – Houssem Apr 06 '17 at 13:39
The code is already adapted, so just copy/paste the current version. – NeronLeVelu Apr 06 '17 at 14:24
What is slower? I tested it here: `time` gives me a 33% gain on a small file and nearly 50% on a bigger file. – NeronLeVelu Apr 07 '17 at 08:23

Ole Tange · Answer 3 · 2017-04-09T18:26:08.643

Your solution hits Cygwin where it hurts the most: spawning new programs. Cygwin is terribly slow at this.

You can make this faster by using all cores in your computer, but it will still be very slow.
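To get a feel for how much the spawning itself costs, here is a minimal Python 3 sketch that times hashing inside one process against starting one trivial child process per value. The function names and the `algo` parameter are made up for this illustration. The child process does no hashing at all, so the second number is spawn overhead only, and the real pipeline (echo, openssl, cut) starts several programs per row. Note that `ripemd160` is only available in `hashlib` if the OpenSSL library behind your Python provides it.

```python
import hashlib
import subprocess
import sys
import time


def time_in_process(values, algo="ripemd160"):
    """Hash every value inside this process; return elapsed seconds."""
    start = time.perf_counter()
    for value in values:
        hashlib.new(algo, value.encode()).hexdigest()
    return time.perf_counter() - start


def time_with_spawn(values):
    """Start one trivial child process per value; return elapsed seconds.

    The child only runs "pass", so this measures spawn cost and nothing else.
    """
    start = time.perf_counter()
    for _ in values:
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start
```

Calling both functions with the same list of, say, 100 sample values and comparing the two results should show the gap on your own machine.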

You need a program that does not start other programs to compute the RIPEMD sum. Here is a small Python script that takes the CSV on standard input and outputs the CSV on standard output with the second column replaced with the RIPEMD sum.

riper.py:

#!/usr/bin/python

import hashlib
import fileinput
import os

key = os.environ['key']

for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip().split(",")
    # Compute RIPEMD on column 2
    h = hashlib.new('ripemd160')
    # hashlib needs bytes in Python 3, so encode the text first
    h.update((col[1] + key).encode())
    # Update column 2 with the hexdigest
    col[1] = h.hexdigest().upper()
    print(','.join(col))

Now you can run:

cat source.csv | key=a python riper.py > ziel.csv
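If the CSV contains quoted fields with commas inside them, the naive split on "," will cut those fields apart. A possible variant, sketched here for Python 3 with the standard `csv` module, is shown below. The function name `hash_second_column` and its `algo` parameter are assumptions made for this example, and the small demo uses sha256 only because it is always available in `hashlib`, whereas `ripemd160` depends on the OpenSSL build.

```python
import csv
import hashlib
import io


def hash_second_column(src, dst, key, algo="ripemd160"):
    """Copy CSV rows from src to dst, replacing column 2 with its keyed digest."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if len(row) > 1:
            h = hashlib.new(algo)
            h.update((row[1] + key).encode())
            row[1] = h.hexdigest().upper()
        writer.writerow(row)


# Small demo on in-memory data; the second field of row 1 contains a comma.
demo_in = io.StringIO('1,"a,b",x\n2,c,y\n')
demo_out = io.StringIO()
hash_second_column(demo_in, demo_out, key="k", algo="sha256")
```

For real use, the same function could be called with `sys.stdin`, `sys.stdout` and `os.environ['key']`, leaving `algo` at its default.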

Run like this, riper.py will still only use a single core of your system. To use all cores, GNU Parallel can help. If you do not have GNU Parallel 20161222 or newer in your package system, it can be installed as:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

You will need Perl installed to run GNU Parallel:

key=a
export key
parallel --pipe-part --block -1 -a source.csv -k python riper.py > ziel.csv

This will chop source.csv on the fly into one block per CPU core and run the Python script on each block. On my 8-core machine this processes a 1 GB file with 139482000 lines in 300 seconds.
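The idea of chopping a file into one block per core without reading it all first can be sketched as follows in Python 3. This is not GNU Parallel's actual implementation, only an illustration of the principle: jump to an evenly spaced byte offset, then move forward to the next newline so that no line is split. The function name `block_ranges` is made up for this example.

```python
import os
import tempfile


def block_ranges(path, n_blocks):
    """Return (start, end) byte ranges covering the file, each ending on a line boundary."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        for i in range(1, n_blocks + 1):
            if start >= size:
                break
            # Aim for an even split, then extend to the end of the current line.
            target = max(start, size * i // n_blocks)
            f.seek(target)
            f.readline()
            end = min(f.tell(), size) if i < n_blocks else size
            ranges.append((start, end))
            start = end
    return ranges


# Demo on a small temporary file split into two blocks.
demo_data = b"1,a\n2,bb\n3,ccc\n4,dddd\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(demo_data)
demo_ranges = block_ranges(tmp.name, 2)
with open(tmp.name, "rb") as f:
    demo_blocks = [f.read(end - start) for start, end in demo_ranges]
os.remove(tmp.name)
```

Each range could then be handed to a separate worker process that reads only its own slice of the file.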

If you need it faster still, you will need to convert riper.py to a compiled language (e.g. C).
