
I have these in a file under CentOS:

real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real2 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real3 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
512corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
513corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

There are two blocks, "real" and "corr", and each block may contain several sub-entries (real1, real2, and so on).

I would like the sub-entries of each block joined onto one line. The output should look like:

real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

To do this in EditPlus, taking the real block as an example, I highlight the whole real block, find all occurrences of \nreal\d+\n, and replace them with \t\t.

The challenges are:

  1. How to select multiple lines in sed. For example, one real block may run from line 5 to line 10 and another from line 30 to line 50. In EditPlus I highlight each real block and perform the same replacement block by block. I don't know whether sed can do them all at once. If not, designating each block and replacing within it is OK.

  2. The header of each sub-entry has a name+digit format, i.e. real1, real2 and so on. So I added \d+ in my attempt on CentOS, but it does not seem to work.

I know this is a very complex problem. I just hope sed can do the trick.

mkrieger1
Jonathan Zhou
  • Is there any relationship in the names? You'd need that, and to relate them to the right bin there has to be an equal number of sub-blocks. Otherwise, it's pointless. If the relationship has to do with the count, it's probably better to match each name of a block into separate arrays, then match them up based on _index_? –  Oct 30 '20 at 20:58
  • Are your `real\d+` and `\d+corr` lines together in file? – anubhava Oct 31 '20 at 05:53

3 Answers


I'm sure sed can do the trick, I just don't do sed very well ... how about an ugly awk script?

$ cat block
real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real2 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real3 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
512corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
513corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

Here's the script:

$ cat block.awk
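BEGIN{
    block=""    # name of the block seen on the previous line
}
{
    # Strip the digits around the letters of field 1: real2 -> real, 173corr -> corr
    # (gensub is a GNU awk function)
    newblock=gensub(/[0-9]*([a-z]+)[0-9]*/,"\\1","1",$1)
    if( newblock != block ){
        # New block: end the previous output line, then print the whole line, label included
        if(NR>1){print ""}
        for( i=1; i<=NF; i++){
            printf "%s ", $i
        }
    } else {
        # Same block: append only the values, skipping the label in field 1
        for( i=2; i<=NF; i++){
            printf "%s ", $i
        }
    }
    block=newblock
}
END{
    print ""    # terminate the last output line
}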

And the outcome:

$ awk -f block.awk block
real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
tink
  • I don't mind about sed or awk if it can work. :) but how can I test your solution on my end. I don't know awk script at all. Should I put your code in a .sh file? – Jonathan Zhou Oct 30 '20 at 21:03
  • No, put it in any file name, and invoke it like I did in `awk -f block.awk block` ... I just called it block.awk to make it match the task at hand. – tink Oct 30 '20 at 21:21
  • [root@localhost nn]# awk -f block.awk excel.log real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 real2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 – Jonathan Zhou Oct 30 '20 at 21:45
  • The lines aren't joined. – Jonathan Zhou Oct 30 '20 at 21:45
  • @JonathanZhou seems to think he must pass the complete block as parameters to the command. (That would have been possible with `echo "real1 0.5 .... 0.5" | awk ...`.) @tink's solution starts by showing a file named `block`, and his command needs the filename as the last parameter. – Walter A Oct 31 '20 at 18:32
  • I'm not sure, @WalterA ... my suspicion is that Jonathan's actual files' content doesn't match the sample data supplied. – tink Oct 31 '20 at 20:59
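
One way to check that suspicion is to list how the first-column labels group in the real file. The following is only a rough sketch: `show_runs` is a made-up helper name, and it assumes the same rule the answers use (the block name is the first field with all digits removed). It prints each run of consecutive lines that share a block name, together with the run length.

```shell
# Hypothetical helper: print "<block name> <number of consecutive lines>" per run.
show_runs() {
  awk '{
    key = $1
    gsub(/[0-9]/, "", key)            # block name = first field without digits
    if (NR > 1 && key != prev) {      # the run ended on the previous line
      print prev, n
      n = 0
    }
    prev = key
    n++
  }
  END { if (NR > 0) print prev, n }' "$@"
}

# Example call on inline data; pass a file name instead to inspect a real file.
printf 'real1 a\nreal2 b\n173corr c\nreal3 d\n' | show_runs
```

If the same block name shows up in more than one run, or with a run length of 1 where several lines were expected, the file is laid out differently from the sample in the question.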

This might work for you (GNU sed):

sed -E ':a;N;s/^(.*(real|corr).*)\n.*\2\S*/\1/;ta;P;D' file

Print lines as normal until one contains either real or corr, then append the following lines that share that key, removing each newline and the label at the start of the appended line. When the key changes, print the accumulated line and start again with the new key.
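
On the `\d+` attempt mentioned in the question: sed's regular expressions have no `\d` shorthand, so with `-E` the digits are usually written as `[0-9]+`. A minimal sketch, where `strip_real_label` is a made-up helper name and the single-space separator is an assumption taken from the sample data:

```shell
# Hypothetical helper: remove a leading "real<digits> " label from each line.
strip_real_label() {
  sed -E 's/^real[0-9]+ //' "$@"
}

# Example call on inline data; pass a file name instead to process a file.
printf 'real12 0.5 1\n' | strip_real_label
```

This only shows the digit matching. It does not join lines by itself.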

anubhava
potong

Let

real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real2 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
real3 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
512corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
513corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

be data.txt then

awk '{current=gensub(/[0-9]/, "", "g", $1);if(current==seen){acc=(acc gensub($1, "", 1))}else{if(NR>1){print acc};seen=current;acc=$0}}END{print acc}' data.txt

gives output:

real1 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
173corr 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

Explanation: I use two variables: seen holds the current category, where a category is the content of the first column with all digits removed, and acc accumulates the content of lines that share a category. For every line I compute its category. If it is the same as on the previous line, I append the content of the current line (sans the first column) to acc. Otherwise I print acc, set seen to the new category, and set acc to the whole current line. In END I print acc, as otherwise the content of the last category would be missing.
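
Since gensub is a GNU awk extension, here is a rough sketch of the same seen/acc idea using only `sub` and `gsub`, which other awk implementations also provide. The function name `join_blocks` is made up for illustration, and the sketch assumes that lines do not start with whitespace and that fields are separated by spaces or tabs.

```shell
# Hypothetical helper: join consecutive lines whose first field,
# with digits removed, is the same.
join_blocks() {
  awk '{
    key = $1
    gsub(/[0-9]/, "", key)          # category = first field without digits
    rest = $0
    sub(/^[^ \t]+/, "", rest)       # drop the label, keep the separator before the values
    if (NR > 1 && key == seen) {
      acc = acc rest                # same category: append the values only
    } else {
      if (NR > 1) print acc         # category changed: flush the previous line
      seen = key
      acc = $0
    }
  }
  END { if (NR > 0) print acc }' "$@"
}

# Example call on inline data; with a file it would be: join_blocks data.txt
printf 'real1 a b\nreal2 c d\n173corr e\n512corr f\n' | join_blocks
```

Because it compares each line only with the previous one, it handles any number of blocks in one pass, including several separate real blocks in the same file.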

(tested in GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0))

Daweo