1

I have a text file with repeating data patterns, and grep keeps returning all matches instead of stopping after each one.

for ((count = 1; count !=17; count++)); do       # 17 times 
 xuz1[count]=`grep -e "1  O1" $out_file | cut -c10-29`    
 xuz2[count]=`grep -e "2  O2" $out_file | cut -c10-29`
 xuz3[count]=`grep -e "3  O3" $out_file | cut -c10-29`

 echo ${xuz1[count]}
 echo ${xuz2[count]}
 echo ${xuz3[count]}
done

The data looks like:

some text.....
Text....
 .....
1  O1    111111 111111 111111
2  O2    222211 222211 222211
3  O3    643653 652346 757686
some text.....
1  O1    111122 111122 111122
2  O2    222222 222222 222222
3  O3    343653 652346 757683
some text.....
1  O1    111333 111333 111333
2  O2    222333 222333 222333
3  O3    343653 652346 757684
.
.
.

And the result I'm getting:

  xuz1[1] = 111111 111111 111111  
  xuz2[1] = 222211 222211 222211 
  xuz3[1] = 643653 652346 757686

  xuz1[2] = 111111 111111 111111  
  xuz2[2] = 222211 222211 222211 
  xuz3[2] = 643653 652346 757686        

...

I'm looking for a result like this:

 xuz1[1]=111111 111111 111111 
 xuz2[1]=222211 222211 222211
 xuz3[1]=343653 652346 757683

 xuz1[2]=111122 111122 111122 
 xuz2[2]=222222 222222 222222 
 xuz3[2]=343653 652346 757684

I also tried "grep -m 1 -e". Which way should I go?

For now I ended up with one line:
grep -A4 -e "1  O1" "$out_file" | cut -c10-29

The "some text....." parts are huge blocks of text.

  • 2
    The immediate problem is [quoting](http://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-variable) but it looks like you should do a single pass over the file with Awk instead. – tripleee Aug 21 '16 at 12:45
  • pluse-uno for small sample data, required output, current output and ... gasp, some code! This a model Q for shell scripting of a certain problem domain! Keep posting and Good luck! – shellter Aug 21 '16 at 18:12
  • Thanks, I'll try it – Igor Pavlenko Aug 22 '16 at 06:19
  • Why doesn't grep read the next pattern? It just keeps returning to the beginning of the file. – Igor Pavlenko Aug 22 '16 at 07:55
  • I don't see why you would expect a new `grep` command to know where a previous `grep` command found a match. They don't communicate (and even if they could, not searching from the beginning of the file if you have grepped the same file before would be highly annoying most of the time). – tripleee Aug 23 '16 at 06:16
  • grep -A4 works best: it gets 4 lines and then I put them in a temporary txt file, where I can sort them later. – Igor Pavlenko Sep 10 '16 at 02:32
  • Nothing wrong with your question, just like to point out that problems like this should raise flags to **not use bash**. It's very hard to read, not unit-testable and all in all almost impossible to maintain. Other languages like Python or Ruby produce much more readable & testable solutions. – Jan Groth Sep 11 '16 at 06:00
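
To illustrate the point made in the comments above: every grep invocation scans the file from the top, so the loop in the question stores all matches in each array element. A minimal sketch, assuming Bash 4+ (for mapfile) and a hypothetical sample file shaped like the data in the question, runs each grep once and lets each matching line become one array element:

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the data in the question.
out_file=$(mktemp)
cat >"$out_file" <<'EOF'
some text.....
1  O1    111111 111111 111111
2  O2    222211 222211 222211
3  O3    643653 652346 757686
some text.....
1  O1    111122 111122 111122
2  O2    222222 222222 222222
3  O3    343653 652346 757683
EOF

# Run each grep once; mapfile stores one matching line per array element.
mapfile -t xuz1 < <(grep -e "1  O1" "$out_file" | cut -c10-29)
mapfile -t xuz2 < <(grep -e "2  O2" "$out_file" | cut -c10-29)
mapfile -t xuz3 < <(grep -e "3  O3" "$out_file" | cut -c10-29)

# mapfile fills arrays from index 0, so block k lives at index k-1.
for i in "${!xuz1[@]}"; do
  echo "xuz1[$((i + 1))]=${xuz1[i]}"
  echo "xuz2[$((i + 1))]=${xuz2[i]}"
  echo "xuz3[$((i + 1))]=${xuz3[i]}"
done

rm -f "$out_file"
```

This still reads the file three times; a single Awk pass avoids that, but the sketch stays close to the original grep and cut approach.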

3 Answers

2

A little bash script with a single grep is enough

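# Reads the data on standard input and keeps only the data lines,
# e.g. "1  O1    111111 111111 111111"
grep -E '^[0-9]+ +O[0-9]+ +.*'|
while read -r idx oidx cols; do
  if ((idx == 1)); then   # row number 1 starts a new block
    let ++i
    name=xuz$i
    let j=1
  fi
  echo "$name[$j]=$cols"
  let ++j
done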
  • Any other way to do it without Perl? I don't know what they have on the server, and I want to keep it as bash to make sure it will work as a shell script. – Igor Pavlenko Aug 22 '16 at 06:20
  • it was never perl - just bash and previous solution used grep -P which has been dropped – pakistanprogrammerclub Aug 22 '16 at 12:11
  • 1
    according to GNU.ORG grep -P " Interpret the pattern as a Perl-compatible regular expression (PCRE). This is highly experimental and ‘grep -P’ may warn of unimplemented features." – Igor Pavlenko Sep 10 '16 at 02:27
0

You haven't really described what you want, but I guess something like this.

awk '! /^[1-9][0-9]*  O[0-9] / { n++; m=0; if (NR>1) print ""; next }
    { print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"

If the regex doesn't match, we assume we are looking at one of the "some text" pieces, and that this starts a new record. Increment n and reset m. Otherwise, print the output for this item within this record.

If some text could be more than one line, you will need a minor change, but I hope this should be enough at least to send you in the right direction.
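
One possible shape for that minor change, as a sketch only: start a new record when a data line follows non-data lines, instead of on every non-data line. The `in_block` flag and the sample file are assumptions made for illustration, and the blank separator line of the original script is left out.

```shell
#!/usr/bin/env bash
# Hypothetical sample with multi-line "some text" parts between blocks.
sample=$(mktemp)
cat >"$sample" <<'EOF'
some text.....
Text....
1  O1    111111 111111 111111
2  O2    222211 222211 222211
some text.....
more text
1  O1    111122 111122 111122
2  O2    222222 222222 222222
EOF

# n counts records, m counts items within a record.
# in_block remembers whether the previous line was a data line.
result=$(awk '
  /^[1-9][0-9]*  O[0-9]+ / {
    if (!in_block) { n++; m = 0; in_block = 1 }
    m++
    print "xuz" m "[" n "]=" substr($0, 10)
    next
  }
  { in_block = 0 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```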

You can do this in pure Bash, too, though this is going to be highly inefficient - you would expect a Bash while read loop to be at least a hundred times slower than Awk, and the code is markedly less idiomatic and elegant.

while read -r m x result; do
  case $m::$x in
    [1-9]::O[1-9])
      # Data line: m is the item number, n the current record number
      printf 'xuz%d[%d]=%s\n' "$m" "$n" "$result";;
    *)
      # If n is unset, don't print an empty line
      [ -z "${n+set}" ] || echo
      let n++;;
  esac
done <"$out_file"

I would aggressively challenge any requirement to do this in pure Bash. If it's for homework, the requirement is unrealistic, and a core skill for shell script authors is to understand the limits of the shell and the strengths of the common support tools like Awk. The Awk language is virtually guaranteed to be available wherever you have a shell, in particular a heavy shell like Bash. (In a limited environment, e.g. an embedded one, a limited shell like Dash would make more sense. Then e.g. the let builtin won't be available, though it should not be hard to make this script properly portable.)

The case statement accepts glob patterns, not regular expressions, so the pattern here is slightly less general (we accept one positive digit in the first field).
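
If the more general match is wanted in Bash itself, the `[[ ... =~ ... ]]` operator takes an extended regular expression. The sketch below contrasts the two; `glob_match` and `is_data_line` are hypothetical helper names, not part of the script above.

```shell
#!/usr/bin/env bash
# Glob version, as in the case statement: one digit per field only.
glob_match() {
  case $1::$2 in
    [1-9]::O[1-9]) return 0;;
    *) return 1;;
  esac
}

# Regex version: accepts multi-digit row numbers and atom indices.
is_data_line() {
  local re='^[1-9][0-9]*::O[0-9]+$'
  [[ "$1::$2" =~ $re ]]
}

for pair in "3 O3" "12 O12" "some text"; do
  set -- $pair   # split the pair into $1 and $2
  if glob_match "$1" "$2"; then g=yes; else g=no; fi
  if is_data_line "$1" "$2"; then r=yes; else r=no; fi
  echo "$pair: glob=$g regex=$r"
done
```

Keeping the regex in a variable avoids quoting surprises on the right-hand side of `=~`.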

tripleee
  • That's a highly suspicious requirement, but see update just now. (Unfortunately, not in a place where I can test; be prepared for minor typos or syntax fibs.) – tripleee Sep 10 '16 at 07:44
0

Thank you all for participating in the discussion.

This is a home project to help my wife extract data from research calculations; the speed-up is around 400 times.

The file I extract data from contains around 2000 lines. The data blocks I need look like this, and they are repeated 10-20 times in the file.

 uiyououy COORDINATES

 NR  ATOM    CCCCC       X              Y              Z

   1  O1      8.00    0.000000000    0.882236820   -0.789494235
   2  O2      8.00    0.000000000   -1.218250722   -1.644061652
   3  O3      8.00    0.000000000    1.218328524    0.400260050
   4  O4      8.00    0.000000000   -0.882314622    2.033295837

 Text text text text
 tons of text 

To extract the 4 lines I used the expression below:

# grab each "1  O1" line plus the 4 lines after it and write columns 23-64 to a txt file
grep -A4 --no-group-separator -e "1  O1" "$from_file" | cut -c23-64 >xyz_temp.txt
# delete empty lines from the xyz txt file
sed -i '/^[ \t]*$/d' xyz_temp.txt

The next step is to convert the strings into numbers (use '| bc -l' for the arithmetic):

while IFS= read -r line
do
 # break a line of xyz into 3 numbers
 IFS=' ' read -r -a arr_line <<< "$line"
 # some math conversion
 s1=$(echo "${arr_line[0]}" \* 0.529177249 | bc -l)
 s2=$(echo "${arr_line[1]}" \* 0.529177249 | bc -l)
 s3=$(echo "${arr_line[2]}" \* 0.529177249 | bc -l)

 #-------to array non sorted ------------
 arr[$n]=${n}";"${from_file}";"${gd_}";"${frt[count_4s]}";"${n4}";"${s1}";"${s2}";"${s3}
 echo ${arr[n]}
 #--------------------------------------------
done <"$from_file_txt"
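
As a side note, the three bc calls per line could probably be folded into a single awk pass over the file. This is only a sketch: the `xyz` temp file, its second (round-number) line, and the nine-decimal output format are assumptions for illustration, not part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical input: three whitespace-separated numbers per line.
xyz=$(mktemp)
cat >"$xyz" <<'EOF'
0.000000000    0.882236820   -0.789494235
1.000000000    2.000000000   -1.000000000
EOF

factor=0.529177249   # same conversion factor as in the loop above

# Multiply every column by the factor and join the results with ';'.
converted=$(awk -v f="$factor" \
  '{ printf "%.9f;%.9f;%.9f\n", $1 * f, $2 * f, $3 * f }' "$xyz")

printf '%s\n' "$converted"
rm -f "$xyz"
```

Starting one awk process instead of three bc processes per line should matter more as the file grows.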

Sort the array:

# -t ';' field separator, -k key column, -g general numeric sort
# -k4 -k5: sort by column 4, then by column 5
# with IFS set to newline, "${arr[*]}" feeds sort one element per line
IFS=$'\n' sorted=($(sort -t \; -k4 -k5 -g <<<"${arr[*]}"))
#printf "%s\n" "${sorted[*]}"
unset IFS
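
A variant of the same sort, sketched with made-up records: `-k4,4g -k5,5g` limits each key to a single column, and mapfile (Bash 4+) reads the result back without touching IFS. The sample values and the `LC_ALL=C` setting are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical records in the same ';'-separated layout.
arr=(
  "17;file.out;1.3;0.5;2;0"
  "18;file.out;1.3;0.4;2;0"
  "19;file.out;1.3;0.4;1;0"
)

# Sort numerically by column 4, then by column 5.
# LC_ALL=C keeps '.' as the decimal point regardless of the user's locale.
mapfile -t sorted < <(printf '%s\n' "${arr[@]}" | LC_ALL=C sort -t ';' -k4,4g -k5,5g)

printf '%s\n' "${sorted[@]}"
```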

The last part combines the data into the result view:

echo "$n"
n2=1
n42=1
count_4s2=1
i=0
echo "============================== sorted =============================="
################### loop for empty 4s lines

printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
printf "%s\n" "${sorted[i]}"
while [ $i -lt $((n-2)) ]
do
  i=$((i+1))
  if [ "$n42" = "4" ]                              #  1234
  then
    n42=0
    count_4s2=$((count_4s2+1))
    printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
    printf "%s\n"
  fi
  #--------------------------------------------
  n2=$((n2+1))
  n42=$((n42+1))
  printf "%s\n" "${sorted[i]}"
done ############# while
#00000000000000000000000000000000000000
printf "%s\n"
echo ==END===END===END==
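
The grouping logic above (a `;;;;;N;` header before every block of 4 records) could also be written with a modulo test. A minimal sketch, where `print_grouped` is a hypothetical helper and the short records are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a ";;;;;N;" header before every group of
# `size` records, then the records themselves.
print_grouped() {
  local size=$1
  shift
  local i=0 group=0 rec
  for rec in "$@"; do
    if (( i % size == 0 )); then
      group=$((group + 1))
      printf ';;;;;%d;\n' "$group"
    fi
    printf '%s\n' "$rec"
    i=$((i + 1))
  done
}

# Groups of 2 for brevity; the script above uses groups of 4.
out=$(print_grouped 2 a b c d e)
printf '%s\n' "$out"
```

With the real data the call would look something like `print_grouped 4 "${sorted[@]}"`.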

Output looks like this

============================== sorted ==============================
;;;;;1;
17;A-13_A1+.out;1.3;0.4;1;0;.221176355474853043;-.523049776514580244
18;A-13_A1+.out;1.3;0.4;2;0;-.550350051428402955;-.734584881824005358
19;A-13_A1+.out;1.3;0.4;3;0;.665269869069959489;.133910683627893251
20;A-13_A1+.out;1.3;0.4;4;0;-.336096173116409577;1.123723974181515102
;;;;;2;
13;A-13_A1+.out;1.3;0.45;1;0;.279265277182782148;-.504490787956469897
14;A-13_A1+.out;1.3;0.45;2;0;-.583907412327951988;-.759310392973448167
15;A-13_A1+.out;1.3;0.45;3;0;.662538493711206290;.146829200993661293
16;A-13_A1+.out;1.3;0.45;4;0;-.357896358566036450;1.116971979936256771
;;;;;3;
9;A-13_A1+.out;1.3;0.5;1;0;.339333719743262501;-.482029749553797105
10;A-13_A1+.out;1.3;0.5;2;0;-.612395507070451545;-.788968880150283253
11;A-13_A1+.out;1.3;0.5;3;0;.658674809217196345;.163289820251690233
12;A-13_A1+.out;1.3;0.5;4;0;-.385613021360830052;1.107708808923212876

==END===END===END==

*Note: some code might not be shown here.

The next step is to paste it into Excel with ; as the separator.