
I have several experiments, each with several replicate files. I want to list the paths of these replicate files in text files, one output file per ID, in the following way.

Let's say there are 3 experiments and each experiment has 2 replicate files (there can be more experiments and replicates than this):

/home/data/study1/EXP1_30/EXP1_replicate_1_30.txt
/home/data/study1/EXP1_30/EXP1_replicate_2_30.txt
/home/data/study1/EXP1_60/EXP1_replicate_1_60.txt
/home/data/study1/EXP1_60/EXP1_replicate_2_60.txt
/home/data/study1/EXP2_30/EXP2_replicate_1_30.txt
/home/data/study1/EXP2_30/EXP2_replicate_2_30.txt
/home/data/study1/EXP2_60/EXP2_replicate_1_60.txt
/home/data/study1/EXP2_60/EXP2_replicate_2_60.txt
/home/data/study1/EXP3_30/EXP3_replicate_1_30.txt
/home/data/study1/EXP3_30/EXP3_replicate_2_30.txt
/home/data/study1/EXP3_60/EXP3_replicate_1_60.txt
/home/data/study1/EXP3_60/EXP3_replicate_2_60.txt

The output file1.txt should look like this:

/home/data/study1/EXP1/EXP1_replicate_1_30.txt,/home/data/study1/EXP1/EXP1_replicate_2_30.txt \
/home/data/study1/EXP2/EXP2_replicate_1_30.txt,/home/data/study1/EXP2/EXP2_replicate_2_30.txt \
/home/data/study1/EXP3/EXP3_replicate_1_30.txt,/home/data/study1/EXP3/EXP3_replicate_2_30.txt

The output file2.txt should look like this:

/home/data/study1/EXP1/EXP1_replicate_1_60.txt,/home/data/study1/EXP1/EXP1_replicate_2_60.txt \
/home/data/study1/EXP2/EXP2_replicate_1_60.txt,/home/data/study1/EXP2/EXP2_replicate_2_60.txt \
/home/data/study1/EXP3/EXP3_replicate_1_60.txt,/home/data/study1/EXP3/EXP3_replicate_2_60.txt

....

My code with for loops:

ID=(30 60)
exp=("EXP1" "EXP2" "EXP3")

d=""
for  txtfile in /home/data/study1/${exp[0]}/${exp[0]}*_${ID[0]}.txt
do
    printf "%s%s" "$d" "$txtfile" 
    d=","
done
printf " \\" 
printf "\n" 

d=""
for txtfile in /home/data/study1/${exp[1]}/${exp[1]}*_${ID[0]}.txt
do

    printf "%s%s" "$d" "$txtfile" 
    d=","
done
printf " \\" 
printf "\n" 

d=""
for txtfile in /home/data/study1/${exp[2]}/${exp[2]}*_${ID[0]}.txt
do

    printf "%s%s" "$d" "$txtfile" 
    d=","
done          

I am writing a separate for loop with hard-coded index numbers for each experiment and ID, which is very time-consuming. Is there an easier way?

Tom Fenech
hash
  • If you want that output, why did your experiments output those files in the first place? – hek2mgl Aug 07 '14 at 12:50
  • @hek2mgl those output files are coming from another pipeline and I have to process all the files together based on their IDs in this particular format – hash Aug 07 '14 at 12:56
  • I will never understand why scientific programs produce output which can't be used by the scientist, unless being post-processed. – hek2mgl Aug 07 '14 at 13:00
  • Can't you (or somebody else) change the process's output in order to produce files which can be read easily by many different applications? Can't you store results in a database? At least the latter should be possible, given the information in the question. – hek2mgl Aug 07 '14 at 13:09
  • @hek2mgl No, it's a well-known pipeline which is used by many other scientists, but the research study I am dealing with requires me to process the data differently. That's why I cannot change the way the pipeline is implemented. – hash Aug 07 '14 at 13:17
  • Could you post what you're using at the moment? It sounds like it might be a perfectly valid approach. – Tom Fenech Aug 07 '14 at 13:47
  • @TomFenech Now I have mentioned the code above – hash Aug 07 '14 at 17:45

3 Answers


I think that this does what you want:

#!/bin/bash

ids=( 30 60 )
dir=/home/data/study1

# join arguments on commas, then append a space and a backslash
# modified from http://stackoverflow.com/a/3436177/2088135
join() { local IFS=,; echo "$* "'\'; } #' <- to fix syntax highlighting

i=0
for id in "${ids[@]}"; do
    s=$(for exp in "$dir"/EXP*"$id"; do join "$exp/"*"$id".txt; done)
    # trim off the final backslash (the last character) and output to file
    echo "${s%?}" > file$((++i)).txt
done

Output (note that when testing, I set dir=.):

$ cat file1.txt 
./EXP1_30/EXP1_replicate_1_30.txt,./EXP1_30/EXP1_replicate_2_30.txt \
./EXP2_30/EXP2_replicate_1_30.txt,./EXP2_30/EXP2_replicate_2_30.txt \
./EXP3_30/EXP3_replicate_1_30.txt,./EXP3_30/EXP3_replicate_2_30.txt 
$ cat file2.txt 
./EXP1_60/EXP1_replicate_1_60.txt,./EXP1_60/EXP1_replicate_2_60.txt \
./EXP2_60/EXP2_replicate_1_60.txt,./EXP2_60/EXP2_replicate_2_60.txt \
./EXP3_60/EXP3_replicate_1_60.txt,./EXP3_60/EXP3_replicate_2_60.txt
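
As a variation, the delimiter trick from the question (start with an empty separator, switch it after the first item) also works one level up, for the line separator, so nothing has to be trimmed afterwards. This is only a sketch, not part of the answer above: it builds a throwaway copy of the directory layout with mktemp so that it can be run anywhere, and the variable names linesep, expdir and out are made up for illustration.

```bash
#!/bin/bash
# Throwaway copy of the layout from the question
# (assumed here: 3 experiments, 2 IDs, 2 replicates each).
dir=$(mktemp -d)
for e in EXP1 EXP2 EXP3; do
    for id in 30 60; do
        mkdir -p "$dir/${e}_$id"
        for r in 1 2; do
            touch "$dir/${e}_$id/${e}_replicate_${r}_$id.txt"
        done
    done
done

ids=( 30 60 )
i=0
for id in "${ids[@]}"; do
    out="$dir/file$((++i)).txt"
    linesep=""                      # nothing is printed before the first line
    {
        for expdir in "$dir"/EXP*_"$id"; do
            printf '%s' "$linesep"
            d=""                    # same comma trick as in the question
            for txtfile in "$expdir"/*_"$id".txt; do
                printf '%s%s' "$d" "$txtfile"
                d=","
            done
            linesep=$' \\\n'        # space, backslash, newline between lines
        done
        printf '\n'                 # plain newline after the last line
    } > "$out"
done
cat "$dir/file1.txt"
```

Adding experiments or replicates needs no code change, because the globs pick them up. Note that globs sort lexicographically, so replicate_10 would come before replicate_2.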
Tom Fenech

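You can use the following bash script. It expects the paths listed one per line in files.txt, sorts them by ID (the fifth underscore-separated field), and concatenates the contents of each group of six files into 1.txt, 2.txt, and so on: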

#!/bin/bash 

i=0; n=0; files=""
sort -t_ -k5 files.txt | while read line ; do
    files="$files $line"
    i=$((i+1))
    if [ $((i%6)) -eq 0 ] ; then
        n=$((n+1))
        cat $files > "$n.txt"
        files=""
    fi
done
hek2mgl
  • I guess you have opted for the more portable approach but just in case you weren't already aware (and potentially for the benefit of the OP), bash permits you to use `files+="$line"` `(( i += 1))` or `((++i))`, `(( i % 6 == 0 ))`, etc. – Tom Fenech Aug 07 '14 at 14:09
  • Yes, I had portability in mind when giving the answer – hek2mgl Aug 07 '14 at 14:18
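
For reference, the bash-specific forms mentioned in the comment above can be sketched as follows. The file names are made up for illustration, and the sketch only prints each group instead of concatenating files. It appends " $line" with a leading space so that the names stay separated.

```bash
#!/bin/bash
# Hypothetical input: 12 made-up file names instead of files.txt.
i=0; n=0; files=""
for line in path{1..12}.txt; do
    files+=" $line"              # bash string append, replaces files="$files $line"
    (( ++i ))                    # replaces i=$((i+1))
    if (( i % 6 == 0 )); then    # replaces [ $((i%6)) -eq 0 ]
        (( ++n ))
        echo "group $n:$files"
        files=""
    fi
done
```

These forms are not available in a plain POSIX sh, which is the portability trade-off discussed in the comments.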

You can also make use of a subshell and do it from the command line (with your list of paths in dat/experiment.txt):

$ ( first=0; cnt=0; grep 30 dat/experiment.txt | sort | while read line; do \
[ "$first" = 0 ] && first=1 || { [ "$cnt" = 0 ] && echo ' \'; }; echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && cnt=0; done; \
echo "" ) >outfile1.txt

$ ( first=0; cnt=0; grep 60 dat/experiment.txt | sort | while read line; do \
[ "$first" = 0 ] && first=1 || { [ "$cnt" = 0 ] && echo ' \'; }; echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && cnt=0; done; \
echo "" ) >outfile2.txt

Admittedly, the one-liner ended up longer than anticipated, because it matches your line continuations exactly. If you omit the line continuations in the output files, it reduces to (e.g.):

$ (cnt=0; grep 30 dat/experiment.txt | sort | while read line; do echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && echo "" && cnt=0; \
done ) >outfile1.txt

output:

$ cat outfile1.txt
/home/data/study1/EXP1_30/EXP1_replicate_1_30.txt,/home/data/study1/EXP1_30/EXP1_replicate_2_30.txt \
/home/data/study1/EXP2_30/EXP2_replicate_1_30.txt,/home/data/study1/EXP2_30/EXP2_replicate_2_30.txt \
/home/data/study1/EXP3_30/EXP3_replicate_1_30.txt,/home/data/study1/EXP3_30/EXP3_replicate_2_30.txt

$ cat outfile2.txt
/home/data/study1/EXP1_60/EXP1_replicate_1_60.txt,/home/data/study1/EXP1_60/EXP1_replicate_2_60.txt \
/home/data/study1/EXP2_60/EXP2_replicate_1_60.txt,/home/data/study1/EXP2_60/EXP2_replicate_2_60.txt \
/home/data/study1/EXP3_60/EXP3_replicate_1_60.txt,/home/data/study1/EXP3_60/EXP3_replicate_2_60.txt
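
The pairing and the continuation marks can also be sketched with standard tools: `paste -d, - -` joins every two input lines with a comma, and the sed expression `$!s/$/ \\/` appends a space and a backslash to every line except the last. This sketch assumes exactly two replicates per experiment. The helper name make_list is hypothetical, and a temporary file stands in for dat/experiment.txt so that the example runs on its own.

```bash
#!/bin/bash
# Temporary stand-in for dat/experiment.txt, filled with the paths from the question.
list=$(mktemp)
for e in EXP1 EXP2 EXP3; do
    for id in 30 60; do
        for r in 1 2; do
            echo "/home/data/study1/${e}_$id/${e}_replicate_${r}_$id.txt"
        done
    done
done > "$list"

# make_list ID: hypothetical helper, assumes two replicates per experiment.
make_list() {
    grep "_$1\.txt\$" "$list" |   # keep only the paths ending in _ID.txt
        sort |
        paste -d, - - |           # join each pair of lines with a comma
        sed '$!s/$/ \\/'          # append " \" to all lines but the last
}

out30=$(make_list 30)
out60=$(make_list 60)
printf '%s\n' "$out30"
```

With more replicates per experiment, one `-` would have to be added to the paste call for each extra replicate.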
David C. Rankin