
I have a tab-separated file with two columns, like this:

5 6 14 22 23 25 27 84 85 88 89 94 95 98 100             6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406   205 284 307 406
2 10 13 40 47 58                                        2 13 40 87

and the desired output should be

5 6 14 22 23 25 27 84 85 88 89 94 95 98 100             14 27
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406   6 209 299 305
2 10 13 23 40 47 58 87                                  10 23 40 58

I would like to replace the numbers in the 2nd column with random numbers taken from the 1st column, keeping the same count of numbers in the 2nd column. For example, if a row has four numbers in the 2nd column, the output for that row must have four random numbers from its 1st column, and so on.

I tried to create two arrays in AWK with split and to replace every number in the 2nd column with numbers from the 1st column, but the replacement is not random. I have seen the rand() function, but I don't know exactly how to combine these two things in a script. Is this possible in a BASH environment, or are there better ways to do it there? Thanks in advance.

  • What is the delimiter between col1 and col2? TAB? Or a fixed number of spaces? – TenG May 10 '19 at 11:06
  • It is a tab delimiter between col1 and col2, but it could be another delimiter if that makes the goal easier – Perceval Vellosillo Gonzalez May 10 '19 at 11:12
  • Delimiters we can see are always much easier to work with than delimiters we can't see. – Ed Morton May 10 '19 at 13:28
  • Can those randomly selected numbers repeat? In other words, is it sampling with or without replacement? – karakfa May 10 '19 at 14:31
  • Thanks @EdMorton, I will take that into account for other scripts. I will also check the links posted below in order to write the script correctly. They talk about the use of upper and lower case in bash scripts, which I didn't know about. Thanks again for the information – Perceval Vellosillo Gonzalez May 13 '19 at 14:23
  • The randomly selected numbers cannot be repeated @karakfa. In fact, a replacement must always take place, and the previous numbers in col2 cannot appear after the random process, e.g. the number 6 in col2 of the first row cannot appear in the output file – Perceval Vellosillo Gonzalez May 13 '19 at 14:30
  • So, it's not *random* random. A better term to use is **sampling without replacement** – karakfa May 13 '19 at 14:43
  • Thanks @karakfa. In summary, I need to randomly select "x" numbers from col1 to replace those in col2, "x" being the count of numbers in col2. However, a replacement number cannot be one that was previously present in col2 – Perceval Vellosillo Gonzalez May 13 '19 at 15:05
  • I got it, see my answer below. – karakfa May 13 '19 at 15:52

3 Answers


Assuming that there is a tab delimiting the two columns, and each column is a space delimited list:

awk 'BEGIN{srand()} 
    {n=split($1,a," "); 
    m=split($2,b," "); 
    printf "%s\t",$1; 
    for (i=1;i<=m;i++) 
        printf "%d%c", a[int(rand() * n) +1], (i == m) ? "\n" : " "
    }' FS=\\t input
William Pursell

awk to the rescue!

$ awk -F'\t' 'function shuf(a,n)
                 {for(i=1;i<n;i++)
                    {j=i+int(rand()*(n+1-i));
                     t=a[i]; a[i]=a[j]; a[j]=t}}
             function join(a,n,x,s)
                  {for(i=1;i<=n;i++) {x=x s a[i]; s=" "}
                   return x}
             BEGIN{srand()}
                  {an=split($1,a," ");
                   shuf(a,an);
                   bn=split($2,b," ");
                   delete m; delete c; j=0;
                   for(i=1;i<=bn;i++) m[b[i]];
                   # pull elements from a up to the required sample size,
                   # skipping any that appear in the previous sample set
                   for(i=1;i<=an && j<bn;i++) if(!(a[i] in m)) c[++j]=a[i];
                   cn=asort(c);
                   print $1 FS join(c,cn)}' file


5 6 14 22 23 25 27 84 85 88 89 94 95 98 100     85 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406   20 205 294 295
2 10 13 23 40 47 58 87  10 13 47 87

This shuffles the input array (the standard Fisher-Yates algorithm) and takes the required number of elements from the front, with the additional requirement that the sample must not intersect the existing sample set. The helper map m holds the existing sample set and is used for the in tests. Note that asort is a GNU awk extension. The rest should be easy to read.
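To look at the shuffle-and-take-first-k idea in isolation, here is a minimal sketch for a single row. It is only an illustration under assumptions: sample_k is a hypothetical helper name, the numbers are passed as space-separated strings, and the result is left in shuffled order rather than sorted. It avoids asort, so it should run in any POSIX awk.

```shell
# Sketch only: sample_k is a hypothetical helper, not part of the script above.
# Usage: sample_k K "population" ["excluded"]
sample_k() {
  awk -v k="$1" -v pop="$2" -v excl="$3" '
    BEGIN {
      srand()                       # seeded from the clock (one-second resolution)
      k = k + 0; got = 0
      n = split(pop, a, " ")
      m = split(excl, e, " ")
      for (i = 1; i <= m; i++) skip[e[i]]
      # Fisher-Yates shuffle: swap a[i] with a random a[j], where i <= j <= n
      for (i = 1; i < n; i++) {
        j = i + int(rand() * (n + 1 - i))
        t = a[i]; a[i] = a[j]; a[j] = t
      }
      # Walk the shuffled array, keeping the first k values not in skip
      for (i = 1; i <= n && got < k; i++)
        if (!(a[i] in skip)) printf "%s%s", (got++ ? " " : ""), a[i]
      print ""
    }'
}

# Example: two numbers from the first list, never 6 or 94
sample_k 2 "5 6 14 22 23 25 27 84 85 88 89 94 95 98 100" "6 94"
```

Because every element ends up in a uniformly random position, the first k survivors form a sample without replacement. If fewer than k candidates remain after exclusion, the sketch simply prints all of them.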

karakfa
  • Wow! It works correctly, @karakfa. I don't fully understand the 1st part of the script. I understand the 2nd part more or less (after BEGIN): seed the rand function and create the arrays for both columns, then delete m, go through the 2nd column and store its values in m. Then go through the 1st column, with j bounded by the length of the 2nd column array. If an element of a is not in the 2nd column, add it, then sort, and this avoids duplicates, right? However, the first part is more complex. I should read it in more detail and read about 'shuf', because I don't understand it yet. Anyway, thanks so much, it works correctly! – Perceval Vellosillo Gonzalez May 13 '19 at 16:29
  • That's the shuffle algorithm. When you need a sample without replacement covering a non-trivial percentage of the population, shuffling and picking the first **k** elements is easier than *rejection sampling*. If you need only a few elements, the latter is better. The first part defines two helper functions, for shuffling and for joining the array back into string form. – karakfa May 13 '19 at 16:32
  • Thanks again! I will try the shuf function in a simpler script based on yours in order to better understand the 1st part. – Perceval Vellosillo Gonzalez May 13 '19 at 16:43
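For comparison with the shuffle in this answer, the rejection sampling mentioned in the comments can be sketched roughly as follows. This is an illustration only: reject_k is a hypothetical helper name, and capping k at the number of drawable values is an assumption added here so the loop always terminates.

```shell
# Sketch only: reject_k is a hypothetical helper name.
# Usage: reject_k K "population" ["excluded"]
reject_k() {
  awk -v k="$1" -v pop="$2" -v excl="$3" '
    BEGIN {
      srand()
      k = k + 0; got = 0; avail = 0
      n = split(pop, a, " ")
      m = split(excl, e, " ")
      for (i = 1; i <= m; i++) seen[e[i]] = 1
      # Count the distinct values that can still be drawn and cap k there,
      # otherwise the loop below would never end
      for (i = 1; i <= n; i++)
        if (!(a[i] in seen) && !(a[i] in uniq)) { uniq[a[i]]; avail++ }
      if (k > avail) k = avail
      while (got < k) {
        v = a[1 + int(rand() * n)]   # draw a random element
        if (v in seen) continue      # reject: excluded or already drawn
        seen[v] = 1
        printf "%s%s", (got++ ? " " : ""), v
      }
      print ""
    }'
}

reject_k 2 "5 6 14 22 23 25 27 84 85 88 89 94 95 98 100" "6 94"
```

When k is small relative to the population, most draws are accepted and no full shuffle is needed. As k approaches the population size, more and more draws are rejected, which is where shuffling becomes the simpler choice.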

Try this:

# This can be an external file of course
# Note COL1 and COL2 are separated by a hard TAB (written as \t below)

printf '%s\t%s\n' \
   '5 6 14 22 23 25 27 84 85 88 89 94 95 98 100' '6 94' \
   '6 8 17 20 193 205 209 284 294 295 299 304 305 307 406' '205 284 307 406' \
   '2 10 13 40 47 58' '2 13 40 87' > d1.txt

# Loop to read each line; note we convert TAB to ":", though could have used IFS

tr '\t' ':' < d1.txt | while read -r LINE
do
   # Get the 1st column data

   COL1=$( echo ${LINE} | cut -d':' -f1 )

   # Get col1 number of items

   NUM_COL1=$( echo ${COL1} | wc -w )

   # Get col2 number of items

   NUM_COL2=$( echo ${LINE} | cut -d':' -f2 | wc -w )

   # Now split col1 items into an array

   read -r -a COL1_NUMS <<< "${COL1}"


   COL2=" "

   # This loop runs until one number has been picked for each COL2 item

   COUNT=0
   while [ ${COUNT} -lt ${NUM_COL2} ]
   do

      # Generate a random number to use as the random index for COL1

      COL1_IDX=${RANDOM}
      let "COL1_IDX %= ${NUM_COL1}"

      NEW_NUM=${COL1_NUMS[${COL1_IDX}]}

      # Check for duplicate (-w matches whole words only, so 6 does not match 16)

      DUP_FOUND=$( echo "${COL2}" | grep -w -- "${NEW_NUM}" )

      if [ -z "${DUP_FOUND}" ]
      then
         # Not a duplicate, increment loop counter and do next one

         let "COUNT = COUNT + 1 "

         # Add the random COL1 item to COL2

         COL2="${COL2} ${COL1_NUMS[${COL1_IDX}]}"
      fi
   done

   # Sort COL2

   COL2=$( echo ${COL2} | tr ' ' '\012' | sort -n | tr '\012' ' ' )

   # Print

   echo ${COL1} :: ${COL2}
done

Output:

5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 :: 88 95
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 :: 20 299 304 305
2 10 13 40 47 58 :: 2 10 40 58
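The loop above avoids repeats within the new column 2 but does not exclude the numbers that were in column 2 before. A hedged sketch of one way to add that exclusion, assuming GNU shuf is installed and the input is TAB-separated with single spaces inside each column; resample_col2 is a hypothetical function name:

```shell
# Sketch only: resample_col2 is a hypothetical helper; requires GNU shuf.
# Reads TAB-separated lines on stdin, writes TAB-separated lines on stdout.
resample_col2() {
  tab=$(printf '\t')
  while IFS="$tab" read -r col1 col2; do
    # How many numbers the old column 2 holds
    k=$(( $(printf '%s\n' "$col2" | wc -w) ))
    # One number per line, drop the old column-2 values,
    # pick k of the rest at random, then sort and re-join with spaces
    new=$(printf '%s\n' "$col1" | tr ' ' '\n' |
          awk -v old="$col2" '
            BEGIN { n = split(old, o, " "); for (i = 1; i <= n; i++) skip[o[i]] }
            !($1 in skip)' |
          shuf -n "$k" | sort -n | paste -sd' ' -)
    printf '%s\t%s\n' "$col1" "$new"
  done
}

# Usage: resample_col2 < d1.txt
```

shuf -n picks lines without repeating any of them, so no separate duplicate check is needed. If fewer than k candidates remain after the exclusion, the row just gets all of them.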
TenG
  • Added an amendment to sort the COL2 numbers. I'm guessing duplicates should be avoided, right? – TenG May 10 '19 at 11:44
  • Updated above to remove duplicates from COL2. – TenG May 10 '19 at 11:54
  • Right, duplicates should be avoided. My original file has two preceding columns, so the sets of numbers correspond to col3 and col4, but I have changed the script to handle these two columns and to replace \t with : globally, and it apparently works correctly. Thanks so much @TenG – Perceval Vellosillo Gonzalez May 10 '19 at 12:35
  • See http://porkmail.org/era/unix/award.html, https://unix.stackexchange.com/q/169716/133219, https://mywiki.wooledge.org/Quotes, and https://stackoverflow.com/q/673055/1745001 for some of the problems with that script. Don't do it. – Ed Morton May 10 '19 at 13:32
  • @TenG Thanks again for the script. However, I need to randomly select "x" numbers from col1 to replace those in col2, "x" being the count of numbers in col2, and the replacement numbers cannot be ones that were previously present in col2. E.g. if the number 6 appears in row1 col2, it cannot appear in the output. Only two other numbers present in row1 col1, different from 6 and 94, can appear. I think I did not explain this completely yesterday – Perceval Vellosillo Gonzalez May 13 '19 at 15:21
  • I don't understand what you're asking. Sorry. Maybe an example showing input and expected output would help. – TenG May 13 '19 at 21:41