
I have a text file containing about a thousand lines of different lengths. I want to select N lines at random, save them in another file, and remove them from the original file. I've found some answers to this question, but most of them use a simple idea: sort the file and take the first or last N lines. Unfortunately that idea doesn't work for me, because I want to preserve the order of the lines. I tried this piece of code, but it's very slow and takes hours.

FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e  ${X}d ${FILEsrc};       #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done

I found this line to be the most time-consuming one in the code:

sed -i -e  ${X}d ${FILEsrc};

Is there any way to overcome this problem and make the code faster? Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?

Hakim
  • This could be done very fast in a real programming language instead of just a shell. – Denys Séguret Sep 10 '12 at 15:20
  • "Fast ways" include reading the appropriate contents into memory, manipulating, and writing once, or reading the line offsets, performing delta calculations, and writing once. Both are much simpler in other environments, and both take advantage of writing only once. –  Sep 10 '12 at 15:27
  • @dystroy: Define "real programming language". – Sebastian Mach Sep 10 '12 at 15:44
  • @phresnel: any language complete enough that you can use the standard API instead of looking for hacks and for the options of all the utilities (sed, awk, etc.) you may have on your computer. Both approaches work, but one does it in predictable development and execution time. Don't take it wrong: I'm interested in the ingenious ways to do it in a shell, but I'm just saying it may be easier and more effective to type 30 lines of a programming language. – Denys Séguret Sep 10 '12 at 15:59
  • @dystroy That's somewhat of a silly argument -- sed, awk, and other tools used in such a fashion can themselves be viewed as an API insofar as there is a defined exposed interface (and "standard" for many Linux distributions). And APIs can be specified as requirements .. of course I agree that it would likely be *easier* to use perl or ruby or python (all of which are programs that one "may have on [their] computer"). –  Sep 10 '12 at 16:21
  • I suppose my immediate jump to recommending another environment is that I do not know how to [effectively] use arrays in Bash, which are a fairly vital concept in solving problems like this and are used [in some fashion] in all the answers posted. I found [this tutorial on bash arrays](http://www.thegeekstuff.com/2010/06/bash-array-tutorial/). YMMV. –  Sep 10 '12 at 16:31
  • @dystroy: POSIX _is_ an API (http://pubs.opengroup.org/onlinepubs/9699919799/), and sed and awk are part of that API. Your definition does not convince me. – Sebastian Mach Sep 12 '12 at 13:00
  • Does this answer your question? [Select random lines from a file](https://stackoverflow.com/questions/9245638/select-random-lines-from-a-file) – rogerdpack Dec 02 '21 at 17:08

7 Answers


A simple O(n) algorithm is described in:

http://en.wikipedia.org/wiki/Reservoir_sampling

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done
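
For reference, here's a minimal sketch of the same algorithm (Algorithm R) in awk, assuming k=5 and the question's $FILEsrc. Note that it prints the sample in reservoir order rather than original file order, and it does not remove the lines from the input:

awk -v k=5 '
    BEGIN { srand() }
    NR <= k { R[NR] = $0; next }    # fill the reservoir with the first k lines
    {
        j = int(rand() * NR) + 1    # random index in 1..NR, inclusive
        if (j <= k) R[j] = $0       # replace with gradually decreasing probability
    }
    END { for (i = 1; i <= k; i++) print R[i] }
' "$FILEsrc"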
freestyler
  • here's the same [Algorithm R in Python (`reservoir_sample()` function)](http://stackoverflow.com/a/32792504/4279) – jfs Sep 26 '15 at 00:56

Generate all your offsets first, then make a single pass through the file. Assuming you have the desired line numbers in a file named offsets (one number per line), you can generate a single sed script like this:

sed "s!.*!&{w $FILEtrg\nd;}!" offsets

The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:

... | sed -i -f - "$FILEsrc"

Generating the offsets file is left as an exercise.
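
For example, if offsets contains the numbers 4 and 7 and $FILEtrg is out.txt (a stand-in name), the generated script would look like this:

4{w out.txt
d;}
7{w out.txt
d;}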

Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.

Here is a complete script, updated to use shuf (thanks @Thor!) to avoid possible duplicate random numbers.

#!/bin/sh

FILEsrc=$1
FILEtrg=$2
MaxLines=$3

# Add a line number to each input line
nl -ba "$FILEsrc" | 
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"
tripleee
  • Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this? – Hakim Sep 10 '12 at 16:07
  • Are you joking? The source for `sed` is available at http://savannah.gnu.org/projects/sed – tripleee Sep 10 '12 at 16:36
  • I meant the C++ implementation of the code that does what I want, not the source code of sed...!!! – Hakim Sep 10 '12 at 18:07
  • @Hakim First: try to formulate your question more precisely. Do you want a Bash or a C/C++ solution? There are many ways to do this, e.g. with bash, perl, python, c, c++ (substitute any language) - so which do you want? Second: stackoverflow is __not your personal programming service__! So, if you're in a hurry - hire a paid programmer. – clt60 Sep 10 '12 at 18:57

[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]

Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):

tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
    BEGIN { srand(); }
    # once the quota is filled, pass everything else straight to the leftover file
    maxLines==0 { print > "/dev/fd/3"; next }
    { if (rand() < maxLines/total--) {
        print; maxLines--;
      } else {
        print $0 > "/dev/fd/3"
      }
    }' "$FILEsrc" > "$FILEtrg" 3> "$tmp"
mv "$tmp" "$FILEsrc"

As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines; as you consume the input, you decrement total to increase that probability. In one extreme, the probability hits zero when maxLines does, so no further lines can be selected (they just get copied to the leftover file). In the other extreme, the probability reaches 1 once total drops to maxLines or below, and all remaining lines will be accepted. For example, with total=10 and maxLines=3, the first line is kept with probability 3/10; if it's skipped, the next is kept with probability 3/9, and so on.


Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:

FILEsrc=$1
FILEtrg=$2
MaxLines=$3

tmp=$(mktemp /tmp/XXXXXXXX)

total=$(wc -l < "$FILEsrc")
{
    while IFS= read -r line && (( MaxLines > 0 )); do
        (( MaxLines * 32768 > RANDOM * total-- )) || { printf '%s\n' "$line" >&3; continue; }
        (( MaxLines-- ))
        printf '%s\n' "$line"
    done
    cat >&3   # quota filled: everything else goes to the leftover file
} < "$FILEsrc" > "$FILEtrg" 3> "$tmp"
mv "$tmp" "$FILEsrc"
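
A quick sanity check for either version (file names here are just examples; assume the script above is saved as pick.sh):

seq 100 > input.txt
sh pick.sh input.txt sample.txt 10
wc -l sample.txt input.txt   # expect 10 in the sample and 90 left over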
chepner
  • You're still left with the task of removing the selected lines from the source file. If they are all unique, it's a simple `fgrep -vxf "$FILEtrg" "$FILEsrc"`, or you could inline it into the script. – tripleee Sep 10 '12 at 17:11

Here's a complete Go program:

package main

import (
    "bufio"
    "fmt"
    "log"
    "math/rand"
    "os"
    "sort"
    "time"
)

func main() {
    N := 10
    rand.Seed(time.Now().UTC().UnixNano())
    f, err := os.Open(os.Args[1]) // open the file
    if err != nil { // and tell the user if the file wasn't found or readable
        log.Fatal(err)
    }
    r := bufio.NewReader(f)
    var lines []string // this will contain all the lines of the file
    for {
        if line, err := r.ReadString('\n'); err == nil {
            lines = append(lines, line)
        } else {
            break
        }
    }
    // pick N distinct line indexes; rand.Perm avoids duplicate picks
    nums := rand.Perm(len(lines))[:N]
    sort.Ints(nums) // sort so the sampled lines keep their original order
    for _, n := range nums { // let's print the line
        fmt.Println(lines[n])
    }
}

Provided you put the Go file in a directory named randomlines in your GOPATH, you can build it like this:

go build randomlines

And then call it like this:

  ./randomlines "path_to_my_file"

This will print N (here 10) random lines from your file without changing their order. Note that it only prints the sample; it does not remove the lines from the source file. Of course it's near-instantaneous even with big files.

Denys Séguret
  • Can you tell us more about the algorithm you used? Especially for readers not familiar with Go? – Sebastian Mach Sep 10 '12 at 15:46
  • +1 not because I favor this approach, but to counter the inexplicable downvote. Voter, please explain? – tripleee Sep 10 '12 at 15:55
  • It just extracts the lines, builds an array of integers between 0 and the number of lines, sorts it, and uses this array to print the lines. The point of avoiding shell hacks is that you may easily change it as you wish. Your favorite language would do it as easily (java, python, etc.). – Denys Séguret Sep 10 '12 at 15:56

Here's an interesting two-pass option with coreutils, sed and awk:

n=5
total=$(wc -l < infile)

seq 1 $total | shuf | head -n $n                                           \
| sed 's/^/NR == /; $! s/$/ ||/'                                           \
| tr '\n' ' '                                                              \
| sed 's/.*/   &  { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile

A list of random numbers is passed to sed, which generates an awk script. If awk were removed from the pipeline above, this would be the output (with n=5):

   NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21  { print >> "rndlines" }
!( NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) { print >> "leftover" }

So the random lines are saved in rndlines and the rest in leftover.

Thor

The roughly one thousand lines mentioned in the question should sort quite quickly, so this is a nice case for the Decorate-Sort-Undecorate pattern. It actually creates two new files; removing lines from the original one can be simulated by renaming.

Note: head and tail cannot be used instead of awk here, because they close the file descriptor after the given number of lines, making tee exit early and thus causing missing data in the .rest file.

FILE=input.txt
SAMPLE=10
SEP=$'\t'

<"$FILE" nl -s "$SEP" -nln -w1 |
  sort -R |
  tee \
    >(awk "NR >  $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
    >(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null

# check the results
wc -l $FILE*

# 'remove' the lines, if needed
mv $FILE.rest $FILE
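
If the decoration step is unclear, here is what nl produces on a toy input (illustrative; SEP is the tab defined above):

printf 'a\nb\nc\n' | nl -s "$SEP" -nln -w1
# 1	a
# 2	b
# 3	c

The two sort -k1n,1 / cut stages later use that leading number to restore the original order and strip the decoration.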
liborm

This might work for you (GNU sed, sort and seq):

n=10
seq 1 $(sed '$=;d' input_file) |    # emit 1..N, where N is the line count of input_file
sort -R |                           # shuffle the line numbers
sed "${n}q" |                       # keep the first n of them
sed 's/.*/&{w output_file\nd}/' |   # turn each number into a write-and-delete sed command
sed -i -f - input_file              # run the generated script on input_file in place

Where $n is the number of lines to extract.
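
To see what the final sed -i consumes, suppose n=2 and the shuffled numbers that survive are 12 and 3; the generated script would be:

12{w output_file
d}
3{w output_file
d}

Each line whose number matches is written (w) to output_file and then deleted (d) from input_file, all in a single pass.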

potong