0

I have the following code to choose 2 random lines from a file containing lines of the form ip:port:

import (
  "os"
   "fmt"
"math/rand"
"log"
"time"
"unicode/utf8"

//"bufio"
)
func main() {
fmt.Println("num bytes in line is: \n", utf8.RuneCountInString("10.244.1.8:8080"))
file_pods_array, err_file_pods_array := os.Open("pods_array.txt")
if err_file_pods_array != nil {
        log.Fatalf("failed opening file: %s", err_file_pods_array)
}
//16 = num of bytes in ip:port pair
randsource := rand.NewSource(time.Now().UnixNano())
                randgenerator := rand.New(randsource)
                firstLoc := randgenerator.Intn(10)
                secondLoc := randgenerator.Intn(10)
                candidate1 := ""
                candidate2 := ""
num_bytes_from_start_first := 16 * (firstLoc + 1)
num_bytes_from_start_second := 16 * (secondLoc + 1)
    buf_ipport_first := make([]byte, int64(15))
    buf_ipport_second := make([]byte, int64(15))
    start_first := int64(num_bytes_from_start_first)
    start_second := int64(num_bytes_from_start_second)
    _, err_first := file_pods_array.ReadAt(buf_ipport_first, start_first)
    first_ipport_ep := buf_ipport_first
    if err_first == nil {
            candidate1 = string(first_ipport_ep)
    }
    _, err_second := file_pods_array.ReadAt(buf_ipport_second, start_second)
    second_ipport_ep := buf_ipport_second

    if err_second == nil {
            candidate2 = string(second_ipport_ep)
    }
fmt.Println("first is: ", candidate1)
fmt.Println("sec is: ", candidate2)
}

This sometimes prints empty or partial lines. Why does this happen and how can I fix it?

Output example:

num bytes in line is:
 15
first is: 10.244.1.17:808
sec is:
10.244.1.11:80

Thank you.

raizik
  • 19
  • 6

2 Answers2

1

If your lines were of a fixed length you could do this in constant time.

  1. Length of each line is L.
  2. Check the size of the file, S.
  3. Divide S/L to get the number of lines N.
  4. Pick a random number R from 0 to N-1.
  5. Seek to R*L in the file.
  6. Read L bytes.

But you don't have fixed length lines. We can't do constant time, but we can do it in constant memory and O(n) time using the technique from The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

  1. Read a line. Remember its line number M.
  2. Pick a random number from 1 to M.
  3. If it's 1, remember this line.

That is, as you read each line you have a 1/M chance of picking it. Cumulatively this adds up to 1/N for every line.

If we have three lines, the first line has a 1/1 chance of being picked. Then a 1/2 chance of remaining. Then a 2/3 chance of remaining. Total chance: 1 * 1/2 * 2/3 = 1/3.

The second line has a 1/2 chance of being picked and a 2/3 chance of remaining. Total chance: 1/2 * 2/3 = 1/3.

The third line has a 1/3 chance of being picked.

package main

import(
    "bufio"
    "fmt"
    "os"
    "log"
    "math/rand"
    "time"
);

func main() {
    file, err := os.Open("pods_array.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)

    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)

    lineNum := 1
    var pick string
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Printf("Considering %v at 1/%v.\n", scanner.Text(), lineNum)

        // Instead of 1 to N it's 0 to N-1
        roll := randgenerator.Intn(lineNum)        
        fmt.Printf("We rolled a %v.\n", roll)
        if roll == 0 {
            fmt.Printf("Picking line.\n")
            pick = line
        }

        lineNum += 1
    }

    fmt.Printf("Picked: %v\n", pick)
}

Because rand.Intn(n) returns [0,n), that is from 0 to n-1, we check for 0, not 1.


Maybe you're thinking "what if I seek to a random point in the file and then read the next full line?" That wouldn't quite be constant time, it would beO(longest-line), but it wouldn't be truly random. Longer lines would get picked more frequently.


Note that since these are (I assume) all IP addresses and ports you could have constant record lengths. Store the IPv4 address as a 32 bits and the port as a 16 bits. 48 bits per line.

However, this will break on IPv6. For forward compatibility store everything as IPv6: 128 bits for the IP and 16 bits for the port. 144 bits per line. Convert IPv4 addresses to IPv6 for storage.

This will allow you to pick random addresses in constant time, and it will save disk space.

Alternatively, store them in SQLite.

Schwern
  • 153,029
  • 25
  • 195
  • 336
0

found a solution using ioutil and strings:

func main() {
randsource := rand.NewSource(time.Now().UnixNano())
                randgenerator := rand.New(randsource)
                firstLoc := randgenerator.Intn(10)
                secondLoc := randgenerator.Intn(10)
                candidate1 := ""
                candidate2 := ""

        dat, err := ioutil.ReadFile("pods_array.txt")
        if err == nil {
                ascii := string(dat)
                splt := strings.Split(ascii, "\n")
candidate1 = splt[firstLoc]
candidate2 = splt[secondLoc]
        }
fmt.Println(candidate1)
fmt.Println(candidate2)
}

Output

10.244.1.3:8080
10.244.1.11:8080
raizik
  • 19
  • 6
  • This reads the entire file into memory, splits the resulting string into an array of lines, and then picks two random strings from that array. It is not constant time, it is O(n) memory and O(n) time. – Schwern Feb 06 '20 at 19:14
  • right, my bad. I eventually did an external bash script (that's called from a python script) that writes two random lines from the file and runs in the background and in in Go I read from a single-line file. – raizik Feb 06 '20 at 19:43
  • You're calling Bash from Python from Go? – Schwern Feb 06 '20 at 19:56
  • No, I'm calling Bash from python and the Python script is the client requests/jobs that are handled by Go and the Bash is run in parallel with the jobs sent. – raizik Feb 06 '20 at 20:09
  • Doing this in a bash script also will not be in constant time, and calling it will be slower and more fragile. – Schwern Feb 06 '20 at 20:12
  • Why? I used ```shuf ``` and the reading in Go is done in constant time, no? – raizik Feb 06 '20 at 20:33
  • I really don't get what you're trying to do here anymore. If it's an option to select a random line with any complexity if done concurrently just launch a goroutine to do it. But I'm telling you again: 160 bytes is *nothing*. You can read the whole thing with a single syscall. None of your "optimizations" matter. – Peter Feb 06 '20 at 21:08
  • @Rachelia Perhaps we have different meanings for "constant time". That means no matter how large the file gets the time to pick a random line remains constant. `shuf` will take longer for larger files, it has to read the whole file and shuffle the lines. `ioutil.ReadFile` also has to read the whole file which will take longer for larger files. `Split` has to split the whole read string back into lines which will take longer as the string gets longer. You can test this by timing larger and larger files. See https://en.wikipedia.org/wiki/Time_complexity – Schwern Feb 06 '20 at 21:20
  • I always have 160 bytes file. @Peter it is nothing in general perhaps, but in my case I have a Python script that sends 1000 HTTP requests to the Go code where I need to choose two random IPs and do some calculations on them in order to decide whose the next candidate to handle the request. Reading the whole file each time an IP is chosen in this way is wasteful. – raizik Feb 06 '20 at 21:29