My goal is to iterate over all bitstrings read from a text file so I can compute the Hamming distance (HD) between every pair of strings. For example, I have a txt file that contains 500 bitstrings, each of length 5093. I would like to read strings s1 and s2 from the file, then compute the Hamming distance between them. Essentially, I want the HD for all 500*499/2 = 124,750 pairs so I can compute the mean and standard deviation and plot a histogram.

I was able to do this in Python by using readlines() to read the strings into a list, then using a for loop over the s1 strings with a nested for loop over the s2 strings. Now I'm re-approaching the problem to brush up on my C.

My current approach iterates through the file in a similar fashion: it reads the bitstrings using two calls to fgets(), then strips the trailing newline. The problem is that when I call the second fgets() to get s2, the ends of the bitstrings are cut ~300 characters short, and I compute the Hamming distance 499 times instead of the expected 124,750 times. When I use fgets() once and comment out the nested while loop, I'm able to read the full bitstring.

If you could help me understand the problem with my implementation and the proper approach to achieving my goal, it would be greatly appreciated. Thanks!

EDIT: Initialized the variables, and reset both i and hd for HD calculation. Provided a similar example of the txt file containing the bitstrings. In this example, there are 4 bitstrings of length 16 instead of 500 bitstrings of length 5093. In this case, the goal is to calculate the HD of all 6 combinations of bitstring pairs.

sample txt file

0011010000111010
1001001001110100
1110110010000100
0111011011111001

Code

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define SIZE 6000
#define CHIPS 500

int main(int argc, char *argv[]) {

    FILE* fp;

    char buffer[SIZE];
    char s1[SIZE];
    char s2[SIZE];
    int i = 0, j = 0, hd = 0;

    if(argc != 2) {
        fprintf(stderr, "USAGE: ./<executable> <bitstring file>\n");
        return 1;
    }
    else if ((fp = fopen(argv[1], "r")) == NULL) {
        perror("ERROR: File not found.\n");
        return 1;
    }
/*    for(i = 0; i < CHIPS; i++) {
        fgets(s1,sizeof(s1),fp);
        s1[strlen(s1) - 1] = '\0';
        printf("%s\n", s1);
        printf("%d\n", i);
        for(j = 0; j < CHIPS; j++) {
            fgets(s2, sizeof(s2),fp);
            s2[strlen(s2) - 1] = '\0';
            printf("%s\n", s2);
            printf("%d", j);
        }

    }
    fclose(fp);
*/
    while(fgets(s1,sizeof(s1), fp) != NULL) {
        //memcpy(s1,buffer, sizeof(s1));
        s1[strlen(s1) - 1] = '\0';
        printf("%s\n", s1);

        while(fgets(s2, sizeof(s2), fp) != NULL) {
            s2[strlen(s2) - 1] = '\0';

            while(s1[i] != '\0') {
                if(s1[i] != s2[i])
                    hd++;
                i++;
            }
            printf("Hamming Distance: %d\n", hd);
            i = 0;
            hd = 0;
        }

    }
    fclose(fp);

    return 0;
}

Sample output

...
Hamming Distance: 2576
pkitsos
  • In C, `int i, j, hd = 0;` initializes only `hd` to `0`. You need to write `int i=0, j=0, hd=0;` explicitly to initialize all of them. Otherwise the values of `i` and `j` will be indeterminate. –  Dec 24 '18 at 02:29
  • That only initializes `hd`. When you're declaring, you should initialize each variable individually, such as `int i=0, j=0, hd=0;`. Or you could declare them with `int i, j, hd;` and then initialize all at once with `i = j = hd = 0;`. – Yuri J Dec 24 '18 at 02:31
  • As you wrote, you first need to read in all strings (in one loop), then use a nested loop to get the Hamming distance of every pair. You are trying to combine the reading loop with the distance calculation loops. This won't work, because the file is read linearly. –  Dec 24 '18 at 02:33
  • Ahh, didn't know that. Thanks for the info. – pkitsos Dec 24 '18 at 02:33
  • Okay, so does that mean I read all the strings into an array using a single call to fgets() for the first loop. Then, iterate through all HD pairs in a separate nested loop? Or do I use a buffer in the first loop and retrieve the HD pairs in the nested loop using memcpy()? – pkitsos Dec 24 '18 at 02:45
  • `fgets()` is for reading _text_. Please post examples of what you call _bit-strings_ and how they are like _text_. Are they made of the characters `'0'`, `'1'`, `'\n'` or something else? – chux - Reinstate Monica Dec 24 '18 at 02:53
  • `s1[strlen(s1) - 1] = '\0';` has 2 weaknesses. If the input lacks a trailing `'\n'`, the operation lops off data. If, due to errant or nefarious input, `strlen(s1)==0` (which is possible), the code writes outside the `s1[]` range, which is a hacker exploit. See [Removing trailing newline character from fgets() input](https://stackoverflow.com/q/2693776/2410359) – chux - Reinstate Monica Dec 24 '18 at 02:58
  • I've edited my post to show a similar example of the txt file where there are 4 bitstrings of length 16. In this case the goal is to compute the HD of all 6 combinations of bitstring pairs. Also, you make a good point regarding the weakness of that line. @chux – pkitsos Dec 24 '18 at 03:14
  • Only the Hamming distances between the first bitstring and the others will be calculated. On the first iteration of the outer while loop, `fgets()` reads the first bitstring, and the inner while loop then reads the rest of the bitstrings. When control returns to the outer while loop for the second iteration, the file position is already at EOF, so `fgets()` returns `NULL`. Better to use a data structure to store all the strings first and then perform the operation. – Pranav Chaudhary Dec 24 '18 at 03:26
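The newline-stripping weakness raised in the comments can be avoided with `strcspn()`, which returns the index of the first line-ending character or of the terminating `'\0'` if there is none. The following is only a minimal sketch: the helper name `strip_newline` is hypothetical, and also stripping `'\r'` is an assumption that the file might have Windows line endings.

```c
#include <string.h>

/* Hypothetical helper: cut the string at the first '\r' or '\n'.
 * Safe for an empty string and for a last line with no newline,
 * because strcspn() then returns the index of the terminator.
 * Returns the length of the stripped string. */
size_t strip_newline(char *s) {
    size_t len = strcspn(s, "\r\n");
    s[len] = '\0';
    return len;
}
```

Unlike `s1[strlen(s1) - 1] = '\0';`, this never writes before the start of the buffer and never drops a data character when the final line lacks a newline.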

1 Answer

OP already understands (per the comments) the mistake of not initializing variables.

To loop through all N*(N-1)/2 pairs, a simple approach records the file offset just after the current s1 line. The code then seeks back to that offset at the start of each outer iteration.

More robust code would read all the strings into memory, but the code below is a quick-to-code alternative.

As with much code development, first concentrate on getting the function right and then improve performance.

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define SIZE 6000
#define CHIPS 500

int main(void) {
  FILE* fp;
  char s1[SIZE];
  char s2[SIZE];

  fp = fopen("junk.txt", "w");
  if (fp == NULL) {
    perror("ERROR: cannot create junk.txt");
    return 1;
  }
  fprintf(fp, "%s\n","0011010000111010");
  fprintf(fp, "%s\n","1001001001110100");
  fprintf(fp, "%s\n","1110110010000100");
  fprintf(fp, "%s\n","0111011011111001");
  fclose(fp);

  FILE *fp1 = fopen("junk.txt", "r");
  if (fp1 == NULL) {
    perror("ERROR: cannot open junk.txt");
    return 1;
  }

  long offset = 0;
  for (;;) {
    fseek(fp1, offset, SEEK_SET);
    if (fgets(s1, sizeof(s1), fp1) == NULL) break;
    s1[strcspn(s1, "\n")] = 0;
    offset = ftell(fp1);  // record location
    if (offset == -1) break;

    while (fgets(s2, sizeof(s2), fp1) != NULL) {
      s2[strcspn(s2, "\n")] = 0;
      size_t i = 0;
      size_t hd = 0;
      while (s1[i] >= '0' && s1[i] <= '1') {
        if (s1[i] != s2[i]) {
          hd++;
        }
        i++;
      }
      printf("s1 <%s> " "s2 <%s> " "Hamming Distance: %zu\n", s1 ,s2, hd);
    }

  }
  fclose(fp1);
  puts("Done");
  return 0;
}

Output: 6 Hamming distances, as expected from 4*3/2

s1 <0011010000111010> s2 <1001001001110100> Hamming Distance: 8
s1 <0011010000111010> s2 <1110110010000100> Hamming Distance: 10
s1 <0011010000111010> s2 <0111011011111001> Hamming Distance: 6
s1 <1001001001110100> s2 <1110110010000100> Hamming Distance: 10
s1 <1001001001110100> s2 <0111011011111001> Hamming Distance: 8
s1 <1110110010000100> s2 <0111011011111001> Hamming Distance: 10
Done
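
The "read all into internal memory" alternative mentioned above might look roughly like the following. This is a minimal sketch rather than the answer's code: the names `MAX_LEN`, `hamming_distance`, `read_lines`, and `all_pairs` are hypothetical, `MAX_LEN` is an assumed upper bound on line length (matching `SIZE` above), and blank lines are assumed to be skippable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 6000  /* assumed upper bound on line length, as SIZE above */

/* Count positions where a and b differ; stops at the end of the shorter one. */
size_t hamming_distance(const char *a, const char *b) {
    size_t hd = 0;
    for (size_t i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        if (a[i] != b[i]) {
            hd++;
        }
    }
    return hd;
}

/* Read up to max_lines non-empty lines from fp into malloc'd strings.
 * Returns the number of lines stored; the caller frees each lines[k]. */
size_t read_lines(FILE *fp, char **lines, size_t max_lines) {
    char buf[MAX_LEN];
    size_t n = 0;
    while (n < max_lines && fgets(buf, sizeof buf, fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';      /* strip the line ending */
        if (buf[0] == '\0') continue;          /* skip blank lines */
        lines[n] = malloc(strlen(buf) + 1);
        if (lines[n] == NULL) break;           /* out of memory: stop early */
        strcpy(lines[n], buf);
        n++;
    }
    return n;
}

/* Store the distance of every pair (i < j) in out[], which must have room
 * for n*(n-1)/2 entries. Returns the number of pairs written. */
size_t all_pairs(char **lines, size_t n, size_t *out) {
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            out[pairs++] = hamming_distance(lines[i], lines[j]);
        }
    }
    return pairs;
}
```

With this layout the file is read once by `read_lines()`, and `all_pairs()` then fills an array of 124,750 distances for 500 strings, from which the mean, standard deviation, and histogram can be computed. The trade-off is memory: roughly 500 * 5094 bytes for the strings, which is small on most systems.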
chux - Reinstate Monica