
I'm just beginning an assignment on Huffman encoding. The first step is to implement some form of file handling that will read in the file to be processed and then perform frequency counting of the characters.

I have several different text files to test this against; between them they contain letters (uppercase and lowercase), numbers, symbols, etc.

Here is what I have so far:

import java.io.*;

public class LetterFrequency {
    public static void main(String[] args) throws IOException {
        File txtfile = new File("10000random.txt");
        System.out.println("Letter Frequency:");

        // One counter per lowercase letter a-z.
        int[] count = new int[26];

        // try-with-resources closes the reader even if read() throws.
        try (BufferedReader in = new BufferedReader(new FileReader(txtfile))) {
            int nextChar;
            while ((nextChar = in.read()) != -1) {
                char ch = (char) nextChar;
                if (ch >= 'a' && ch <= 'z') {
                    count[ch - 'a']++;
                }
            }
        }

        // Print each lowercase letter with its count, one per line.
        for (int i = 0; i < 26; i++) {
            System.out.printf("%c %d%n", i + 'a', count[i]);
        }
    }
}

This is obviously a basic version (it only handles a-z). How would I change it to include uppercase letters, numbers, symbols, etc.? It doesn't seem right to have to guess the size of the array.

Apologies if this is an obvious question, I'm still learning! Thank you

Chid
  • Why not create different arrays for uppercase, lowercase, numbers, symbols, etc? – denvercoder9 Nov 29 '16 at 15:52
  • 1
    alternatively you could use map to store this without needing to know characters u encounter – nafas Nov 29 '16 at 15:55
  • @nafas but that will not keep the characters in lexicographical order. – denvercoder9 Nov 29 '16 at 15:56
  • @RafiduzzamanSonnet The question doesn't say the characters have to be ordered; besides, a map can be sorted however it needs to be. – nafas Nov 29 '16 at 16:01
  • There is no text but encoded text. If you are being given the files as text files, you must also be given the character encoding used to make them and then read them with a text library (such as Scanner) using that encoding (for example UTF-8). Regardless, if you are using Java's `String`, `Character` or `char`, your code has to deal with UTF-16 code units, representing Unicode codepoints, representing [graphemes](http://stackoverflow.com/a/27331885/2226988). Or, it might be far simpler to take the files as byte sequences and perform frequency counting on the byte values. – Tom Blodget Nov 29 '16 at 23:41
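The byte-counting idea from the last comment could look roughly like the sketch below. The class and method names (`ByteFrequency`, `countBytes`) are made up for illustration, and the sample data stands in for the real file contents.

```java
import java.nio.charset.StandardCharsets;

class ByteFrequency {
    // Counts how often each byte value (0-255) occurs in the data.
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            // Java bytes are signed, so mask to get an index in 0-255.
            counts[b & 0xFF]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input; a real run would load the file's bytes instead.
        byte[] sample = "abca".getBytes(StandardCharsets.US_ASCII);
        int[] counts = countBytes(sample);
        for (int value = 0; value < counts.length; value++) {
            if (counts[value] > 0) {
                System.out.printf("%d %d%n", value, counts[value]);
            }
        }
    }
}
```

For an actual file, the bytes could be loaded with `java.nio.file.Files.readAllBytes(Paths.get("10000random.txt"))`. Since a byte has exactly 256 possible values, the array size is no longer a guess.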

2 Answers


Are you supporting both single-byte and double-byte characters? Only ASCII characters?

If only ASCII, you need (26 * 2) + 10 = 62 counters to cover all lowercase letters, uppercase letters and digits. Covering every ASCII character, symbols included, takes 128.
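As a rough sketch of the ASCII-only case, the array can be indexed directly by character code, assuming the input really is plain ASCII. The names `AsciiFrequency` and `countAscii` are hypothetical, and anything outside the ASCII range is simply skipped here.

```java
class AsciiFrequency {
    // ASCII defines code points 0-127.
    static final int ASCII_SIZE = 128;

    // Counts each ASCII character in the text; non-ASCII chars are ignored.
    static int[] countAscii(CharSequence text) {
        int[] counts = new int[ASCII_SIZE];
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < ASCII_SIZE) {
                counts[ch]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countAscii("Hello, World! 123");
        // Print only the printable characters that actually occurred.
        for (int code = 32; code < 127; code++) {
            if (counts[code] > 0) {
                System.out.printf("%c %d%n", code, counts[code]);
            }
        }
    }
}
```

This covers uppercase, lowercase, digits and symbols with one fixed-size array, at the cost of ignoring anything that is not ASCII.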

If you are covering more than just ASCII, you can use a Map rather than an array.

Map<Integer, AtomicInteger> map = new HashMap<>();
...
// Cast to int: a char argument will not auto-box to an Integer key.
map.computeIfAbsent((int) ch, c -> new AtomicInteger()).getAndIncrement();
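A fuller version of the map idea might look like the sketch below. It swaps in a `TreeMap` so the output comes back sorted by character code, and it counts Unicode code points via `String.codePoints()`; both are my choices rather than part of the snippet above, and the class name `CodePointFrequency` is made up.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

class CodePointFrequency {
    // Counts every code point in the text; keys are kept in ascending order.
    static Map<Integer, AtomicInteger> count(String text) {
        Map<Integer, AtomicInteger> freq = new TreeMap<>();
        text.codePoints().forEach(cp ->
                freq.computeIfAbsent(cp, k -> new AtomicInteger()).getAndIncrement());
        return freq;
    }

    public static void main(String[] args) {
        Map<Integer, AtomicInteger> freq = count("Hello, World!");
        for (Map.Entry<Integer, AtomicInteger> e : freq.entrySet()) {
            // Convert the code point back to text for display.
            String symbol = new String(Character.toChars(e.getKey()));
            System.out.printf("%s %d%n", symbol, e.getValue().get());
        }
    }
}
```

No array size has to be chosen up front: the map only holds entries for characters that actually appear in the input.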
zcarioca
String letterAsString = (ch+"").toUpperCase();

That's one option if you want uppercase and lowercase letters to be counted together: convert each character to upper case before counting it.
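A minimal sketch of that case-folding approach, using `Character.toUpperCase` on the char instead of building a String for every character. The names `CaseInsensitiveFrequency` and `countLetters` are hypothetical, and only ASCII letters are handled.

```java
class CaseInsensitiveFrequency {
    // Counts ASCII letters, treating 'a' and 'A' as the same symbol.
    static int[] countLetters(CharSequence text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            // Skip non-ASCII input so characters such as the dotless i
            // are not folded into an ASCII letter by toUpperCase.
            if (ch >= 128) {
                continue;
            }
            char upper = Character.toUpperCase(ch);
            if (upper >= 'A' && upper <= 'Z') {
                counts[upper - 'A']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("Hello, World!");
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("%c %d%n", 'A' + i, counts[i]);
        }
    }
}
```

Bear in mind that merging cases throws away information: if the counts feed a Huffman encoder, the decoded text could no longer distinguish 'a' from 'A', so this probably only suits a pure letter-frequency report.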

Arol