0

I'm trying to implement Huffman compression. After encoding the char into 0 and 1, how do I write it to a file so it'll be compressed? Obviously simply writing the characters 0,1 will only make the file larger.

Let's say I have a string "0101010" which represents bits. I wish to write the string into file but as in binary, i.e. not the char '0' but the bit 0.

I tried using the getBytesArray() on the string but it seems to make no difference rather than simply writing the string.

jemiloii
  • 24,594
  • 7
  • 54
  • 83
ChikChak
  • 936
  • 19
  • 44
  • 1
    What......... do you mean by writing the "bit 0". I think you have misunderstood a bit about a bit. – sarveshseri Jan 18 '18 at 17:02
  • 1
    Write to a binary file. – ChikChak Jan 18 '18 at 17:06
  • 1
    Your example string has only 7 chars so only 7 bits. You can't write into a file in bits, you must write in bytes. So what do you expect to happen? Do you always really have a string of a multiply of 8 chars? Or do you have some logic to pad the last byte (presumable with `0`s or with `1`s)? Or something else? – SergGr Jan 18 '18 at 17:17
  • I'm trying to implement Huffman compression. After encoding the char into 0 and 1, how do I write it to a file so it'll be compressed? Obviously simply writing the characters 0,1 will only make the file larger. – ChikChak Jan 18 '18 at 18:20
  • 1
    Just to make this explicit - you cannot write less than a byte to file. Here's a bit describing how to pad Huffman-encoded Strings using an "EOF" code: https://www2.cs.duke.edu/csed/poop/huff/info/ – László van den Hoek Jan 18 '18 at 19:28
  • I think author presumes a correct number of bits -- that is, divisible by 8. – VasiliNovikov Jan 18 '18 at 19:30
  • Could you explain to me how the EOF padding should be done? It's enough to pad only this char, right? – ChikChak Jan 18 '18 at 19:38
  • I don't know Scala at all, so this maybe off base, but it sounds like you're expecting a string to be bits magically. But strings are already bits, bits arranged to be a series of characters, a process which you'll want to undo. You'll have to convert your string (of 8x1s and 0s) to either a character or an integer that does that represents these bits. Then save that char or integer to the file. https://alvinalexander.com/scala/how-cast-string-to-int-in-scala-string-int-conversion – Graham P Heath Jan 18 '18 at 20:01

3 Answers3

2

Although Sarvesh Kumar Singh code will probably work, it looks so Javish to me that I think this question needs one more answer. The way I imaging Huffman coding to be implemented in Scala is something like this:

import scala.collection._

type Bit = Boolean
type BitStream = Iterator[Bit]
type BitArray = Array[Bit]
type ByteStream = Iterator[Byte]
type CharStream = Iterator[Char]

case class EncodingMapping(charMapping: Map[Char, BitArray], eofCharMapping: BitArray)

def buildMapping(src: CharStream): EncodingMapping = {
  def accumulateStats(src: CharStream): Map[Char, Int] = ???

  def buildMappingImpl(stats: Map[Char, Int]): EncodingMapping = ???

  val stats = accumulateStats(src)
  buildMappingImpl(stats)
}

def convertBitsToBytes(bits: BitStream): ByteStream = {
  bits.grouped(8).map(bits => {
    val res = bits.foldLeft((0.toByte, 0))((acc, bit) => ((acc._1 * 2 + (if (bit) 1 else 0)).toByte, acc._2 + 1))
    // last byte might be less than 8 bits
    if (res._2 == 8)
      res._1
    else
      (res._1 << (8 - res._2)).toByte
  })
}

def encodeImpl(src: CharStream, mapping: EncodingMapping): ByteStream = {
  val mainData = src.flatMap(ch => mapping.charMapping(ch))
  val fullData = mainData ++ mapping.eofCharMapping
  convertBitsToBytes(fullData)
}

// can be used to encode String as src. Thanks to StringLike/StringOps extension
def encode(src: Iterable[Char]): (EncodingMapping, ByteStream) = {
  val mapping = buildMapping(src.iterator)
  val encoded = encode(src.iterator, mapping)
  (mapping, encoded)
}

def wrapClose[A <: java.io.Closeable, B](resource: A)(op: A => B): B = {
  try {
    op(resource)
  }
  finally {
    resource.close()
  }
}

def encodeFile(fileName: String): (EncodingMapping, ByteStream) = {
  // note in real life you probably want to specify file encoding as well
  val mapping = wrapClose(Source.fromFile(fileName))(buildMapping)
  val encoded = wrapClose(Source.fromFile(fileName))(file => encode(file, mapping))
  (mapping, encoded)
}

where in accumulateStats you find out how often each Char is present in the src and in buildMappingImpl (which is the main part of the whole Huffman encoding) you first build a tree from that stats and then use the create a fixed EncodingMapping. eofCharMapping is a mapping for the pseudo-EOF char as mentioned in one of the comments. Note that high-level encode methods return both EncodingMapping and ByteStream because in any real life scenario you want to save both.

The piece of logic specifically being asked is located in convertBitsToBytes method. Note that I use Boolean to represent single bit rather than Char and thus Iterator[Bit] (effectively Iterator[Boolean]) rather than String to represent a sequence of bits. The idea of the implementation is based on the grouped method that converts a BitStream into a stream of Bits grouped in a byte-sized groups (except possible for the last one).

IMHO the main advantage of this stream-oriented approach comparing to Sarvesh Kumar Singh answer is that you don't need to load the whole file into memory at once or store the whole encoded file in the memory. Note however that in such case you'll have to read the file twice: first time to build the EncodingMapping and second to apply it. Obviously if the file is small enough you can load it into the memory first and then convert ByteStream to Array[Byte] using .toArray call. But if your file is big, you can use just stream-based approach and easily save the ByteStream into a file using something like .foreach(b => out.write(b))

SergGr
  • 23,570
  • 2
  • 30
  • 51
1

I don't think this will help you achieve your Huffman compression goal, but in answer to your question:

string-to-value

Converting a String to the value it represents is pretty easy.

val x: Int = "10101010".foldLeft(0)(_*2 + _.asDigit)

Note: You'll have to check for formatting (only ones and zeros) and overflow (strings too long) before conversion.

value-to-file

There are a number of ways to write data to a file. Here's a simple one.

import java.io.{FileOutputStream, File}

val fos = new FileOutputStream(new File("output.dat"))
fos.write(x)
fos.flush()
fos.close()

Note: You'll want to catch any errors thrown.

jwvh
  • 50,871
  • 7
  • 38
  • 64
  • `"10101010".foldLeft(0)(_*2 + _.asDigit)` will fail for most strings large enough to be Huffman encodings of anything meaningful. – sarveshseri Jan 19 '18 at 06:09
  • @SarveshKumarSingh; you're right, of course, but, to be fair, I was trying to answer the original question, before @jemiloii "enhanced" (distorted) it with a modified title, tags, and opening paragraph. And I _thought_ I was being quite clear as to what I was trying to demonstrate (string->value conversion) and what was left unaddressed (actual Huffman coding). – jwvh Jan 19 '18 at 06:48
1

I will specify all the required imports first,

import java.io.{ File, FileInputStream, FileOutputStream}
import java.nio.file.Paths

import scala.collection.mutable.ArrayBuffer

Now, We are going to need following smaller units to achieve this whole thing,

1 - We need to be able to convert our binary string (eg. "01010") to Array[Byte],

def binaryStringToByteArray(binaryString: String) = {
  val byteBuffer = ArrayBuffer.empty[Byte]

  var byteStr = ""
  for (binaryChar <- binaryString) {
    if (byteStr.length < 7) {
      byteStr = byteStr + binaryChar
    }
    else {
      try{
        val byte = java.lang.Byte.parseByte(byteStr + binaryChar, 2)
        byteBuffer += byte
        byteStr = ""
      }
      catch {
        case ex: java.lang.NumberFormatException =>
          val byte = java.lang.Byte.parseByte(byteStr, 2)
          byteBuffer += byte
          byteStr = "" + binaryChar
      }
    }
  }

  if (!byteStr.isEmpty) {
    val byte = java.lang.Byte.parseByte(byteStr, 2)
    byteBuffer += byte
    byteStr = ""
  }

  byteBuffer.toArray
}

2 - We need to be able to open the file to serve in our little play,

def openFile(filePath: String): File = {
  val path = Paths.get(filePath)
  val file = path.toFile

  if (file.exists()) file.delete()

  if (!file.exists()) file.createNewFile()

  file
}

3 - We need to be able to write bytes to a file,

def writeBytesToFile(bytes: Array[Byte], file: File): Unit = {
  val fos = new FileOutputStream(file)
  fos.write(bytes)
  fos.close()
}

4 - We need to be able to read bytes back from the file,

def readBytesFromFile(file: File): Array[Byte] = {
  val fis = new FileInputStream(file)
  val bytes = new Array[Byte](file.length().toInt)
  fis.read(bytes)
  fis.close()
  bytes
}

5 - We need to be able convert bytes back to our binaryString,

def byteArrayToBinaryString(byteArray: Array[Byte]): String = {
  byteArray.map(b => b.toBinaryString).mkString("")
}

Now, we are ready to do every thing we want,

// lets say we had this binary string,
scala> val binaryString = "00101110011010101010101010101"
// binaryString: String = 00101110011010101010101010101

// Now, we need to "pad" this with a leading "1" to avoid byte related issues
scala> val paddedBinaryString = "1" + binaryString
// paddedBinaryString: String = 100101110011010101010101010101

// The file which we will use for this,
scala> val file = openFile("/tmp/a_bit")
// file: java.io.File = /tmp/a_bit

// convert our padded binary string to bytes
scala> val bytes = binaryStringToByteArray(paddedBinaryString)
// bytes: Array[Byte] = Array(75, 77, 85, 85)

// write the bytes to our file,
scala> writeBytesToFile(bytes, file)

// read bytes back from file,
scala> val bytesFromFile = readBytesFromFile(file)
// bytesFromFile: Array[Byte] = Array(75, 77, 85, 85)

// so now, we have our padded string back,
scala> val paddedBinaryStringFromFile = byteArrayToBinaryString(bytes)
// paddedBinaryStringFromFile: String = 1001011100110110101011010101

// remove that "1" from the front and we have our binaryString back,
scala> val binaryStringFromFile = paddedBinaryString.tail
// binaryStringFromFile: String = 00101110011010101010101010101

NOTE :: you may have to make few changes if you want to deal with very large "binary strings" (more than few millions of characters long) to improve performance or even be usable. For example - You will need to start using Streams or Iterators instead of Array[Byte].

sarveshseri
  • 13,738
  • 28
  • 47
  • 1
    There are a few things that I don't like in this code that made me provide my own answer. But there is one very tricky place which I believe is worth explicit mentioning: call `fis.read(bytes)` is not reliable. The [`read`](https://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#read(byte[])) returns `int` for a reason. And the really tricky part is that this code might work perfectly well while the file is relatively small and start produce weird output only on big files. See also https://stackoverflow.com/questions/326390/how-do-i-create-a-java-string-from-the-contents-of-a-file – SergGr Jan 19 '18 at 23:49
  • I agree with all your concerns. It is just that I wanted to keep this code as simple and beginner friendly as possible. – sarveshseri Jan 20 '18 at 20:04