Load Word2Vec model in Spark

Question

Is it possible to load a pretrained (binary) model into Spark (using Scala)? I have tried to load one of the binary models published by Google like this:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}


    val model = Word2VecModel.load(sc, "GoogleNews-vectors-negative300.bin")

but it cannot locate the metadata directory. I also created that folder and put the binary file in it, but the file cannot be parsed. I have not found any wrapper that handles this.

Andrew Charneski · Answer 1 · 2017-10-03T20:09:45.160

I wrote a quick function to load the Google News pretrained model into a Spark Word2Vec model. Enjoy.

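    import java.io.{DataInputStream, FileInputStream}
    import java.util.zip.GZIPInputStream
    import org.apache.spark.mllib.feature.Word2VecModel

    // Reads a gzip-compressed word2vec binary file into an MLlib Word2VecModel.
    // For an uncompressed .bin file, drop the GZIPInputStream wrapper.
    def loadBin(file: String): Word2VecModel = {
      // Reads bytes as single-byte characters up to (not including) `term`.
      // Non-ASCII words are not decoded correctly by this byte-to-char widening.
      def readUntil(inputStream: DataInputStream, term: Char, maxLength: Int = 1024 * 8): String = {
        var char: Char = inputStream.readByte().toChar
        val str = new StringBuilder
        while (char != term) {
          str.append(char)
          assert(str.size < maxLength)
          char = inputStream.readByte().toChar
        }
        str.toString
      }
      val inputStream = new DataInputStream(new GZIPInputStream(new FileInputStream(file)))
      try {
        // Header line: "<number of words> <vector size>\n"
        val header = readUntil(inputStream, '\n')
        val (records, dimensions) = header.split(" ") match {
          case Array(records, dimensions) => (records.toInt, dimensions.toInt)
        }
        new Word2VecModel((0 until records).map { _ =>
          // Each record is the word, a space, then `dimensions` little-endian floats.
          // trim drops the newline that some files write after each vector.
          val word = readUntil(inputStream, ' ').trim
          // readInt is big-endian, so reverse the bytes before reinterpreting them as a float.
          val vector = Array.fill(dimensions) {
            java.lang.Float.intBitsToFloat(java.lang.Integer.reverseBytes(inputStream.readInt()))
          }
          word -> vector
        }.toMap)
      } finally {
        inputStream.close()
      }
    }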
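
To make the file layout that the function above reads easier to check without a multi-gigabyte download, here is a small sketch that writes and re-reads the same layout in memory: a text header with the word count and vector size, then each word followed by a space and little-endian 32-bit floats. The object `Word2VecBinFormat` and its `encode`/`decode` methods are hypothetical names for illustration; it uses only the JDK, has no Spark dependency, and assumes ASCII words.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark: round-trips the word2vec binary layout in memory.
object Word2VecBinFormat {
  // Writes "<count> <dim>\n", then for each word "<word> " plus <dim> little-endian floats.
  def encode(vectors: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dim = vectors.headOption.map(_._2.length).getOrElse(0)
    out.write(s"${vectors.size} $dim\n".getBytes(StandardCharsets.UTF_8))
    for ((word, vec) <- vectors) {
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      val buf = ByteBuffer.allocate(4 * vec.length).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(f => buf.putFloat(f))
      out.write(buf.array())
    }
    out.toByteArray
  }

  // Reads the layout back, using the same byte-reversal trick as the answer's loader.
  def decode(bytes: Array[Byte]): Map[String, Array[Float]] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    // Reads single-byte characters up to (not including) the terminator.
    def readUntil(term: Char): String = {
      val sb = new StringBuilder
      var c = in.readByte().toChar
      while (c != term) {
        sb.append(c)
        c = in.readByte().toChar
      }
      sb.toString
    }
    val Array(count, dim) = readUntil('\n').split(" ").map(_.toInt)
    (0 until count).map { _ =>
      val word = readUntil(' ').trim
      // DataInputStream.readInt is big-endian, so the bytes are reversed first.
      val vec = Array.fill(dim)(
        java.lang.Float.intBitsToFloat(java.lang.Integer.reverseBytes(in.readInt())))
      word -> vec
    }.toMap
  }
}
```

Decoding the output of `encode` should return the same words and vectors, which is a quick way to sanity-check the parsing logic before pointing it at the real file.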

What about fastText? How can we load a fastText .bin on each executor once? I tried to do that, but the model is loaded per partition, which is not good when there is a high number of partitions. – bib, Feb 20 '19 at 05:27
It sounds like you need to use broadcast: load the model once, on the driver, then distribute it via a broadcast wrapper. – Andrew Charneski, Mar 10 '19 at 18:01
@Andrew Charneski can you please look at my question about the same subject: https://stackoverflow.com/questions/54540970/how-to-load-a-file-in-each-executor-once?noredirect=1#comment96563878_54540970 – bib, Mar 11 '19 at 12:53
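
The broadcast suggestion in the comments above could look roughly like the following. This is a minimal sketch, assuming the model can be reduced to a serializable word-to-vector map (for the MLlib model, `getVectors` returns one) and that Spark is on the classpath. The object `BroadcastVectors` and its `lookup` method are hypothetical names. Whether a fastText model object can be broadcast the same way depends on the binding being serializable, which is not covered here.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical helper: ships the vectors to executors through a broadcast variable
// instead of reloading the model inside every partition.
object BroadcastVectors {
  def lookup(
      sc: SparkContext,
      vectors: Map[String, Array[Float]],
      words: Seq[String]): Array[(String, Option[Array[Float]])] = {
    // Created once on the driver.
    val bc: Broadcast[Map[String, Array[Float]]] = sc.broadcast(vectors)
    sc.parallelize(words)
      .mapPartitions { iter =>
        // Tasks read the broadcast value rather than loading the model themselves.
        val local = bc.value
        iter.map(word => word -> local.get(word))
      }
      .collect()
  }
}
```

With the loader from the answer, a call might be `BroadcastVectors.lookup(sc, loadBin(path).getVectors, Seq("king", "queen"))`, where `path` stands for wherever the gzipped file lives.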

score 0 · Answer 2 · edited May 23 '17 at 12:26

It is an unresolved issue: https://issues.apache.org/jira/browse/SPARK-15328

Either look at the code referenced in that issue and try to write your own loader, or use a Python or C script to convert the binary file to text and work from there.

Convert word2vec bin file to text

answered May 09 '17 at 14:05 by Tom Lous; edited May 23 '17 at 12:26 by Community

After converting the bin file to a text file, how should I load the model? – LonsomeHell Jul 27 '17 at 10:35
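
One possible way to load the converted file, sketched under the assumption that the converter wrote the usual word2vec text format: a first line with the word count and vector size, then one line per word with the word followed by its space-separated values. The object `Word2VecText` and its methods are hypothetical names, and the parsing itself needs no Spark.

```scala
import scala.io.Source

// Hypothetical helper: parses word2vec text-format lines into a word -> vector map.
object Word2VecText {
  def parse(lines: Iterator[String]): Map[String, Array[Float]] = {
    // Header line: "<number of words> <vector size>"; only the size is used here.
    val dim = lines.next().trim.split(" ")(1).toInt
    lines.filter(_.trim.nonEmpty).map { line =>
      val parts = line.trim.split(" ")
      require(parts.length == dim + 1, s"expected ${dim + 1} fields, got ${parts.length}")
      parts.head -> parts.tail.map(_.toFloat)
    }.toMap
  }

  // Reads the whole file on the driver; toMap forces evaluation before the file is closed.
  def parseFile(path: String): Map[String, Array[Float]] = {
    val source = Source.fromFile(path, "UTF-8")
    try parse(source.getLines()) finally source.close()
  }
}
```

The resulting map can then be passed to `new Word2VecModel(...)`, the same constructor the first answer uses.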
