
I have a dataset with 40K entries; each entry looks like the following:

product/productId: B00004CK40   review/userId: A39IIHQF18YGZA   review/profileName: C. A. M. Salas  review/helpfulness: 0/0 review/score: 4.0   review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.

I'm using Spark to build a recommendation engine using:

org.apache.spark.mllib.recommendation.ALS;
org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
org.apache.spark.mllib.recommendation.Rating;

I want to know the best practice for converting a userId & productId String into a unique Integer in distributed mode using Spark. The reason I want to convert to Integers is that the recommendation.Rating constructor is as follows:

Rating(int user, int product, double rating) 

In addition, I'll need to preserve the mapping in order to go back from the Integer to the appropriate movieId when I return the recommended movieIds for a certain user, so that a client could enter a userId String and the output would be the top-10 recommended movieId Strings.

Code snippet for building the Rating RDD:

JavaRDD<Rating> ratings = movieData.map(
        new Function<String,Rating>() {
            @Override
            public Rating call(String s) throws Exception {
                //getting entry data
                int movieIdConverted, userIdConverted;
                String[] data = s.split("\t");
                String movieId = data[0].split(":")[1].trim();
                String userId = data[1].split(":")[1].trim();
                String movieScore = data[4].split(":")[1].trim();

                if(movieIdsHashMapMirror.containsKey(movieId)) {
                    movieIdConverted = movieIdsHashMapMirror.get(movieId);
                } else {
                    // saving the movieId mapping
                    movieIdConverted = movieIdCounter;
                    movieIdsHashMap.put(movieIdCounter, movieId);
                    movieIdsHashMapMirror.put(movieId, movieIdCounter);
                    movieIdCounter++;
                }

                if(userIdsHashMapMirror.containsKey(userId)) {
                    userIdConverted = userIdsHashMapMirror.get(userId);
                } else {
                    // saving the userId mapping
                    userIdConverted = userIdCounter;
                    userIdsHashMapMirror.put(userId, userIdCounter);
                    userIdCounter++;
                }
                //Rating(user: Int, product(movieId): Int, rating: Double)
                Rating rating = new Rating(userIdConverted, movieIdConverted, Double.parseDouble(movieScore));
                return rating;
            }
        }
);

The data structures used to preserve the mappings:

// Integer -> movieId (used to go back from the Integer to the original String)
private static HashMap<Integer, String> movieIdsHashMap = new HashMap<Integer, String>();
// movieId -> Integer
private static HashMap<String, Integer> movieIdsHashMapMirror = new HashMap<String, Integer>();
// userId -> Integer
private static HashMap<String, Integer> userIdsHashMapMirror = new HashMap<String, Integer>();

private static int movieIdCounter = 0;
private static int userIdCounter = 0;
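
For comparison, an approach that avoids mutating driver-side HashMaps and counters inside the map function is to assign the Integers with zipWithIndex on the distinct IDs (the approach mentioned in the comments below). The following is only a minimal sketch, assuming Java 8 lambdas and an existing JavaSparkContext sc; the names parsed, movieIdToInt, userIdToInt, movieMapB and userMapB are made up for illustration:

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

// movieData is the same JavaRDD<String> as above; sc is an assumed JavaSparkContext
JavaRDD<String[]> parsed = movieData.map(s -> s.split("\t"));

// Assign a unique, contiguous Integer to every distinct movieId and userId
JavaPairRDD<String, Integer> movieIdToInt = parsed
        .map(f -> f[0].split(":")[1].trim())
        .distinct()
        .zipWithIndex()
        .mapToPair(t -> new Tuple2<String, Integer>(t._1(), t._2().intValue()));

JavaPairRDD<String, Integer> userIdToInt = parsed
        .map(f -> f[1].split(":")[1].trim())
        .distinct()
        .zipWithIndex()
        .mapToPair(t -> new Tuple2<String, Integer>(t._1(), t._2().intValue()));

// With ~40K entries the mappings are small enough to collect and broadcast;
// for larger data they would instead be joined back onto parsed
Broadcast<Map<String, Integer>> movieMapB = sc.broadcast(movieIdToInt.collectAsMap());
Broadcast<Map<String, Integer>> userMapB = sc.broadcast(userIdToInt.collectAsMap());

// Build the Rating RDD without touching any driver-side state inside the closure
JavaRDD<Rating> ratings = parsed.map(f -> new Rating(
        userMapB.value().get(f[1].split(":")[1].trim()),
        movieMapB.value().get(f[0].split(":")[1].trim()),
        Double.parseDouble(f[4].split(":")[1].trim())));

Since zipWithIndex assigns 0..N-1 per distinct key, the resulting Integers are unique and contiguous, and both directions of the mapping remain available (as RDDs or broadcast maps) for translating results back later.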

Collaborative Filtering: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

Jay
  • You need a central registry of some kind with which to map IDs to integers. A Java `int` is not wide enough to guarantee uniqueness for an algorithmic conversion from ID strings of the lengths you present to integer. – John Bollinger Jun 23 '16 at 18:54
  • @JohnBollinger what do you mean by not wide enough? I'm holding a counter and a HashMap; I check if a String exists in the HashMap and, if not, I add it with the current counter value and increase the counter. How does this not guarantee uniqueness? – Jay Jun 23 '16 at 19:26
  • I mean there are fewer distinct `int`s than there are possible ID strings. Thus, your mapping cannot be embodied by an algorithm with the Id string as its only input if it must be certain never to produce duplicate results for any of the IDs in the system. You need some kind of central registry, as I said, so that the mapping can take into account the IDs that have already been mapped. – John Bollinger Jun 23 '16 at 19:39
  • Inasmuch as you already have such a thing in the form of your hash map, and a strategy for using it, I'm uncertain what the point of your question is. – John Bollinger Jun 23 '16 at 19:41
  • @JohnBollinger thanks for your response. I've read a lot of articles about this; some people have approached the problem by using zipWithIndex after applying a distinct() operation on the RDD. – Jay Jun 23 '16 at 20:02
  • @JohnBollinger that solved my issue : http://stackoverflow.com/questions/27772769/how-to-use-mllib-recommendation-if-the-user-ids-are-string-instead-of-contiguous – Jay Jun 23 '16 at 21:07
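
Following up on the zipWithIndex approach from the comments, translating the model's Integer output back to movieId Strings for a client-supplied userId could then look roughly like this. This is a sketch only: model is assumed to be a MatrixFactorizationModel trained (e.g. with ALS.train) on the ratings RDD above, userIdToInt and movieIdToInt are the mapping RDDs from the earlier sketch, and queryUserId is a hypothetical client input:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

// Look up the Integer key assigned to the requested user
int userInt = userIdToInt.lookup(queryUserId).get(0);

// Ask the trained model for the top-10 products for that user
Rating[] top10 = model.recommendProducts(userInt, 10);

// Invert the movie mapping (small enough to collect for ~40K entries)
Map<Integer, String> intToMovieId = movieIdToInt
        .mapToPair(t -> new Tuple2<Integer, String>(t._2(), t._1()))
        .collectAsMap();

// Translate the recommended Integer product IDs back to movieId Strings
List<String> recommendedMovieIds = new ArrayList<String>();
for (Rating r : top10) {
    recommendedMovieIds.add(intToMovieId.get(r.product()));
}

recommendProducts returns the top-num Rating objects for the given user, so only the product field needs to be translated back through the Integer-to-String map.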
