3

I am attempting to model a realistic social network (Facebook). I am a Computer Science Graduate student so I have a grasp on basic data structures and algorithms.

The Idea:
I began this project in java. My idea is to create multiple Areas of Users. Each User in a given area will have a random number of friends with a normal distribution around a given mean. Each User will have a large percentage or cluster of "Friends" from the Area that they belong to. The remainder of their "Friends" will be smaller clusters from a few different random Areas.

Initial Structure
I wanted to create an ArrayList of areas
ArrayList<Area> areas
With each Area holding an ArrayList of Users
ArrayList<User> users
And each User holding an ArrayList of "Friends"
ArrayList<User> friends

From there I can go through each Area, and each User in that Area and give that user most of their friends from that Area, as well as a few friends from a few random Areas. This is easy enough as long as my data set remains small.

The problem:
When I try to create large data sets, I get an OutOfMemoryError due to no more memory in the heap. I now realize that this way of doing it will be impossible if I want to create, say, 30 Area's with 1 millions users per area, and 200 friends per User. I eat up almost 2gb with 1 Area...So now what. My algorithm would work if I could create all the users ahead of time, then simply "give" friends to each user. But I need the Areas and Users created first. There needs to be a User in an Area before it can be made a "friend".

Next Step:
I like my algorithm, it is simple and easy to understand. What I need is a better way to store this data, since it cant be stored and held in memory all at once. I am going to need to not only access the Area a user belongs too, but also a few random areas as well, for each user.

My Questions:
1. What technology/data structure should I be putting this data into. In the end I basically want a User->Friends relationship. The "Area" idea is a way to make this relationship realistic.
2. Should I be using a different language all together. I know that technologies such as Lucene, Hadoop, etc. were created with Java, and are used for large amounts of data...But I have never used them and would like some guidance before I dive into something new.
3. Where should I begin? Obviously I cannot use only java with the data in memory. But I also need to create these Areas of Users before I can give a User a list of Friends.

Sorry for the semi-long read, but I wanted to lay out exactly where I am so you could guide me in the right direction. Thank you to everyone that took the time to read/help me with this topic.

Eddie
  • 179
  • 2
  • 14
  • *"tens of millions"* and "Java objects" are pretty much incompatible. You have two options: still work "in memory" but without using Java objects (Trove may help here, say if you can represent each user by a user ID fitting in an int, then Trove's TIntArrayList **shall** kick Java's ArrayList deep in the butt for it's not using Java objects but efficient primitives). This first option may work for "tens of millions" but probably not "hundreds of millions". Second option: use a DB. Plenty of choice there. But POJOs, you can indeed forget about it. – Gugussee Jan 12 '11 at 17:45
  • Have you looked at using 64-bit Java and set your memory high enough? Not to say this is the correct solution, but this way you can keep everything in memory. – James Black Jan 12 '11 at 17:45
  • 1
    Thank you Gugussee, I will certainly look into Trove's TIntArrayList. Since I would like this as scalable as possible some sort of DB may work best for me. As for a 64-bit vm, while it may work up to a point, I'm not sure how scalable that solution would be. – Eddie Jan 12 '11 at 17:56

4 Answers4

2

You need a searchable storage solution to hold your data (rather than holding it all in memory). Either a relational database (such as Oracle, MySQL, or SQL Server) with an O/RM (such as Hibernate) or a nosql database such as mongodb will work just fine.

Michael Brown
  • 9,041
  • 1
  • 28
  • 37
  • This is very true. The main goal is to have a searchable storage solution to hold the information...after all it needs to be usable. I guess I am unsure how to create the info to insert into this structure without creating the information in the first place. It sounds like a good place to start would be to study the power of Object-relational mapping. I suppose I would create these objects not in memory, but into a structure like you mentioned, then create the friend relationships from there? And wow, I appreciate all the responses from you guys!! – Eddie Jan 12 '11 at 18:19
0
  1. Use a database with some ORM tool[JPA with Hibernate etc.] ,
  2. Load data Lazily, when they are really needed
  3. Unload them when them from Cache/Session when they are not really required or inactive.

Feel comfortable to let me know in case there is any difficulty to understand.

http://puspendu.wordpress.com/

Puspendu Banerjee
  • 2,631
  • 16
  • 19
  • I am guessing that JPA is a java library. I am a database rookie and know only basic SQL. Would you use sql for this, and which flavor? I have also noticed that Hibernate is quite popular in the field, and would be a great thing for me to learn. Thank you for your input. – Eddie Jan 12 '11 at 18:09
  • @Eddie I would recommend you to startup with MySQL or Apache Derby as Database[Pesistent Storage] because they are opensource[read FREE] with numerous features. JPA and Hibernate uses its own query language based on object structure to make developer's life easier with a number of aspects. – Puspendu Banerjee Jan 13 '11 at 17:07
0

There is probably no benefit keeping it all in memory, unless you are planning on using every node in some visual algorithm to display relationships.

So, if you use a database then you can build your relationships, give random demographic information, if you want to model that also, and then it is a matter of just writing your queries.

But, if you do need a large amount of data then by using 64-bit Java then you can set the memory to a much larger number, depending on what is on your computer.

So, once you built your relationships, then you can begin to write the queries to relate the information in different ways.

You may want to look at using Lists instead of Arrays, when sizes are different, so that you aren't wasting memory when you read the data back. I expect that is the main reason you are running out of memory, if you assume that there are 100 users and the largest number of friends for any of these is 50, but most will have 10, then for the vast majority of users you are wasting space, especially when you are dealing with millions, as the pointer for each object will become non-trivial.

You may want to re-examine your data structures, I expect you have some ineffiencies there.

You may also want to use some monitoring tools, and this page may help: http://www.scribd.com/doc/42817553/Java-Performance-Monitoring

Even something as simple as jconsole would help you to see what is going on with your application.

James Black
  • 41,583
  • 10
  • 86
  • 166
  • You are right, I will never need every node. I will be working with friend relationships but not all at once. I guess my problem is creating these relationships without having the Users exist in the first place. – Eddie Jan 12 '11 at 18:13
0

Well you are not breaking new ground here and there are a lot of existing models that you can pull great amounts of information from and tailor to suit your needs. Especially if you are open to the technologies used. I understand your desire to have it fill this huge number from the start but keep in mind a solid foundation can be built upon and changed as needed without a complete rewrite.

There is some good info and many links to additional good info as to what FB, LinkedIn, Digg, and others are doing here at Stackoverflow question 1009025

Community
  • 1
  • 1
swisscheese
  • 1,765
  • 4
  • 23
  • 25