Has there been any research in the field of data mining on classifying data which has a one-to-many relationship?

As an example of such a problem, say I am trying to predict which students are going to drop out of university based on their class grades and personal information. Obviously there is a one-to-many relationship between a student's personal information and the grades they achieved in their classes.

Obvious approaches include:

  1. Aggregate - The multiple records could be aggregated together in some way, reducing the problem to a basic classification problem. In the case of the student classification, the average of their grades could be combined with their personal data. While this solution is simple, key information is often lost. For example, what if most students who take organic chemistry and get below a C- end up dropping out, even if their grade average is above a B+?

  2. Voting - Create multiple classifiers (often weak ones) and have them cast votes to determine the overall class of the data in question. For example, two classifiers could be built: one for the student's course data and one for their personal data. Each course record would be passed to the course classifier and, based on the grade and the course name, the classifier would predict whether the student would drop out using that course record alone. The personal data record would be classified using the personal data classifier. Then all the course record predictions, along with the personal information record prediction, would be voted together. This voting could be done in a number of different ways, but most likely it would take into account how accurate each classifier is and how certain the classifier was of its vote. This scheme clearly allows for more complicated classification patterns than aggregation, but there is a lot of extra complexity involved, and if the voting is not performed well, accuracy can easily suffer.
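Both approaches can be sketched in a few lines; the feature names, grade points, and vote weights below are invented for illustration:

```python
# Sketch of the two approaches above; grade points, feature names and
# classifier weights are all invented for illustration.

def aggregate_features(personal, courses):
    """Approach 1: collapse the one-to-many course records into a
    fixed-size feature dict."""
    points = {'A': 4.0, 'B+': 3.3, 'B': 3.0, 'C-': 1.7}
    grades = [points[c['grade']] for c in courses]
    return {
        'age': personal['age'],
        'gpa': sum(grades) / len(grades),
        'min_grade': min(grades),        # keeps the "one bad course" signal
        'num_courses': len(grades),
    }

def vote(personal_pred, course_preds, personal_weight=2.0):
    """Approach 2: weighted vote over one prediction per record.
    Each prediction is a (label, confidence) pair; the weight given
    to the personal-data classifier is arbitrary."""
    tally = {personal_pred[0]: personal_weight * personal_pred[1]}
    for label, confidence in course_preds:
        tally[label] = tally.get(label, 0.0) + confidence
    return max(tally, key=tally.get)

courses = [{'name': 'organic_chemistry', 'grade': 'C-'},
           {'name': 'classical_physics', 'grade': 'B+'}]
features = aggregate_features({'age': 23}, courses)
print(features['min_grade'])  # the below-C- signal survives aggregation

print(vote(('stays', 0.6), [('drops_out', 0.9), ('stays', 0.2)]))  # → 'stays'
```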

So I am looking for other possible solutions to classifying data with a one-to-many relationship.

Nixuz

4 Answers


Why wouldn't you treat each grade as a separate feature of the same model?

student = {
    'age': 23,
    'gender': 'male',
    # ...
    'grade_in_organic_chemistry': 'B+',
    'grade_in_classical_physics': 'A-',
}

I guess I'm not seeing why you would want to "aggregate" or join together multiple classifiers when the grades can just be distinct features?

(Please excuse the lame pseudocode above; I'm just trying to demonstrate my point.)

Joe Holloway
  • Sorry if I didn't make this clear, but not all students take the same courses. So with your solution we would either be left with lots of null values in the record, or the records would not be standard for our classifier. – Nixuz Jan 21 '11 at 22:43
  • I guess what's not clear to me is what do you mean by "record"? Are you asking about how to store this student model in a RDBMS or how to model a student's feature set for classification? If it's the latter, I don't know why the feature set would have to be standard across all students. Some students will have the feature 'grade_in_organic_chemistry', others will not. The classification engine would be designed to understand that certain features are optional and likely even use that information to classify with. – Joe Holloway Jan 21 '11 at 23:06
  • +1 because you should try the straightforward approach first. Lots of null values may not be a problem - it isn't in the bag-of-words model in NLP if you use the right algorithm. SVMs work fine with sparse, high dimensional inputs. – Stompchicken Jan 24 '11 at 09:45
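The sparse-feature representation discussed in these comments can be sketched with scikit-learn's DictVectorizer (the class and its API are from current scikit-learn, not mentioned in the answer itself); the student data is invented:

```python
# Students are plain dicts with differing keys; courses a student never
# took are simply absent, and become zero entries in the resulting
# sparse matrix, which sparse-aware learners such as linear SVMs
# handle well.  Data is invented for illustration.
from sklearn.feature_extraction import DictVectorizer

students = [
    {'age': 23, 'grade_in_organic_chemistry': 1.7},   # took chemistry only
    {'age': 25, 'grade_in_classical_physics': 3.3},   # took physics only
]
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(students)      # 2 x 3 sparse matrix
print(sorted(vec.vocabulary_))       # one column per feature seen anywhere
```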

While this is probably sub-optimal compared to specialized methods, you could use an SVM with a correction for unbalanced classes, as in the following example (using the Python library scikit-learn):

http://scikit-learn.sourceforge.net/auto_examples/svm/plot_weighted_classes.html

In practice, I have had good results with fairly unbalanced classes.
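The linked example is from an older scikit-learn release; the same idea in the current API is the class_weight parameter, sketched here on synthetic, unbalanced data:

```python
# Class-weighted SVM sketch on synthetic, unbalanced data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# 90 majority-class points vs 10 minority-class points
X = np.vstack([rng.randn(90, 2), rng.randn(10, 2) + 2.0])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' reweights errors on each class inversely to its frequency,
# so the rare class (e.g. "drops out") is not simply ignored
clf = SVC(kernel='linear', class_weight='balanced').fit(X, y)
pred = clf.predict(X)
```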

Gael Varoquaux

It's hard to say without knowing more, but from the Bayesian perspective, you may be interested in the case of missing features. I'll discuss in general terms. For more, see [Duda and Hart, 2nd ed., pp. 54-55].

For any classifier, the Bayes decision rule is to choose class i which maximizes the probability of class i occurring given that the data x was observed, i.e., max P(i|x). The vector x contains features, e.g., a student's grades, age, etc.

Not all students take the same classes, so the feature vector x may have empty elements, i.e., "missing features". In that case, you must marginalize over the missing features, i.e., sum them out, and then make a decision based on the good, remaining features.

Example. Suppose a student took biology, but not chemistry:

P(student drops out | A+ in biology)
= P(student drops out, A+ in biology) / P(A+ in biology)

  P(drops out, A+ in bio, A in chem) + P(drops out, A+ in bio, B in chem) + ... + P(drops out, A+ in bio, F in chem)
= -------------------------------------------------------------------------------------------------------------------
  P(A+ in bio, A in chem) + P(A+ in bio, B in chem) + ... + P(A+ in bio, F in chem)

where the sums in the numerator and denominator each run over every possible chemistry grade.
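The marginalization can be computed directly from a joint probability table; the table below is invented for illustration:

```python
# Marginalizing over the unobserved chemistry grade.  The joint
# probabilities are invented for illustration and cover only the
# "A+ in biology" slice of the table.
joint = {
    # (drops_out, bio_grade, chem_grade): probability
    (True,  'A+', 'A'): 0.02, (False, 'A+', 'A'): 0.20,
    (True,  'A+', 'B'): 0.03, (False, 'A+', 'B'): 0.15,
    (True,  'A+', 'F'): 0.05, (False, 'A+', 'F'): 0.05,
}

def p_dropout_given_bio(bio_grade):
    """P(drops out | bio_grade), summing out the missing chem grade."""
    num = sum(p for (d, b, _), p in joint.items() if d and b == bio_grade)
    den = sum(p for (_, b, _), p in joint.items() if b == bio_grade)
    return num / den

print(round(p_dropout_given_bio('A+'), 3))  # → 0.2
```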
Steve Tjoa
  • While I think this method will work in some cases, like the example problem I gave, I think it may run into trouble if the student's course records had a large number of attributes. If that is the case then there would be a large number of null values (missing features) in each record. For example if each course had 100 attributes then we could potentially end up with thousands of null values and the curse of dimensionality may kill us. – Nixuz Jan 22 '11 at 08:25
  • Second, what about cases where n is unbounded? For example, say students can retake classes as many times as they want. It is unclear how we would handle this case, as there is no way we could establish all the possible features as there is an infinite number of them. – Nixuz Jan 22 '11 at 08:28

I envision two basic paths forward:

  1. As you call it, the "aggregate" solution, which would utilize various summaries of each student's situation: how many classes were taken, what percent of classes were introductory 101 classes, average grade, lowest quartile grade, etc.

  2. Some type of evidence accumulator, such as a naive Bayes model (as already suggested by Steve) or a fuzzy logic rule base. Such solutions naturally handle varying amounts of incoming data. I suppose this could be achieved with enough data, using one giant conventional model (neural network, etc.) and a very large set of inputs (most of which would be set to a neutral value for "missing"), but I doubt it would work as well as other options.
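The summary features in option 1 can be sketched with the standard library; the grade points and course names below are invented:

```python
# Per-student summary features for the "aggregate" path, computed with
# the standard library.  Grade points and course names are invented.
from statistics import mean, quantiles

grades = [4.0, 3.3, 3.0, 1.7, 3.7, 2.3]       # one student's grade points
course_names = ['bio101', 'chem101', 'phys201', 'math301', 'eng101', 'cs210']

features = {
    'num_classes': len(grades),
    # fraction of courses that are introductory 101 classes
    'pct_intro': sum(n.endswith('101') for n in course_names) / len(course_names),
    'mean_grade': mean(grades),
    'lowest_quartile': quantiles(grades, n=4)[0],  # 25th-percentile grade
}
print(features)
```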

Sorry, but I think the "gang of simple solutions" would be weak in this particular case. That's not to say that it wouldn't work, but I'd start somewhere else.

Predictor