Written correctly, it is:
Information-gain = entropy-before-split - average entropy-after-split (each branch weighted by its share of the records)
The difference between entropy and information is the sign: entropy is high when you do not have much information about the data.
The intuition comes from statistical information theory. The rough idea is: how many bits per record do you need to encode the class label assignment? If you have only one class left, you need 0 bits per record. If you have a chaotic data set (say, two classes that are equally likely), you need 1 bit for every record. If the classes are unbalanced, you can get away with less than that by using a (theoretical!) optimal compression scheme, e.g. by encoding only the exceptions. To match this intuition, you should use the base-2 logarithm, of course.
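To make the "bits per record" intuition concrete, here is a minimal sketch in Python. The function name `entropy` and the toy label lists are my own choices for illustration; they are not taken from any particular library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits per record."""
    n = len(labels)
    counts = Counter(labels)
    # Sum p * log2(1/p) over the classes that actually occur.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

pure = ["a"] * 8                    # only one class left
balanced = ["a"] * 4 + ["b"] * 4    # two equally likely classes
unbalanced = ["a"] * 7 + ["b"] * 1  # mostly one class, one exception

print(entropy(pure))        # 0 bits per record
print(entropy(balanced))    # 1 bit per record
print(entropy(unbalanced))  # roughly 0.54 bits per record
```

The unbalanced case lands between 0 and 1 bit, which matches the idea that you could encode only the exceptions.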
A split is considered good if the branches have lower entropy on average afterwards. You have then gained information about the class label by splitting the data set. The IG value is the average number of bits of information you gained for predicting the class label.
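As a rough sketch of how this could be computed, the following Python snippet evaluates the information gain of a few hand-made splits. The names `information_gain`, `parent`, `perfect`, `useless` and `partial` are hypothetical, and the "average" is taken as the branch entropies weighted by branch size.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits per record."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    after = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - after

parent = ["yes"] * 4 + ["no"] * 4

# Each split partitions the 8 parent records into two branches.
perfect = [["yes"] * 4, ["no"] * 4]                        # pure branches
useless = [["yes", "yes", "no", "no"]] * 2                 # same mix as the parent
partial = [["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3]     # somewhat cleaner branches

print(information_gain(parent, perfect))  # 1 bit gained
print(information_gain(parent, useless))  # 0 bits gained
print(information_gain(parent, partial))  # roughly 0.19 bits gained
```

A perfect split recovers the full 1 bit per record, a split that leaves the class mix unchanged gains nothing, and anything in between gains a fraction of a bit.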