Let's say I pick a random source like CNN. Would it be more advantageous to automatically sort scraped articles into categories based on keywords, or to scrape individual sections of the site for each category (e.g. cnn.com/tech or cnn.com/entertainment)? The second option doesn't scale well: I wouldn't want to manually configure URLs for every source. How does Google News address this issue?

I thought about naive bayes, but I was wondering if there are more sophisticated approaches to solve this. – TheProofIsTrivium Sep 16 '13 at 01:33
Tempted to nominate this as "too broad", but the specific reference to Google News kind of saves it. You are really asking how to solve text categorization. – tripleee Sep 16 '13 at 03:50
Huh? I dunno, not mine. – tripleee Sep 16 '13 at 05:49
It is actually an interesting question, but it won't be answered here. Google rarely shows what models they are actually using. They only publish methods at least 5 years "in the past" to ensure no one gets ahead of them. And just by looking at their results, no one can tell you exactly what they are using. We can elaborate on the generic topic, but then it is too broad for SO. – lejlot Sep 16 '13 at 08:13
How is this question any different than this popular question? http://stackoverflow.com/questions/9294926/how-does-apple-find-dates-times-and-addresses-in-emails/9344555#9344555 – Neil McGuigan Sep 16 '13 at 18:59
2 Answers
Here is a Google patent from 2005
"Systems and methods for improving the ranking of news articles"
And an update from 2012:
"Systems and Methods for Improving the Ranking of News Articles"
If you wanted to build a simple system yourself, I would do something like this:
Take a bunch of news stories that are already classified into sports/tech/whatever.
Tokenize them into individual words and n-grams (short sequences of words).
Create a really big table with unique words and grams as the columns and individual stories as the rows:
StoryId  Class   word1  word2  gram1  gram2  ...
1        sports  0      0.2    0.01   0
2        tech    0.5    0.01   0      0.3
3        sports  0      0.1    0.3    0.01
where the values in the cells represent the frequency, binary occurrence, or TF-IDF score of each word in each document.
Use a classification algorithm such as Naive Bayes or Support Vector Machines to learn the weights of the columns with respect to the class labels. This is called your model.
When you get a new, unclassified document, tokenize it the same way as before, apply the model you created earlier, and it will give you the most likely class label of the document.
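The steps above can be sketched with scikit-learn (my choice of library, not something the answer names). The toy stories, labels, and test document here are made up for illustration:

```python
# Sketch of the pipeline described above: tokenize into words and
# 2-grams, weight cells with TF-IDF, and fit a Naive Bayes model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set: stories already classified as sports/tech.
stories = [
    "the team won the championship game last night",
    "the new phone ships with a faster processor",
    "the striker scored twice in the final match",
    "the startup released an open source framework",
]
labels = ["sports", "tech", "sports", "tech"]

# The big table of words/grams vs. stories, cells holding TF-IDF scores.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(stories)

# Learn the column weights with respect to the class labels: the "model".
model = MultinomialNB()
model.fit(X, labels)

# A new, unclassified document: tokenize it the same way, apply the model.
new_doc = ["the goalkeeper saved a penalty in the match"]
print(model.predict(vectorizer.transform(new_doc))[0])
```

With real data you would want far more training documents per class; with only a handful of stories the vocabulary overlap drives the prediction.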
Here is my video series which includes a video on automatic document categorization:
http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html

There is a very small chance that Google News runs on such simple models, so this does not really address the OP's question. – lejlot Sep 16 '13 at 19:25
Not sure if the answer is still relevant now.
Check Google's NLP API. They use hierarchical classification with close to 800 classes.
Here is a list of categories they support.
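Hierarchical classification of the kind the API uses (labels like "/Sports/Soccer") can be illustrated locally with a two-level scheme: first predict a top-level class, then pick a subcategory with a classifier trained only on that class. This is a sketch of the general idea, not Google's implementation; the corpus and label paths are invented:

```python
# Two-level hierarchical classification sketch with toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "striker scored a goal in the soccer match",
    "pitcher threw a fastball in the baseball game",
    "new smartphone camera and battery review",
    "laptop processor benchmark results published",
]
paths = ["/Sports/Soccer", "/Sports/Baseball", "/Tech/Phones", "/Tech/Computers"]
top = [p.split("/")[1] for p in paths]  # top-level labels: Sports, Tech

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Level 1: one classifier over the top-level categories.
level1 = MultinomialNB().fit(X, top)

# Level 2: one classifier per top-level category, trained on its subset.
level2 = {}
for cat in set(top):
    idx = [i for i, t in enumerate(top) if t == cat]
    level2[cat] = MultinomialNB().fit(X[idx], [paths[i] for i in idx])

def classify(text):
    x = vec.transform([text])
    parent = level1.predict(x)[0]          # e.g. "Sports"
    return level2[parent].predict(x)[0]    # e.g. "/Sports/Soccer"

print(classify("the goal came late in the soccer match"))
```

Splitting the problem this way keeps each classifier small, which is one reason hierarchical schemes scale to hundreds of classes.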
