0

I've created an algorithm which weighs the relevance of a list of articles against two lists of keywords that correlate to attributes of the article.

It works great and is super efficient... but it's a mess. It's not terribly readable, so it's difficult to discern what's going.

The operation in pseudo code goes something like this:

  • Loop through every article in a list called articles(List<Article>)
  • For every article, loop through every role in a list of roles (List<string>)
  • Check to see if the current article has any roles (Article.Roles = List<string>)
  • If yes, then loop through each role in the article and try to match a role in the article to the role in the current loop
  • If a match is found, add weight to the article. If the index of the role on the article and the role in the roles list are both index 0 (in primary position) add extra weight for two matching primaries
  • Repeat for topics, but with no bonus for primary matches

What would be a better way to write the following code? I can't use foreach except in one or two places, because I need to match indexes to know what value to add on a match.

private static List<Article> WeighArticles(List<Article> articles, List<string> roles, List<string> topics, List<string> industries)
{
    var returnList = new List<Article>();
    for (int currentArticle = 0; currentArticle < articles.Count; currentArticle++)
    {
        for (int currentRole = 0; currentRole < roles.Count; currentRole++)
        {
            if (articles[currentArticle].Roles != null && articles[currentArticle].Roles.Count > 0)
            {
                for (int currentArticleRole = 0; currentArticleRole < articles[currentArticle].Roles.Count; currentArticleRole++)
                {
                    if (articles[currentArticle].Roles[currentArticleRole].ToLower() == roles[currentRole].ToLower())
                    {
                        if (currentArticleRole == 0 && currentRole == 0)
                            articles[currentArticle].Weight += 3;
                        else
                            articles[currentArticle].Weight += 1;
                    }
                }
            }
        }
        for (int currentTopic = 0; currentTopic < topics.Count; currentTopic++)
        {
            if (articles[currentArticle].Topics != null && articles[currentArticle].Topics.Count > 0)
            {
                for (int currentArticleTopic = 0; currentArticleTopic < articles[currentArticle].Topics.Count; currentArticleTopic++)
                {
                    if (articles[currentArticle].Topics[currentArticleTopic].ToLower() == topics[currentTopic].ToLower())
                    {
                        articles[currentArticle].Weight += 0.8;
                    }
                }
            }
        }
        returnList.Add(articles[currentArticle]);
    }

    return returnList;
}

//Article Class stub (unused properties left out)
public class Article
{
    public List<string> Roles { get; set; }
    public List<string> Topics { get; set; }
    public double Weight { get; set; }
}
Wesley
  • 5,381
  • 9
  • 42
  • 65
  • 5
    http://codereview.stackexchange.com/ – Sergey Berezovskiy Aug 22 '14 at 21:48
  • I'm just guessing here: This pattern looks like you need recursion – Linial Aug 22 '14 at 21:49
  • 2
    @Linial: not recursion but refactoring – Tim Schmelter Aug 22 '14 at 21:53
  • Remove all that and use proper LINQ. C# is not java. – Federico Berasategui Aug 22 '14 at 21:55
  • @HighCore I'm decent at refactoring `foreach` loops into c#, but I can't find a way to get the indexes in Linq. – Wesley Aug 22 '14 at 21:56
  • @Wesley forget indexes. You're dealing with data (as in higher-level OOP concept) not some sort of array (as in lower-level C-like memory stuff). You don't need indexes or anything like that. You need to write proper functions that operate on your data. – Federico Berasategui Aug 22 '14 at 21:58
  • @HighCore But without indexes, I can't tell when I'm dealing with `Article.Roles[0]` and `Roles[0]` which is a major problem, seeing as the Role in first position for both the `Article.Roles` and `Roles` have more importance than subsequent matches down the list. – Wesley Aug 22 '14 at 22:02
  • @Wesley then use `.First()` or something like that dude... forget indexes. I'm still trying to figure out what the hell your code really does so I can write a proper answer... – Federico Berasategui Aug 22 '14 at 22:03
  • @HighCore I'll take a swing with first() as well, then, but I'm assuming someone on here can do this in half as many lines as I can, which is why I asked for help :) – Wesley Aug 22 '14 at 22:06

3 Answers3

2

Okay, you have several design flaws in your code:

1 - It's too procedural. You need to learn to think to write code to tell the machine "what you want" as opposed to "how to do it", similar to the analogy of going to a bar and instructing the bartender about the exact proportions of everything instead of just asking for a drink.

2 - Collections Should NEVER be null. Which means that checking for articles[x].Roles != null makes no sense at all.

3 - iterating on a List<string> and comparing each with someOtherString makes no sense either. Use List<T>.Contains() instead.

4 - You're grabbing each and every one of the items in the input list and outputting them in a new list. Also nonsense. Either return the input list directly or create a new list by using inputList.ToList()

All in all, here's a more idiomatic C# way of writing that code:

private static List<Article> WeighArticles(List<Article> articles, List<string> roles, List<string> topics, List<string> industries)
{
    var firstRole = roles.FirstOrDefault();

    var firstArticle = articles.FirstOrDefault();

    var firstArticleRole = firstArticle.Roles.FirstOrDefault();

    if (firstArticleRole != null && firstRole != null && 
        firstRole.ToLower() == firstArticleRole.ToLower())
        firstArticle.Weight += 3;

    var remaining = from a in articles.Skip(1)
                    from r in roles.Skip(1)
                    from ar in a.Roles.Skip(1)
                    where ar.ToLower() == r.ToLower()
                    select a;

    foreach (var article in remaining)
        article.Weight += 1;

    var hastopics = from a in articles
                    from t in topics
                    from at in a.Topics
                    where at.ToLower() == t.ToLower()
                    select a;

    foreach (var article in hastopics)
        article.Weight += .8;

    return articles;
} 

There are even better ways to write this, such as using .Take(1) instead of .FirstOrDefault()

Community
  • 1
  • 1
Federico Berasategui
  • 43,562
  • 11
  • 100
  • 154
2

If you'll examine your code, you'll find that you are asking Article class to many times for data. Use Tell, Don't Ask principle and move weight adding logic to Article class, where it should belong. That will increase cohesion of Article, and make your original code much more readable. Here is how your original code will look like:

 foreach(var article in articles)
 {
     article.AddWeights(roles);
     article.AddWeights(topics);
 }

And Article will look like:

 public double Weight { get; private set; } // probably you don't need setter

 public void AddWeights(IEnumerable<Role> roles)
 {
     const double RoleWeight = 1;
     const double PrimaryRoleWeight = 3;

     if (!roles.Any())
        return;

     if (Roles == null || !Roles.Any())
         return;

     var pirmaryRole = roles.First();
     var comparison = StringComparison.CurrentCultureIgnoreCase;

     if (String.Equals(Roles[0], primaryRole, comparison))
     {
         Weight += PrimaryRoleWeight;
         return;
     }

     foreach(var role in roles)         
        if (Roles.Contains(role, StringComparer.CurrentCultureIgnoreCase))
            Weight += RoleWeight;
 } 

Adding topics weights:

 public void AddWeights(IEnumerable<Topic> topics)
 {
     const double TopicWeight = 0.8;

     if (Topics == null || !Topics.Any() || !topics.Any())
        return;

     foreach(var topic in topics)         
        if (Topics.Contains(topic, StringComparer.CurrentCultureIgnoreCase))
            Weight += TopicWeight;
 }
Sergey Berezovskiy
  • 232,247
  • 41
  • 429
  • 459
1

Use the Extract Method refactoring on each for loop and give it a semantic name WeightArticlesForRole, WeightArticlesForTopic, etc. this will eliminate the nested loops(they are still there but via function call passing in a list).

It will also make your code self documenting and more readable, as now you have boiled a loop down to a named method that reflects what it accomplishes. those reading your code will be most interested in understanding what it accomplishes first before trying to understand how it accomplishes it. Semantic/conceptual function names will facilitate this. They can use GoTo Definition to determine the how after they udnerstand the what. Provide the summary tag comment for each method with elaborated explanation(similar to your pseudo code) and now others can wrap their head around what your code is doing without having to tediously read code they aren't concerned with the implementation details of.

The refactored methods will likely have some dirty looking parameters, but they will be private methods so I generally don't worry about this. However, sometimes it helps me see what dependencies are there that should probably be removed and restructure the code in the call such that it can be reused from multiple places. I suspect with some params for the weighting and delegate functions you might be able to combine WeightArticlesForRole and WeightArticlesForTopic into a single function to be reused in both places.

AaronLS
  • 37,329
  • 20
  • 143
  • 202