Sunday, January 15, 2012

Autocorrecting unknown actions using the Levenshtein distance

This weekend I prototyped an idea I had earlier this week: autocorrecting unknown actions in ASP.NET MVC.

Handling unknown actions

To give you an example, let's say I have a Home controller with an action named Kitten on it. If there is an incoming route for the Home controller with Kitty (instead of Kitten) as the action name, the controller will not be able to invoke any action method and instead will call the HandleUnknownAction method.

Here is the snippet from the ASP.NET MVC source.
protected override void ExecuteCore() {
    PossiblyLoadTempData();
    try {
        string actionName = RouteData.GetRequiredString("action");
        if (!ActionInvoker.InvokeAction(ControllerContext, actionName)) {
            HandleUnknownAction(actionName);
        }
    }
    finally {
        PossiblySaveTempData();
    }
}
The HandleUnknownAction is virtual, meaning we can override it in our derived controller. The base implementation of the HandleUnknownAction method does nothing more than throwing a 404 HttpException.
protected virtual void HandleUnknownAction(string actionName) {
    throw new HttpException(404, String.Format(CultureInfo.CurrentCulture,
        MvcResources.Controller_UnknownAction, actionName, GetType().FullName));
}
So let's override the HandleUnknownAction method and try to autocorrect the unknown action name. To be safe, we will only attempt to autocorrect the action name when it's a GET HTTP request.
protected override void HandleUnknownAction(string actionName)
{
    if (!HttpContext.Request.HttpMethod.Equals("GET", StringComparison.OrdinalIgnoreCase))
        Throw404HttpException(actionName);
    
    TryToRedirectToAnActionNearby(actionName);           
}
Listing all actions

First we need a list of all available action names. I reflect on the methods of the current controller and select the methods which are public, can be invoked and are an instance method. Also the method should return an ActionResult, not be decorated with the HttpPost attribute and not have a special name. I'm pretty sure I'm missing a few things here, but there seems to be no generic way to extract this metadata from a controller. Places where these type of things are used in the framework seem to be internal or non-public.
private IEnumerable<string> GetAllHttpGetActionNames()
{
    return GetType()
            .GetMethods(BindingFlags.InvokeMethod | 
                        BindingFlags.Public | 
                        BindingFlags.Instance)
            .Where(m => m.ReturnType == typeof(ActionResult) &&
                        !m.IsSpecialName &&
                        !m.GetCustomAttributes(true)
                            .Contains(typeof(HttpPostAttribute)))
            .Select(m => m.Name)
            .Distinct();
}
Once we have all these action names, we want to see how distant they are from the unknown action name we are trying to autocorrect here. To calculate this we can use the Levenshtein distance algorithm.

The Levenshtein distance

The Levenshtein distance is defined by Wikipedia like this.
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
An implementation of this algorithm in C# could look like this.
public static int CalculateDistance(string str1, string str2) 
{
    var matrix = new int[str1.Length + 1, str2.Length + 1];

    for (var i = 0; i <= str1.Length; i++)
        matrix[i, 0] = i;
    for (var j = 0; j <= str2.Length; j++)
        matrix[0, j] = j;

    for (var i = 1; i <= str1.Length; i++)
    {
        for (var j = 1; j <= str2.Length; j++)
        {
            var cost = str1[i - 1] == str2[j - 1] ? 0 : 1;

            matrix[i, j] = (new[]
            {
                matrix[i - 1, j] + 1, matrix[i, j - 1] + 1, matrix[i - 1, j - 1] + cost
            }).Min();

            if ((i > 1) && 
                (j > 1) && 
                (str1[i - 1] == str2[j - 2]) &&
                (str1[i - 2] == str2[j - 1]))
            {
                matrix[i, j] = Math.Min(matrix[i, j], matrix[i - 2, j - 2] + cost);
            }
        }
    }

    return matrix[str1.Length, str2.Length];
}        
This is a direct port from the pseudocode found on Wikipedia. These tests might, probably a lot more than the implementation, help you understand what the Levenshtein algorithm calculates.
[TestMethod()]
public void Test_CalculateDistance_With_Two_Empty_String()
{           
    Assert.AreEqual(0, Levenshtein.CalculateDistance(string.Empty, string.Empty));
}   

[TestMethod()]
public void Test_CalculateDistance_With_Empty_First_String()
{         
    Assert.AreEqual(6, Levenshtein.CalculateDistance(string.Empty, "kitten"));
}

[TestMethod()]
public void Test_CalculateDistance_With_Empty_Second_String()
{           
    Assert.AreEqual(6, Levenshtein.CalculateDistance("kitten", string.Empty));
}

[TestMethod()]
public void Test_CalculateDistance_With_Missing_Characters()
{           
    Assert.AreEqual(2, Levenshtein.CalculateDistance("kitten", "kitt"));
}

[TestMethod()]
public void Test_CalculateDistance_With_Wrong_Characters()
{           
    Assert.AreEqual(1, Levenshtein.CalculateDistance("kitten", "kittyn"));
}

[TestMethod()]
public void Test_CalculateDistance_With_Too_Much_Characters()
{           
    Assert.AreEqual(5, Levenshtein.CalculateDistance("kitten", "kittenkitty"));
}

[TestMethod()]
public void Test_CalculateDistance_With_Equal_Strings()
{          
    Assert.AreEqual(0, Levenshtein.CalculateDistance("kitten", "kitten"));
}
Now that we have implemented the Levenshtein distance algorithm, we can calculate the distances between the unknown action name and all the available action names.
private Dictionary<string, int> CalculateLevenshteinDistance(IEnumerable<string> actionList, string actionName)
{
    return actionList
            .Select(a => new
            {
                Action = a.ToLower(),
                Distance = Levenshtein.CalculateDistance(a.ToLower(), actionName.ToLower())
            })                    
            .ToDictionary(k => k.Action, v => v.Distance);
}
For the unknown action name 'Kitty', when the action names 'Kitten', 'Index' and 'Dog' are available, this method would return a dictionary that looks like this.
'kitten' : 2
'index' : 6
'dog' : 6
Putting it all together

Now we have this dictionary, we want to filter on a certain distance threshold. I picked three, given that when a word is three characters off, the chance of it being a typo is rather small.

If the dictionary still contains some items after filtering, we want to take the action with the shortest distance, this action is the nearest to the unknown action. Only thing left to do is change the action in the RouteData and execute a RedirectResult. The easiest way to generate a url to redirect to, is to use the controller's UrlHelper to let it generate the url based on the RouteData.
private void TryToRedirectToAnActionNearby(string actionName)
{
    var httpGetActionNames = GetAllHttpGetActionNames();
    if (!httpGetActionNames.Any())
        Throw404HttpException(actionName);

    var actionDistanceMap = CalculateLevenshteinDistance(httpGetActionNames, actionName)
                                .Where(i => i.Value <= 3);
    if (!actionDistanceMap.Any())
        Throw404HttpException(actionName);

    var shortestDistance = actionDistanceMap.Select(v => v.Value).Min();
    var nearestAction = actionDistanceMap.Where(i => i.Value == shortestDistance).First().Key;

    ControllerContext.RouteData.Values["action"] = nearestAction;

    new RedirectResult(Url.RouteUrl(RouteData.Values), permanent: true)
        .ExecuteResult(ControllerContext);
}
The outcome

Now when I, the user, type http://somesite/someController/kitty, I will be redirected to http://somesite/someController/kitten without me even noticing.



Feedback

This implementation definitely is not production ready. It's a prototype, not even under test. I wonder if this  is even something you would want to do. Or is this breaking the Web in one way or the other? Would it bother search engines?

15 comments:

  1. As long as you use redirects, it does not break the web and should not bother search engines. Great idea under some circumstances.

    ReplyDelete
  2. This is really awesome and clever, I love it. I probably wouldn't do it, since it introduces a lot of ambiguity on the user's side as to what's correct and what isn't, and it can lead to a lot of complications and conflicts, but it's awesome nonetheless.

    ReplyDelete
  3. I would implement it as a 404 page but show one or more "did you mean" links. That way users will know it is incorrect but have a convenient way to get where they intended.

    ReplyDelete
  4. Very interesting idea!

    I was going to post something very clever, but Ken Snyder beat me to it :-)

    ReplyDelete
  5. Thanks for the feedback!

    @Ken, Kristof: Yep, I thought of that as well. I think that implementation might even be more expected 'web behavior'.

    ReplyDelete
  6. This is a clever way to cover up an accidental fault. However, I think you should show a 404 page and show the top ranking "did you mean" links. That way users know it's a bad URL but have an easy way to move forward to the right place.
    Auto forwarding to the top scoring link might not be the right one in all cases, and covers up user faults too.

    ReplyDelete
  7. I have used the Levenstein algorithm many times before (in conjunction with SOUNDEX) and as far as I can remember it's really heavy - your implementation of the algorithm seems easy though!

    ReplyDelete
  8. I'm with Ken and Simon on this. The implementation you show is pretty snazzy, but I'd rather use the Levenshtein distance to show "did you mean" links on the 404 page.

    ReplyDelete
  9. This is a very clever idea. As others have already said it's better to use the links for showing them on the 404 page.

    ReplyDelete
  10. This comment has been removed by the author.

    ReplyDelete
  11. Thank you so much. i had a requirement a little while back to find similar file name. I have just tested this and it is working nicely.

    ReplyDelete
  12. This comment has been removed by a blog administrator.

    ReplyDelete
  13. One big reason you don't want to autocorrect is that HTTP requests are used by machines as well as people. If you autocorrect, you will suddenly find yourself on the evil side of Postel's law, because you are now accepting (and therefore encouraging) something which is actually incorrect.

    So, when you add another page/resource/whatever, and your autocorrection changes for some particular mispelling, everyone who was relying on it will now have breakage on their hands - a breakage they would have detected earlier if you had returned a 404. So you will end up hard coding support for your old corrections to not break things for your clients...

    This is one case where it is much better to fail early. There are countless other examples: autodetection of web page encoding, 'helpfully' added to browsers years ago to help lazy web authors, but which can never work perfectly and which we are stuck with forever - since some browsers implemented this, all the others have to in order to compete and now there are billions of web pages that don't have an encoding declared, and require autodetection.

    The Python maxim helps here: "In the face of ambiguity, refuse the temptation to guess".

    ReplyDelete
  14. Use the Demereau (sp?) Levenshtein distance alogrithm. You'll get better results.

    ReplyDelete
  15. Thanks for sharing your info. I really appreciate your efforts and I will be waiting for your further write ups thanks once again.

    ReplyDelete