Sunday, August 31, 2008

Collection sorting – when the built-in stuff doesn’t work

C# and the the .Net framework has a lot of excellent support for collections – no more mucking about creating your own linked list structures, hash tables etc. This means you get easy to use lists of your stuff almost for free, saving you a lot of testing/debugging whenever you make a brain fart during tedious boilerplate coding.

The network also has a lot in place for sorting these collections – the Sort() function allows you to plug in your own comparer and will then perform a relatively performant sort for you (QuickSort? I dunno). Isn’t it nice that we can skip this stuff and concentrate on getting the logic unique to our application right?

Well, yes. However, there are cases where the default sort does not work. I recently encountered one of those in one of my applications.

This application keeps a bunch of tags. You know, the del.icio.us kind, that you can use to label your data in different ways. We can mark certain products with a tag, and then, when we search for the tag, we’ll find all producst tagged by it. Simple stuff.

One particular feature here is that each tag keeps a list of “implied tags” – i.e. if you add this tag, you also implicitly have these tags as well. An example will hopefully clarify;

If you have the products ham, bacon and sirloin, you might want a tag for the first two named pork, and a tag for the last named beef. In addition, you would typically have a tag named meat that is implied both by pork and beef. That way, if you search for products that contain meat, you will find any products that contain ham, bacon or sirloin.

This is a simple, contrived example, but it should illustrate how implied tags work. It also leads to some limitations on the implementation of this system. Data in this case is to be stored in an XML file, and will be fetched in such a way that when a tag references (implies) another tag, the implied tag will already have to be defined; I.e. an implied tag has to come before the tag doing the implying in the XML file.

In my case, recursive tag implication is not allowed. Tag A can not imply tag B which in turn implies tag A, or any deeper variation of that. Therefore, the solution is simple; In the XML file, we need to make sure that tags that don’t imply anything else come first (there will always be at least one of these, because of the limitation on recursion). Then come the tags that only imply one or more of these first tags, and then in turn the tags that imply the tags we already have. Basically, a tag can only be added to the file once all tags it implies are already in the file.

My first attempt to take care of this was a relatively boneheaded one in retrospect; I implemented a comparer, implementing IComparer<Tag> to figure out the sort order, based on wether or not tag X implies tag Y, or Y implies X (and if they both implied each other, I threw – that’s tag implication recursion right there). Now, in theory this would work, but unfortunately the List.Sort() method was a little to clever for me. The result of one single comparison in my case only says something about the relation between these two elements. However, for the Sort() method, if A comes before B, and B comes before C, then we don’t even bother comparing A and C – we’ve already worked that part out implicitly. It doesn’t take a lot of imagination to come up with a situation where a tag list wouldn’t be sorted in the right order with this kind of sorting. And a quick unit test or two easily demonstrated that this would not handle all cases:

[Test]
public void TestImplicationSortingCreatesCorrectOrder()
{
var t1 = new Tag("Level 1");
var t2 = new Tag("Level 2");
var t3 = new Tag("Level 3");

t2.ImpliedTags.Add(t1);
t3.ImpliedTags.Add(t2);

var list = new List<Tag>() { t3, t2, t1 };
var newlist = Tag.SortTagListByImplication(list);

Assert.AreEqual(t1, newlist[0]);
Assert.AreEqual(t2, newlist[1]);
Assert.AreEqual(t3, newlist[2]);
}


This test would easily fail using the built-in sorting. Tag.SortTagListByImplication( IEnumerable<Tag>) was just a static wrapper that created the correct comparer and performed the sort.

To get this to work properly, however, I had to implement the sorting myself, so I chose to do it in the same method:


public static List<Tag> SortTagListByImplication(IEnumerable<Tag> tags)
{
if (tags == null)
throw new ArgumentNullException();

if (tags.Count() == 0)
return new List<Tag>(tags);

var newlist = new List<Tag>();
var sourceitemlist = new List<Tag>(tags);

do
{
var newbatch = new List<Tag>();
foreach (var t in sourceitemlist)
{
var hasAnyUnresolvedImplications = false;

foreach (var it in t.ImpliserteTags)
{
if (!newlist.Contains(it))
hasAnyUnresolvedImplications = true;
}
if (!hasAnyUnresolvedImplications)
newbatch.Add(t);
}
if (newbatch.Count == 0)
throw new InvalidOperationException("No new Tag items added to list - recursion nightmare!");
newlist.AddRange(newbatch);
foreach (Tag t in newbatch)
sourceitemlist.Remove(t);

}
while (sourceitemlist.Count > 0);

return newlist;
}
This code is very much a “first pass” – i.e. I stopped working on it once the unit test started working. Yes, I am quite sure it can be improved. However, right now, it does what I need it to do, so I will leave it alone.

The point of this post? I just wanted to mention that there are situations where the built in sorting logic is not proper. If it helps someone in the future, great. :)

No comments: