Monday, August 20, 2012

Use case for mutable objects in Clojure

I'm currently implementing Aho-Corasic dictionary indexing algorithm. For that I have to write an augmented trie, where each node can point to another branch of the trie. Such links are called failure links. For example, let's say a trie contains two strings: "abcd" and "fgabch". Here letters in the second string (starting from "a") will have failure links to corresponding letters of the first word.

Here's my ASCII "art":

a-b-c-d
^_____
      |
f-g-a-b-c-h

a-b-c-d
  ^___
      |
f-g-a-b-c-h

a-b-c-d
    ^___
        |
f-g-a-b-c-h

This is needed to guarantee matching in time proportional to the length of the text. If the text is "fgabcd" then matching algorithm will match all letters in "fgabc" and will fail to match last "d", then it will use failure link fgabc -> abc and continue matching from there. As the result it will report matching word with start and end indices [2,5].

Without failure links algorithm would have to restart matching from "g" finding nothing, then again from "a" where it would succeed. This property of never going back is crucial in complexity analysis and gives so much desired O(n) execution time.

Now, I have tried hard to implement this trie using immutable data structures. Following description from the book I need to first construct simple trie with all the words in it and then traverse it starting from root in breadth-first order to compute failure links.

Problem started to emerge when I realized that failure link may point to its parent node, for example when indexing string "aaa", second letter "a" would have failure link to the first letter "a", its parent. As it is impossible to create cycles using constructor-based initialization, immutable collections initialized only in constructors wouldn't work.

I know about delay primitive and how it can be used to create cycles. Korma library makes a very nice use of it. But in my case it is prohibitively expensive (citation needed). This algorithm may be used to index thousands of strings, it is not practical to create tens of thousands of delay objects.

To be clear, I'm not trying to argue against immutable data structures. It's just that sometimes we have to venture into the realm of mutability to achieve efficient code. And I think transients in Clojure are some sort of proof of that.

No comments: