Skip Navigation Linkswww.mathertel.de --> An O(ND) Difference Algorithm for C# --> Diff Documentation

An O(ND) Difference Algorithm for C#

This article is about comparing text files and the proven, best and most famous algorythm to identify the differences between them. The source code that you can find in the download implements a small class with a simple to use API that just does this job. You should have it in the bag of your algorythms.

Beside the class that implements the algorythm there is also a sample web application that compares 2 files and generates html output with a combined and colored document.

The algorythm was first published 20 Years ago under "An O(ND) Difference Algorithm and its Variations" by Eugene Myers Algorithmica Vol. 1 No. 2, 1986, p 251 . You can find a copy if it at http://xmailserver.org/diff2.pdf.  In this article you can find a abstract recursive definition of the algorythm using some pseudo-code that needs to be transferred to a existing programming language.

There are many C, Java, Lisp implementations public available of this algorythm out there on the internet. Before I wrote the C# version I discovered that almost all these implementations seem to come from the same source (GNU diffutils) that is only available under the (unfree) GNU public license and therefore cannot be reused as a source code for a commercial or redistributable application without being trapped by the GNU license.

There are very old C implementations that use other (worse) heuristic algorithms. Microsoft also published source code of a diff-tool (windiff) that uses some tree structures. Also, a direct transfer from a C source to C# is not easy because there is a lot of pointer arithmetic in the typical C solutions and I wanted a managed solution. I tried a lot sources but at least found no usable solution written for the .NET platform.

These are the reasons why I implemented the original published algorithm from the scratch and made it available without the GNU license limitations under a BSD style license. The history of this implementation is back to 2002 when I published a Visual Studio add-in that also can compare files, see http://www.codeproject.com/KB/macros/WebReports8.aspx. I found no more bugs in the last 3 years so I think that the code is stable.

I did not need a high performance diff tool. I will do some performance tweaking when needed, so please let me know. I also dropped some hints in the source code on that topic.

How it works (briefely)

You can find a online working version at http://www.mathertel.de/.

The API

To use this functionality you only have to call one of the DiffText methods. They all get a pair of strings and return an array of items that describe the inserts and deletes between the 2 strings. There are no "changes" reported. Instead you can find a "insert" and "deleted" pair.

DiffText(string TextA, string TextB)

Find the difference in 2 texts, comparing by textlines without any conversion. A array of Items containing the differences is returned.

DiffText(string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase)

Find the difference in 2 texts, comparing by textlines with some optional conversions. A array of Items containing the differences is returned.

Diff(int[] ArrayA, int[] ArrayB)

Find the difference in 2 arrays of integers. A array of Items containing the differences is returned.

A Sample application for the algorithm

The sample is a small web application that calculates the difference of 2 text-file and displays the result using HTML.

To setup the sample you have to create a web-project and copy all files found in the zip file into the directory. The implentation of the algorythm is inside the "diff.cs" class.

Further possible optimizations (if you really need speed)

(first rule: don't do it; second: don't do it yet)

The arrays DataA and DataB are passed as parameters, but are never changed after the creation so they can be members of the class to avoid the parameter overhead.

In SMS is a lot of boundary arithmetic in the for-D and for-k loops that can be done by increment and decrement of local variables.

The DownVector and UpVector arrays are always created and destroyed each time the SMS gets called.

It is possible to reuse them when transferring them to members of the class.

See TODO: hints.

Security issues with the web application:

Changes:

This work was first published at http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid= 96065252-BE1D-4E2F-B7CB-3695FEB0D2C3.

2002.09.20

There was a "hang" in some situations.

Now I undestand a little bit more of the SMS algorithm.

There have been overlapping boxes; that where analyzed partial differently. One return-point is enough.

A assertion was added in CreateDiffs when in debug-mode, that counts the number of equal (no modified) lines in both arrays. They must be identical.

2003.02.07

Out of bounds error in the Up/Down vector arrays in some situations.

The two vectors are now accessed using different offsets that are adjusted using the start k-Line.

A test case is added.

2003.04.09

Another test that throwed an exception was found, but already seems to be fixed by the 2002.09.20 work.

2006.03.10

Refactored the API to static methods on the Diff class to make usage simpler. The Self test is now using the standard Debug class.

License

This work is licensed under a BSD style license.

Built-in Self-TEST

The class has now a built-in self-test, that's working without changing the code. If you compile the diff and debug class files together you get a exe file that's testing some simple diff-scenarios that where used to discover the bugs in the Version 1 and 2 (OutOfArrayBounds typically).

The compile command is: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\csc /target:exe /out:diffTest.exe /d:DEBUG /d:TRACE /d:SELFTEST Diff.cs

Thanks to all the feedback and the 2 cases that did not diff correctly so I could provide you quality. (It was always my fault, not a problem of the algorythm).

Matthias Hertel, http://www.mathertel.de.