#.think.in
learn.create.enjoy

SQL Sever 2005 Full Text Indexing

November 14, 2008 09:31 by tarn

imageTill now I've always had problems getting SQL Server 2005 Full Text Indexing (FTI) to perform well in real world scenarios. Recently I found it was because I wasn't implementing it correctly, so I'll post this tip which will hopefully help myself and others get it right next time.

The query is a user entered keyword search on text columns using the FTI predicates and T-SQL predicates on other relational columns. The query is over a moderately large number of and returns paged results.

FTI is basically the ability to search from text columns where the text has been indexed with natural language, relevance and ranking considerations built in. XML and other data type can also be index natively, but I haven't used that. FTI in SQL Server 2005 uses the Microsoft Search Service. This service is not actually part of SQL Server itself, it is also used by Exchange and Sharepoint. I think you can use the service directly with a Sharepoint SDK.

To use Full Text Indexing in SQL server you must enable full text indexing to the entire database and the table and columns containing the searchable text. It can all be setup from SQL Management Studio, but it does create a new physical folder in addition to the normal database MDF and LDF files. Once this has been done additional FTI predicates can be used on the selected columns.

As SQL Server Express doesn't support FTI, I think it is worth considering optionally supporting normal T-SQL similar to applications support querying using the FTI and  functions. Its nice to be able develop or even deploy with SQL Server Express even though the text searching is not indexed using the Search Services. But this really depends on circumstances, I just wanted to note supporting both is an option. 

I found a query similar to this example was taking ages over a large amount of data. It was unacceptable, we had to render all pages in less than 4 seconds under moderate load and we expected to get less than two seconds. We we getting search times alone in double figures for some keywords.

SELECT *
FROM
(
      SELECT *, ROW_NUMBER() OVER ([DocumentInformation].PublishDate ASC) as RowNumber,             
      FROM Document
      JOIN DocumentInformation ON DocumentInformation.DocumentId = Document.DocumentId
      JOIN UserAccount Owner ON Document.OwnerId = UserAccount.UserId
      JOIN UserAccount Publisher ON Document.PublisherId = Publisher.UserId
      WHERE [Document].Published = 1 AND     
            [Owner].IsActive = 1 AND      
            [Publisher].IsActive = 1 AND
            CONTAINS([Document].DocumentText, "FTI*" )
) x
WHERE RowNumber BETWEEN 1 AND 10
ORDER BY RowNumber)

 

Below has an additional inner query on the document table alone, using just the CONTAINS predicates on the document text. The additional tables are joined to the results and the normal T-SQL predicates are added. 

SELECT *
FROM
(
    SELECT *, ROW_NUMBER() OVER ([DocumentInformation].PublishDate ASC) as RowNumber,        
    FROM 
        (SELECT *, DocumentId
        FROM dbo.Document
        WHERE CONTAINS(([Documemnt].ProductKeywords, "FTI*" )) p
        JOIN DocumentInformation ON [DocumentInformation].DocumentId = p.DocumentId
        JOIN UserAccount Owner ON Document.OwnerId = UserAccount.UserId         
        JOIN UserAccount Publisher ON Document.PublisherId = Publisher.UserId 
        WHERE [Document].Published = 1 AND 
              [Owner].IsActive = 1 AND 
              [Publisher].IsActive = 1 
) x
WHERE RowNumber BETWEEN 1 AND 10
ORDER BY RowNumber

 

I gave this a go after reading an FTI Best Practice article that indicated the searchable text rows couldn't be excluded from a search due to the search service running in a separate process. The result of this subtle change was amazing, the searches were obviously much quicker and returned the same result set. Under load test we found this change alone improved the average execution time of all pages on the website by a massive 30%.

In addition to this the FTI best practices document also recommends the following optimizations to improve full text performance.

We ended up implementing all the recommendations above to try and get the most out of the FTI service. After all this we got search times down to about a second under light load, this was cool but we had grander plans for an in memory search using of expression trees for search filters and LINQ for querying the in memory objects. I hope to chronicle this in future posts as initial performance testing shows much quicker search times and a significant capacity load increase by removing searching from SQL Server which was previously our performance bottleneck.

In SQL Server 2008 they have apparently integrated  FTI into the SQL Server process which appears to be the cause of my original performance problem. But so far I've only heard bad things, hopefully these issues will be cleaned up before I start using it.

2005 Full-Text Search Architecture

2008 Full-Text Search Architecture


Tags:
Categories:
Comments (0)

Liberation Day - Power To Developers

November 13, 2008 22:40 by tarn

image I'm not in anyway formally associated Microsoft other than that I've been primarily using their tools to develop software for the over the last few years, so I was a little excited I was invited to Liberation Day in Sydney after placing 2nd in DevSta. I thought it would be great to see Steve Ballmer, infamous for this famous clip and this. He is, of course, also the Microsoft CEO and was going to announce Azure the new Microsoft cloud computing platform. He was sure to be entertaining anyway.

As I'd never been to Sydney I decided I could spend a couple of days on Bondi Beach with my girlfriend and drop into the conference and while I was there. I'm expecting to be looking for a new job from next year, probably not in Sydney, but I thought it would also be good to see who was there and meet a few people anyway. I was expecting there would be a couple of DevSta judges there too.

There were heaps of people at the event, I wouldn't be surprised if there actually was the 1000 developers the flyer claimed there would be. It was cool, I can't image an event with that many developers in Melbourne. It was streamed live and can be replayed on the Power To Developers site. I watched Steve get up and do his thing, and it was fun, not rock'n'roll, but fun. Unfortunately I wasn't feeling very well, almost feverish, and I had to duck out to find a chemist.

Gianpaolo Carraro's presentation attempted to demonstrate how easy developing cloud solutions for the Azure was with Visual Studios 2008. I thought it was pretty cool despite most of his demos failing in ways he couldn't have imagined. The Azure platform sounded pretty awesome, basically allowing custom .Net assemblies to be uploaded and invoked on Microsoft servers. He also managed to successfully demonstrate the cloud emulator for locally running and debugging cloud applications. I also really like the idea of being able to a write LINQ query joining two data tables from a SQL Server Service in the cloud.

Mike Culver presented the Amazon Web Services cloud solutions to the Vic.NET user group about a year ago and I think he sold their technology and the possibilities better. He showed the architecture of a video compression service that used a message queue service to queue requests and a controller application that can automatically rent, build and deploy servers capable of processing the requests. I thought it was pretty cool. I think Microsoft are a long way behind, but we'll see what the software giant is capable of. I'm certainly keen to get in and give it a go.

I still wasn't feeling well and missed most of Tim Sneath which was a shame, Brodie had seen him in London years ago but said he "knew his stuff". At the networking drinks I chatted with Michael Kordahi which was kind of fun, he'd judged my entry and he is a Silverlight guy. He introduced me to a couple of people, one of which was Jeremi Kelaher who does Strange Devices Podcasts. I also ran into Tatham Oddie on the way out who I'd seen present MVC stuff at REMIX and Vic.NET. I would have liked to have chatted with Andrew Coates who is a great presenter and was also a DevSta judge, but I didn't end up seeing him this time.


Tags:
Categories:
Comments (0)

DevSta {Challenge 2008} - 2nd Place!

October 23, 2008 13:40 by tarn

The winners for DevSta {Challenge 2008} have been announced: GPS Tag wins Microsoft Devsta Challenge. My Desktop Racer entry came 2nd. I'm naturally disappointed I didn't win, but definitely happy with 2nd. It was heaps of fun entering and I'm glad the stress is over ;-)

Congratulations to all the entrants, particularly the winners Jarred Sarge and Michael Minutillo for their GPS Tag Game. Hopefully we'll compete again next year.  Enjoy Vegas boys.


Tags:
Categories:
Comments (1)

Command Line Arguments

October 21, 2008 11:28 by tarn

I still write heaps of little console applications to do all sorts of things. Most of the console applications I write are tools for tasks like deployment, manipulating data and processing images. Usually they are for my own use, if I find one really useful I might pass it on to the rest of our development team or publish it on this blog. 

For these little tools, I prefer using console applications over windows application as I feel they are more useful. Even though its easier to use a tool that has a little GUI, its not more useful if you ever want to automate it.

Ideally I would prefer to write assemblies and use them from Powershell as it would provide the greatest usage flexibility. But it also requires competence scripting with Powershell. I was very excited about Windows Powershell when it was released, but a year on, I still haven't made it the integral part of my development toolset I thought it would be. This is an area I want to improve on, but this post is about command line arguments.

Generally these little non production, internal applications are quite light weight and fit for purpose. I find I often write simple logging and command line processing that is also fit for purpose. This is fine, its usually very simple, quick to write and works. The problem is that I write this same functionality over and over again for each little application.

I was toying with the idea of writing a generic command line parsing class. I figured I could use reflection on a settings class decorated with attributes describing the specific command line syntax. I figured I could even generate the help text using reflection too. While this seemed like a great idea, I am aware of time I've previously spent (wasted?) re-writing code where a perfectly good open source solution is available.

So I did a search and found a couple of existing command line solutions. Not surprisingly, given they were both written in .NET 2.0+, they both used reflection and attributes. I ended up using the Command Line Parser Library from CodePlex. I'm not sure if it does everything, but it does everything I need for 99% of the console applications I write that require command line processing.  

Anyway here is all the code from my test application:

public class Options
{
    [Option("r", "run", HelpText = "Run the export", Required = true)] 
    public bool Run;

    [Option("n", "Name", HelpText = "Use a specific name")]
    public string Name = String.Empty;

    [HelpOption(HelpText = "Dispaly this help screen.")]
    public string GetUsage()
    {
        HelpText help = new HelpText(new HeadingInfo("Hello World Example", "1.0.0"));
        help.Copyright = new CopyrightInfo("sharpthinking", 2008);
        help.AddPreOptionsLine("Usage: Hello --run");
        help.AddOptions(this);
        return help;
    }
}
class Program
{
    static void Main(string[] args)
    {
        Options options = new Options();
        if (Parser.ParseArguments(args, options, Console.Error))
        {
            string name = string.IsNullOrEmpty(options.Name) ? "World" : options.Name;
            Console.WriteLine("Hello {0}!", name);
        }
    }
}

And we get the expected usage from the command line

image

Now I have to make sure I bundle the console applications with an external assembly, but it does make using command line arguments easy.  


Tags:
Categories:
Comments (0)

Functional Programming

October 19, 2008 19:18 by tarn

The Problem, The Opportunity

 

Brodie and my initiatives in the area of search have been rewarded by management with a two week project to improve search on our sites. In the two weeks we also intend to fix one of the darkest corners of the code base, the Product/Site publishing logic, which is the main reason the search needs revisiting. We couldn't get the two weeks just to improve infrastructure, so we promised sub one second search results and better relevance as a carrot.

The web application powers multiple sites (>100) and has lots of products (>100,000). There needed to be some way of controlling what products belonged to what site. As products can appear on multiple sites and new sites are constantly being added, this is not trivial. The solution implemented is a site publishing rules system. Products can be excluded from a site by Classification, Product or Brand.

So far the logic is pretty simple, but another type of rule was also added. Products, Classifications and Brands can also be made specific to a site (they can actually be made specific to multiple sites). The logic is that a Product, Classification or Brand that is specific to a site will appear on that site, but not on other sites. In my opinion this is kind of crazy, but the thinking is; If you have a very specific product that should only be published on one or two sites, its easier to make it specific to those two sites rather than excluding it from every other site and new sites. Ok, fair enough.

The problem is that this logic is scattered through all our layers down to the database. There is duplication of logic over separate layers causing maintenance problems and it has been identified as a clear performance bottle-neck of any product list retrieval operation.

The scope of this prototype is a quick, robust, maintainable and fully unit tested solution that implements the product site publish logic.

Database

 

These rules are held in 6 SQL joining tables, which I think is interesting itself. The primary keys are not unique across products, brands and classifications meaning to enforce referential integrity and not allow nulls you do need a few tables. While I think this is a very good argument to have primary keys unique across all your database objects, its not always an option.

We could also have used 3 tables by combining the excluded and specific tables and adding a flag or lookup id. I think less is better, less tables means we can be more flexible with how we manipulate the data in code. Even in T-SQL I think I'd prefer less tables, less joins and more where statements - de-normalized. But that just my opinion in this scenario. 

As I move this logic out of the database and into managed code I'll end up abstracting the data to one table anyway, for this prototype it doesn't matter how the rules are created or persisted.

Functional Languages

 

Now back to the title of this post, I think this problem is really suited to functional programming. I've become more interested in functional programming after listening to a Heading Code pod cast with Matt Podwysocki. My experience with functional programming is limited and I've never found it very exciting. I learnt a bit of Haskel to help a mate with a computer science subject he was doing, but I was never really going to get to into it. At that time I believed all languages were inferior to C++.

As I've become more involved in software development I think its more important to use other languages. There is an inevitability that your language of choice will be relegated to legacy tasks as newer languages emerge better suited modern computing and user experience. C# code making heavy use of generics, extension methods, anonymous delegates and LINQ is just not the same language as the C# that rolled out with .Net 1.0. Dynamic and functional languages are becoming first class citizens.

I'm not actually going to write this in F# for now, but I think the way I've implemented my solution in C# is very functional although I leverage some OO concepts I might not be able to use in a strictly functional language.

 

Solution Overview

 

Instead of having all these separate lists of rules I've just created one list which can be queried. It makes heavy use of what Rob Conery described as a Pipes and Filters pattern in the ASP.NET MVC Storefront series. The pipes and filters are implemented as extension methods that add filters and data transformations. A collection of these small and easily testable extension methods can be used to build very powerful data retrieval and processing constructs to clearly describe business logic.

The scope of the prototype is really just a single function that implements the site publishing logic described earlier. It takes a Product, Site and List of Rules as parameters and returns the publish rule.

As soon as I started thinking about a solution to the problem I knew there was one thing I didn't want to do. I could see there was potentially a lot of repeated code for calling logic for products, classifications and brands individually. There is no need as it is the same logic that needs to be applied to each. I wanted to find a more elegant solution. I used an object orientated solution; the Rule class itself is abstract and is inherited by ProductRule, ClassificationRule and BrandRule.

public abstract class Rule
{
    public int SiteId { get; set; }
    public RuleType Type { get; set; }
    public int Value;
    public abstract bool Compare(Product product);
}

 

The only thing implemented by the specific rule classes is how to bind the Value to a property on the Product class.

public class ClassificationRule : Rule
{
    public override bool Compare(Product product)
    {
        return (Value == product.ClassificationId);
    }
}


public class ProductRule : Rule
{
    public override bool Compare(Product product)
    {
        return (Value == product.ProductId);
    }
}

..

 

I could have implemented this in a more functional way, but it does seem like a clean solution given I am using C#. 

Testing the Bits and Pieces

Testing the extension methods is easy with such simple domain objects. For most of the extension methods I only wrote a couple of small tests like this:

[TestMethod]
public void FromSite_ReturnsSiteRule()
{
    List<Rule> rules = new List<Rule>() { { new ProductRule() { SiteId = 1 } } };
    Assert.AreEqual(1, rules.FromSite(1).Count());
}

[TestMethod]
public void FromSite_FiltersSiteRule()
{
    List<Rule> rules = new List<Rule>() { { new ProductRule() { SiteId = 1 } } };
    Assert.AreEqual(0, rules.FromSite(2).Count());
}

 

I used LINQ constructs rather than the trusty delegates I use everywhere else for the extension methods. I like it, its clear and easy to read. And this is my point, its changed the language dramatically.

public static IEnumerable<Rule> FromSite(this IEnumerable<Rule> rules, int SiteId)
{
    return from rule in rules
           where rule.SiteId == SiteId
           select rule;
}

 

Generic Test Classes

 

I'm going to start using the the Microsoft Visual Studios Unit Test Framework. It means people who don't use the same testing framework flavour don't get build errors when they download code download and try to build it. As the attributes work with NUnit GUI and TestDriven.NET I'm happy to use it for now. I not sure if there is any mock testing support, but it does allow inheriting templated test methods.

For the same reason I didn't want to write code that called product rules, classification rules and brand rules individually, I don't want to write tests for them individually either.

public class GenericSiteRulesTests<T>
    where T : Rule
{

    [TestMethod]
    public void ProductWithNoRules_ProductIsReturned()
    {
        Assert.IsTrue(RuleLogic.ProcessRules(new Product(), new List<Rule>(), 1));
    }

    [TestMethod]
    public void ProductIdCanBeExcluded_ExcludedProductIsNotReturned()
    {
        List<Rule> rules = new List<Rule>() {{ CreateRule(1, RuleType.Excluded, 0 )}};
        Assert.IsFalse(RuleLogic.ProcessRules(new Product(), rules, 1));
    }

    ..

}

 

The way this is done I think is pretty cool, I did something similar when I was looking at testing Text Search algorithms recently. All the generic rule tests are written in a generic test class which must be templated to a type the derives from Rule.  The class doesn't have a test class attribute. This is because the test classes we want to run are actually classes that inherit from the generic test class.

[TestClass]
public class ClassificationSiteRulesTests 
    : GenericSiteRulesTests<ClassificationRule>
{
}

[TestClass]
public class ProductSiteRulesTest 
    : GenericSiteRulesTests<ProductRule>
{
}

[TestClass]
public class BrandSiteRulesTest 
    : GenericSiteRulesTests<BrandRule>
{
}
 

These classes do have the test class attributes. Although they are empty, they expose all the templated methods inherited from the generic test class. These methods can be executed by a test runner.

 

  image

NUnit GUI has running all the tests.

The Logic

 

Ok, so its not really "clear and easy to read", but it is very compact and concise.

public class RuleLogic
{
    public static bool ProcessRules(Product product,  IEnumerable<Rule> rules, int siteId)
    {
        // return false if there is any exclusion rule that applies to this product
        if (rules.FromSite(siteId)
                 .WithType(RuleType.Excluded)
                 .AppliesTo(product)
                 .Count() > 0) return false;

        // every rule type (ie ProductRule) that any site has a specific rule relating too
        foreach (Type type in rules.WithType(RuleType.Specific)
                                   .AppliesTo(product)
                                   .RuleTypes())
        {
            // this product requires a specific rule of this type
            if (rules.WithType(RuleType.Specific)
                     .WithTarget(type)
                     .FromSite(siteId)
                     .AppliesTo(product)
                     .Count() == 0) return false;
        }

        return true;
    }
}

 

Conclusion

 

It all seems fine, all the 33 tests are running and passing, coverage is at 100%.. still I have some strange feeling there is still a critical bug in there, somewhere. 

image

We'll wait and see Monday morning if the project will go ahead and this prototype can be extended into part of a production quality product publishing rules engine. Oh, and the sub 1 second product searching..


Tags:
Categories:
Comments (0)

Text Searching

October 14, 2008 21:11 by tarn

Introduction

Brodie and I recently started presenting code, prototypes and ideas to the other Devs we work with. This post came about after Brodie presented a search engine prototype. He was interested to find if we could load about 100,000 pages of text to in-memory collections for fast searching. The prototype used WCF and LINQ to create a RESTful server that binds to a HTTP endpoint. I thought it was a really cool prototype, it used less memory than I expected and was many orders a magnitude faster than the incumbent full-text indexed SQL query.

We were discussing how we could make non-cached searches faster and agreed the String.IndexOf() and String.Contains() methods inside the filtering and relevance Linq queries would be the best place to start. I probably wouldn't have done anything more with it but I couldn't stop thinking about how different indexing full-text methods we discussed might perform, particularly a tree base algorithm. 

The following week I presented a text search engine interface, some implementation prototypes, a testing framework prototype and the use of the StructureMap to inject concrete implementations of a text search engine interface into Brodies prototype. I'd already taken this way further than I ever intended, but I decided that since I had gone this far I might as well wrap it up in a blog post too.

I started off by creating a search engine interface. I used an interface from the start as I wanted to try mock testing and dependency injection.

public interface ITextSearchEngine
{
    object Setup(string searchText);
    bool Contains(string query);
    int Find(string query);
    List<int> FindAll();
}

The most simple search engine is just using a string and the methods on System.String. It shows how easy it is to implement the ITextSearchEngine interface: 

public class SimpleTextSearch : ITextSearchEngine
{
    private string _contents;

    #region ITextSeach Members

    public object Setup(string searchText)
    {
        _contents = searchText;
        return _contents;
    }

    public bool Contains(string searchText)
    {
        return _contents.Contains(searchText);
    }

    public int Find(string query)
    {
        return _contents.IndexOf(query);
    }

    public List<int> FindAll()
    {
        throw new NotImplementedException();
    }

    #endregion
}

 

Search Engine Test Framework

The test framework is a custom framework that can load types that implement ITextSearchEngine, ITestHarness and ITextDataStore from external assemblies at runtime. ITestHarness instances are created and passed new instances of ITextSearchEngine and ITextDataStore. The test methods in the test harness are executed and the results a saved as XML and also transformed to HTML. 

The ITextDataStore was added to make it possible to write tests that are independent of actual data. This is for performance tests where the data will effect the results. I intend to verify this by running a tests against different data. 

public interface ITextDataStore
{
    string SearchText();
}

 

The ITestHarness interface must be implemented by test classes. The Create(..) method allows the test framework to inject ITextSearchEngine and ITextDataStore implementations into the test class.

public interface ITestHarness
{
    void Create(ITextSearchEngine engine, ITextDataStore datastore);
    void Setup();
}

 

Now we can start implementing some tests before we've even implemented a search engine. To do this we start by creating a base test class:

public class TestHarnessBase : ITestHarness
{
    #region Data Members

    protected ITextSearchEngine _searchEngine;
    protected ITextDataStoreTest _dataStore; 

    #endregion

    #region ITestHarness Members

    public void Create(ITextSearchEngine engine, ITextDataStoreTest datastore)
    {
        _searchEngine = engine;
        _dataStore = datastore;
    }

    public virtual void Setup()
    {
        
    }

    #endregion
}

 

The test framework will consider public methods with a known return type and no parameters as test methods.

 
public class SearchEngineSearchTests : TestHarnessBase
{
    #region Test Methods

    public bool Contains_FindExistingWord()
    {
        _searchEngine.Setup("The quick brown fox jumps over the lazy dog");
        return _searchEngine.Contains("fox");
    }

    public bool Contains_NotFindNonExistingWord()
    {
        _searchEngine.Setup("The quick brown fox jumps over the lazy dog");
        return !_searchEngine.Contains("cat");
    }
    #endregion
}

 

Text Searching Algorithms and Indexing

There is lots of documentation on the Boyer-Moore algorithm, I didn't look into it too deeply as its already been implemented in .NET, but also as it is designed to work when is only possible to enumerate through the searchable text. As we have our searchable text in-memory we could leverage random access to the data.  Aho-Corasick is the same, but it uses a cool tree to search for multiple words at the same time. It was that tree and later some influence from the Suffix Tree that led me to think I could store the entire searchable text in a tree.

A Text Tree

Consider the string "big bad barry" represented as a tree as below. 

image_thumb1[4]

This tree is great for exact match string searching; Start with the root node and the first character of the search string. Check if the current node has a child matching the current character. If it does, move to the node and the next character and repeat the process. If it doesn't then the word doesn't exist in the searchable text. If we find a match for every letter, the word exists in the searchable text.

 

A Text Tree With Indexes

The previous tree is only good for the Contains() method as the tree has no information about where the matches are found or how many times they occur. To add this information to the tree we could add a new node at the end of every word with its index in the original text

image_thumb3

We follow the same procedure as before to find words, only this time we know if we have found a complete word (if there is one or more index nodes on the last node we check). We can also find all instances of the word by traversing the rest of the tree and returning all the index nodes found. For example we can see that "ba" is in the string twice by finding all the index nodes after the 'a' node; In this case at characters 4 and 8.

From this tree we can also find entire sentences if our search algorithm is good enough. Basically it just needs to find the words in the sentence individually and then using the indexes and the word lengths work out which words combinations were next to each other in the original text.

The reason this is all possible is the tree has enough information to re-build the original string.

A Text Tree With Indexes and Suffixes

In the previous trees we can only search efficiently for words and prefixes. What if we want to be able to search for substrings? We can do this by also adding all the suffixes of each word to the tree.

image_thumb7 

Now we can search for words, substrings and sentences and get all the occurrences. That's pretty cool, but this tree is getting pretty massive compared to the original string.

Tree Implementation

I'm not actually going to describe how to implement this tree, its easier just to check the code and step through it if you have too.

The search engines I have implemented are

SimpleTextSearch Using IndexOf() and Contains() methods on System.String
RegexTextSearch Quick and dirty Regex implementation
SimpleTreeTextSearch The basic tree with no index information. Only implements the Contains() method
IndexTreeTextSearch The basic tree with index information. Search algorithm only looks for a single word
ExtendedIndexTextTreeSearch Extends the IndexedTreeTextSearch search algorithm to support multiple word search phrases
SuffixIndexTreeTextSearch Extends the IndexedTreeTextSearch to include suffixes in the tree.
ExtendedSuffixIndexTreeTextSearch Includes suffixes and the search algorithm from above that supports multiple word search phrases.

 

Some things I didn't do but would look into if I was actually going to use this anywhere:

  • Hash tables for each nodes children (currently using List<>)
  • Split text into words more effectively and maybe remove stop words (How to Split a Text into Words might be worth reading)
  • Case insensitive search algorithm (with case information still in the tree)
  • Removing casing from searchable text

 

To be continued...

To be honest, I got distracted away from this by getting into the new Silverlight 2 SDK for my DevSta entry, Desktop Racer. DevStar is an Australian Microsoft Visual Studios development competition and I figured Silverlight 2 is a little more exciting than a text searching algorithm by someone with little to no background text search theory. Never-the-less I've got the ExtendedSuffixIndexTreeTextSearch passing all the functional search test I've written. I've uploaded a copy of the results and the project in its current state.

HTML Functional Test Results

TextSearch VS2008 Project

I haven't yet written any performance tests to establish each engines setup speed, search speed, CPU usage and memory usage. I think this will be very interesting, while I'm pretty confident the tree search methods will be quick, I suspect they will require significantly more memory than storing plain text. I would not be surprised to find that the memory usage and setup times for my implementations of tree searches make them impractical to use for most real world search problems. Regardless, it has been lots of fun just getting the tree search functional but I'd be surprised if I've added any value to one of the most researched topics in computer science.


Tags:
Categories:
Comments (0)