Project Description

Lucandra.NET is a Lucene + Cassandra implementation written in C#, based on the Lucandra (Java) project by Jake Luciani. Apache Lucene is a high-performance full-text search engine, and Apache Cassandra is a promising NoSQL database originally developed at Facebook.

Note: This page is under construction, and I'm working on the documentation as time permits. Not all of the source code is documented yet, but it will be.

About Lucandra.NET

Lucandra.NET started as a direct port of Jake Luciani's "Lucandra" project (https://github.com/tjake/Lucandra), driven by curiosity and the desire to learn a little more about the inner workings of Lucene and Cassandra. After completing the port from Java, I realized that this truly is a valid and promising replacement for the traditional file-based segment stores used by Lucene. We decided to use it in one of our production products, so since then I've rewritten and refactored a lot of the code and tried to squeeze as much out of it as I can, in the hope that other users of Lucene.NET will find it useful. Note that the code no longer directly mirrors the original Lucandra code, as I've gone my own way with things; expect a small learning curve if you were hoping to walk into this code already familiar with Lucandra's (though some of it is still similar).

Some things to keep in mind:
  • Lucandra.NET is currently written for Cassandra 0.7 and Thrift 0.5.0 and will not work (without modification) with versions earlier than Cassandra 0.7. It has been tested with Cassandra 0.7-beta3 and Cassandra 0.7-RC1.
  • Lucandra.NET is built against Lucene.NET 2.9.2.
  • Lucandra.NET is not compatible with Lucandra (Java), because Lucandra.NET uses a different data model within Cassandra and because Lucandra uses Java object serialization, which is not compatible with .NET's serialization.
  • In the current version, it is expected that you will use a ByteOrderedPartitioner as your partitioner in Cassandra. This facilitates wildcard & range queries, sorting, etc.
  • Lucandra (Java) performs hashing on keys stored in Cassandra; Lucandra.NET does not currently do this. This is primarily because I do not fully understand the inner workings of the partitioners in Cassandra; until I have some time to really play with them and see how partitioning works across a cluster, this won't be implemented.
  • Solandra (part of Lucandra: Solr + Cassandra) will not be a part of this project as it is beyond the scope of this project to port Solr, and I have very little familiarity with Solr as a whole.
  • While Lucandra.NET is written in .NET, the Cassandra node(s) can still be run on any operating system supported by Cassandra/Java.
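
To illustrate why an order-preserving partitioner helps with wildcard and range queries, here is a minimal Python sketch. It is not Lucandra.NET code: the "/"-separated key layout, the sample keys and the function name prefix_scan are all hypothetical. The only assumption taken from the list above is that keys are kept in raw byte order, which is what ByteOrderedPartitioner provides. With that ordering, every key sharing a prefix sits in one contiguous run, so a prefix query becomes a single seek plus a forward scan.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with `prefix` from a byte-sorted key list.

    Because the keys are in byte order, all matches are adjacent: seek to the
    first candidate with a binary search, then read forward until a key no
    longer carries the prefix.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # past the contiguous run of matching keys
        matches.append(key)
    return matches

# Hypothetical term keys (index name / field name / term text), byte-sorted.
keys = sorted([
    b"wiki/title/cassandra",
    b"wiki/title/cat",
    b"wiki/title/catalog",
    b"wiki/title/dog",
    b"wiki/body/cat",
])

# A wildcard query such as title:cat* reduces to one prefix scan.
matches = prefix_scan(keys, b"wiki/title/cat")
print(matches)
```

With a hashing partitioner the same keys would be ordered by their hash values, so the matching terms would be scattered and could not be read as one contiguous slice.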

Why Lucandra.NET?

Lucandra.NET offers several advantages (and disadvantages) over a traditional Lucene.NET implementation. Have a look here and see if Lucandra.NET is right for you.
  • Pros:
    • Removes the complexity of sharding and replication of Lucene indices as Cassandra's architecture handles all of this for you.
    • Your Lucene.NET-generated indices can now be easily stored on commodity (implementation dependent) non-Windows (or Windows, but why pay the license fee?) servers without the hassles and latency of networked/clustered file systems.
    • Real-time indexing. The documents you index will be available (almost) instantaneously for searching.
    • Always-open reader. Once you open the LucandraReader on the index, you do not need to refresh it to see updates to the index. In fact, in most cases you can search for a document only milliseconds after it has been indexed.
    • No optimizing. The keys inserted into Cassandra are automatically sorted and indexed by Cassandra itself, and there are no segment files to speak of - so no merging :-)
    • Ability to work with extremely large indices without explicit index sharding.
    • Creating and working with new indices is as easy as opening a writer/reader and passing the name of the index (this may change now that Cassandra supports dynamic keyspace/column family creation). You don't need to bother with file paths anymore.
    • Multiple writers can write to the same index at the same time.
  • Cons:
    • Naturally, since we need to make a round-trip to the Cassandra node/cluster, Lucandra.NET search times are slower than those of local disk-based/SAN solutions, though not by much.
    • Not all features of Lucene.NET are supported by Lucandra.NET, and as such...
    • Lucandra.NET is not a drop-in replacement for Lucene.NET. While Lucandra.NET subclasses the IndexReader class (LucandraReader), it uses its own index writer (LucandraWriter).
    • As of right now, document keys are pseudo-randomly generated. This means that documents will not be returned from Cassandra in the order they were inserted, and therefore it is impossible to walk documents in the index from first->last inserted.
    • As this is ALPHA software, if you plan on using it, expect to have to wipe your index and start over between releases - at least until beta ;-)

Functionality

  • Working:
    • Indexing documents
    • String-valued stored fields
    • Deleting documents
    • Searching: Terms, ranges, wildcards and sorting
    • Unicode support
    • Multi-threaded indexing (against same LucandraWriter)
    • Multi-threaded searching (against same LucandraReader)
    • Highlighting
  • Questionable:
    • Scoring. Works; however, I've seen some cases where I think one document should have won over another. I need to compare this against a normal Lucene index before declaring it as working.
  • Implemented but untested:
    • Faceted search
    • Binary-valued stored fields
  • Not working:
    • Currently, there is no efficient way to retrieve the total number of documents or terms in an index.
    • Currently, there is no efficient way to delete an index or all documents in an index.
    • Field cache
    • Multi-threaded deleting (against same LucandraWriter - but why?)

Cassandra Schema

NOTE: As of changeset 1516 the term info column family is no longer a super-column!

Lucandra.NET stores information in two column families: "T" for term info (a standard column family) and "D" for documents (a super-column family). The schema below is subject to change and is not yet fully optimized: flags will become bit vectors instead of individual bytes, and frequencies, positions and offsets, currently stored as 4-byte integers, will use Lucene's built-in variable-int classes to save as much space as possible. (Update: implemented in the latest source versions.) I have also been considering adding an "F" column family to store field info, similar to that of a traditional Lucene directory (I believe a FieldCache could be efficiently implemented then?).
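
For readers unfamiliar with Lucene's variable-length integers (VInt), the following Python sketch shows the encoding and the space it saves over fixed 4-byte integers. It follows the VInt format as documented by Lucene (seven data bits per byte, low-order bits first, high bit set when more bytes follow). The function names and the sample positions are illustrative only, and this is not the Lucandra.NET serialization code.

```python
def write_vint(value):
    """Encode a non-negative integer in Lucene's VInt format."""
    if value < 0:
        raise ValueError("VInt encodes non-negative integers only")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)

def read_vint(data, pos=0):
    """Decode one VInt starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7

# Hypothetical term positions within one document.
positions = [3, 17, 300]
encoded = b"".join(write_vint(p) for p in positions)
print(len(encoded), "bytes as VInts versus", 4 * len(positions), "bytes as 4-byte integers")
```

Small values, which dominate frequencies and positions in typical documents, fit in a single byte each, which is where the saving comes from.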

Documents: The "s" super-column holds stored field values and the "i" super-column keeps track of the terms indexed for the document. Each stored field is stored with its occurrence number, which allows a field to have several values. If the field "Answer" were to have two values, there would be two columns: "Answer.0" and "Answer.1", each storing its respective value.

D => {
    [index name, document id]: {
        s: {
            name: "s",
            columns: {
                storedField1.#: {
                    name: [field name],
                    value: [binary flag, value]
                },
                ... <more stored fields> ...
            }
        },
        i: {
            name: "i",
            columns: {
                indexedTerm1: {
                    name: [field name, term text],
                    value: [store positions flag, store offsets flag, frequency, positions, offsets]
                },
                ... <more indexed terms> ...
            }
        }
    },
    ... <more documents> ...
}
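
The occurrence numbering of stored fields can be sketched in a few lines of Python. This is an illustration of the naming scheme only: the function name stored_field_columns is hypothetical, the binary flag is left out, and it assumes (based on the "storedField1.#" notation in the schema) that single-valued fields also receive a ".0" suffix.

```python
def stored_field_columns(fields):
    """Map (field name, value) pairs to "name.occurrence" column names.

    `fields` is a list of pairs in document order; a repeated field name
    gets the next occurrence number, starting at 0.
    """
    counts = {}
    columns = {}
    for name, value in fields:
        occurrence = counts.get(name, 0)
        columns[f"{name}.{occurrence}"] = value
        counts[name] = occurrence + 1
    return columns

columns = stored_field_columns([
    ("Question", "6 x 7?"),
    ("Answer", "42"),
    ("Answer", "forty-two"),
])
print(columns)
```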


Term Info: CHANGED 2010-11-24!! No longer a super-column as of changeset 1516. All index terms are stored in this standard column family. Each row key is the term key (index name/field name/term text) and each column is a document pointer, which points to a document in the "D" column family above.

T => {
    [index name, field name, term text]: {
        columns: {
            document1: {
                name: [index name, document id],
                value: [use norm flag, norm, store positions flag, store offsets flag, frequency, positions, offsets]
            },
            ... <more document pointers> ...
        }
    },
    ... <more terms> ...
}
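
To make the relationship between the two column families concrete, here is a small in-memory Python model. It is a sketch under assumptions, not the Lucandra.NET implementation: plain dictionaries stand in for Cassandra, tuples stand in for the composite keys, whitespace splitting stands in for a Lucene analyzer, only the frequency is kept in each column value, and all function names are hypothetical. Using the "i" entries to clean up "T" on delete is my guess at one purpose of that super-column.

```python
from collections import defaultdict

# T: row key (index, field, term) -> {document pointer (index, doc id): frequency}
T = defaultdict(dict)
# D: row key (index, doc id) -> {"s": stored fields, "i": indexed terms}
D = defaultdict(lambda: {"s": {}, "i": {}})

def index_document(index, doc_id, field, text):
    """Store one field and record a document pointer under every term."""
    doc_key = (index, doc_id)
    D[doc_key]["s"][f"{field}.0"] = text
    freqs = {}
    for term in text.lower().split():  # stand-in for a real analyzer
        freqs[term] = freqs.get(term, 0) + 1
    for term, freq in freqs.items():
        T[(index, field, term)][doc_key] = freq
        D[doc_key]["i"][(field, term)] = freq

def search_term(index, field, term):
    """A term query is a single row read in T; the columns are the hits."""
    pointers = T.get((index, field, term), {})
    return sorted(doc_id for (_, doc_id) in pointers)

def delete_document(index, doc_id):
    """Remove a document, using its "i" entries to find its rows in T."""
    doc_key = (index, doc_id)
    doc = D.pop(doc_key, None)
    if doc is None:
        return
    for field, term in doc["i"]:
        T[(index, field, term)].pop(doc_key, None)

index_document("wiki", "doc-a", "body", "Cassandra stores Lucene terms")
index_document("wiki", "doc-b", "body", "Lucene terms and more terms")
hits = search_term("wiki", "body", "terms")
print(hits)
```

Because every write touches only the rows for its own terms and its own document, nothing in this layout needs a segment merge, which matches the "no optimizing" point made earlier.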

Testing & Performance

All of my performance/stress testing as of this writing has been performed on the following machine with only one Cassandra node. The index writer and reader are also running on this same machine.
  • CPU: Intel Core2 Quad Q6600 @ 2.4 GHz
  • RAM: 4GB, with 1GB given to the JVM running Cassandra.
  • Disk: 74GB 10K RPM system drive; 2x500GB 7200 RPM drives in RAID 1 on which the Cassandra database is run (both commit log and data are on this drive - not recommended).
In a few weeks I'll have access to some better servers to test with (quad core @3.6GHz with 8GB and 1TB RAID10, and an 8-core @2.0GHz with 32GB and 1.5TB RAID10) and will post test results (both clustered and non-clustered) at that time. I will use data sets of several different sizes and styles.

In preliminary testing, as long as Cassandra can take advantage of its cache and doesn't need to hit the disk, searches against a 25GB index built from Wikipedia took between 2 and 60ms on the machine above. Often I would see the first search take roughly 150ms and then drop to 8-15ms the second time it was executed (keep in mind that Cassandra was only given 1GB of RAM to work with). Currently, Lucandra.NET does not cache terms beyond the scope of a search. It does, however, cache documents. You can affect performance by setting the term read-ahead threshold (how many terms the TermInfoReader looks ahead). More information to come in the Documentation section soon (currently under construction as time permits)!

Documentation and Examples

Please visit the Documentation wiki page for information on configuration and some simple examples. (A work in progress!)

Credits

  • Full credit for the original idea and implementation of Lucene using Cassandra as a back-end store, in the Lucandra project, goes to Jake Luciani. The Lucandra (Java) project can be found on GitHub here: https://github.com/tjake/Lucandra.
  • The connection pooling code is adapted from the Aquiles project (related files have notices). Aquiles can be found here: http://aquiles.codeplex.com/.

Disclaimer

This project is currently in ALPHA and its viability in a production environment is not guaranteed. If you are hoping to use this code in production, evaluate it on a case-by-case basis with your own tests. Also note that I am new to Cassandra, and while I try my best to do things "as they should be done", documentation is limited and best practices are seemingly non-existent. So, if anyone with more experience than I has any comments/suggestions/complaints about the way Cassandra is used within this project, please don't hesitate to correct me :-)

Also, as mentioned above, the Cassandra schema and interfaces within this project are subject to change until a stable version is released, although I don't expect breaking changes unless I find some performance bottlenecks.

Last edited Nov 27, 2010 at 8:16 PM by cylwit, version 41