Open source search engines every developer should know about.

May 29, 2011

Search is a crucial feature of any website. Even if your navigation is crystal clear, it still doesn’t cater for power users and return visitors who remember a particular piece of content, or who want to find collections of information on your site covering the same topic.

Search is typically one of the most poorly implemented pieces of technology on a site, with developers opting for the standard out-of-the-box solution that comes with most modern content management systems, which in many cases doesn’t do justice to your content. I thought I’d take a look at what other enterprise-level and open source search engines are out there to find and index the information on your site faster, and to provide users with a deeper, more relevant result set.

Constellio

URL: http://www.constellio.com/

Constellio is an open source search solution suitable for enterprise-level search. It is built on the Apache Solr project, which uses the Lucene project as its main engine, and it indexes both web pages and documents via its web-based interface. You can select which types of documents to index, including folders and wildcard filenames, and Constellio provides both the search interface and granular control over what gets indexed. It also supports indexing via technologies such as the sitemap protocol and RSS.
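To give a feel for what indexing from the sitemap protocol involves, here is a minimal Python sketch that pulls the page URLs out of a sitemap so a crawler could queue them for indexing. It is not Constellio's code; the function name and the example.com sitemap are made up for illustration, and only the XML namespace comes from the sitemap protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap, for illustration only.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(EXAMPLE_SITEMAP):
        print(url)
```

A real indexer would then fetch each URL and feed the page text to the search engine; that part is left out here.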

SearchBlox

URL: http://www.searchblox.com/

Another open source search solution built around Lucene, SearchBlox offers a number of advantages over its nearest comparable product, Google Mini, and again is based on cross-platform technology (Java). Its main advantage is that it provides a level of abstraction from Lucene for developers, with a simpler API to interface with, so you can quickly deploy a solution without having to understand all the underlying complexities of Lucene. It also offers indexing across third-party websites.

Apache Solr

URL: http://lucene.apache.org/solr/

Apache Solr is an open source enterprise-level search solution with features including powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites. At its core, Apache Lucene (a well-respected Java search library used in many of the aforementioned products) is used as the underlying engine, with both technologies going hand in hand.
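Solr is queried over plain HTTP, so features such as faceting and hit highlighting are switched on with request parameters. The Python sketch below only builds such a request URL and does not send it. The base URL and the field names (category, content) are assumptions for illustration, and the exact path to the select handler depends on how your Solr instance is set up.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, facet_field, highlight_field, rows=10):
    """Build a Solr select URL with faceting and hit highlighting enabled."""
    params = [
        ("q", text),                 # the full-text query
        ("rows", rows),              # number of results to return
        ("wt", "json"),              # ask for a JSON response
        ("facet", "true"),           # turn faceting on
        ("facet.field", facet_field),
        ("hl", "true"),              # turn hit highlighting on
        ("hl.fl", highlight_field),  # field(s) to highlight
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and field names.
    print(build_solr_query_url("http://localhost:8983/solr", "open source", "category", "content"))
```

The response would carry the matching documents plus facet counts for the chosen field and highlighted snippets, which you can then render in your own search page.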

There are also a number of web-based implementations of Solr available for those devs not wanting the hassle of deploying it themselves. Websolr, for example, provides a platform-independent offering of Solr in the cloud, as does PowCloud (still in beta), which also offers WordPress integration.

Sphinx

URL: http://sphinxsearch.com/

Powering top sites such as Craigslist and Dailymotion, Sphinx is a cross-platform open source search server written in C++, which lets you search across various systems, including database servers, NoSQL storage and flat files. A variety of text processing options make it easy to index documents, with the ability to fine-tune how the relevance algorithm works. Once deployed and set up, searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.
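As a rough idea of what a SphinxQL query looks like, the Python sketch below builds a full-text query using the MATCH() clause. It only prepares the statement and its parameters; actually running it needs a MySQL client library, since SphinxQL is spoken over the MySQL protocol. The index name articles is hypothetical, and the %s placeholder style is an assumption based on what common Python MySQL drivers use.

```python
import re

# Index names cannot be passed as query parameters, so validate them instead.
_INDEX_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_sphinxql(index, terms, limit=10):
    """Return a (sql, params) pair for a SphinxQL full-text search."""
    if not _INDEX_NAME.match(index):
        raise ValueError("unexpected index name: %r" % (index,))
    sql = "SELECT id FROM {} WHERE MATCH(%s) LIMIT %s".format(index)
    return sql, (terms, int(limit))


if __name__ == "__main__":
    sql, params = build_sphinxql("articles", "open source search")
    print(sql)
    print(params)
```

With a driver in hand, the pair would typically be passed to something like cursor.execute(sql, params), leaving the quoting of the search terms to the driver.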

Alternative WordPress Search Plugins

Let’s face it, WordPress search is pretty naff, and for anything other than basic searching it is inadequate. So, what’s wrong with the implementation of WordPress search then?

1) It automatically assumes that the freshest content for a particular keyword search is the most relevant.

2) The algorithm doesn’t place any weighting on results based on links.

3) The algorithm doesn’t rate the title as more important than the content.

4) Search terms aren’t highlighted in the result set, so users can’t quickly determine relevance.
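To make points 1 and 3 concrete, here is a small Python sketch of ranking by relevance rather than by date, with a match in the title counting for more than a match in the content. It is not how WordPress or any particular plugin works; the weights and the sample posts are assumptions chosen purely for illustration.

```python
import re

# Assumed weights: a hit in the title counts three times as much as one in the body.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, title, body):
    """Count query-term occurrences, weighting title hits above body hits."""
    terms = set(tokenize(query))
    title_hits = sum(1 for word in tokenize(title) if word in terms)
    body_hits = sum(1 for word in tokenize(body) if word in terms)
    return TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits


def rank(query, posts):
    """Return titles of matching (title, body) posts, best score first."""
    scored = [(score(query, title, body), title) for title, body in posts]
    matching = [pair for pair in scored if pair[0] > 0]
    return [title for _, title in sorted(matching, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    posts = [
        ("Weekly news", "A note about search engines and search tips"),
        ("Search engines compared", "Some text"),
    ]
    print(rank("search", posts))
```

Here the second post wins even though the first mentions the term twice, because its single hit is in the title.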

There’s definitely scope for improvement, and a number of solutions have popped up to fill that particular gap in its implementation and feature set. Here are some of them:

Google Search for WordPress

URL: http://wordpress.org/extend/plugins/google-search/

If you are looking to get up and running quickly with the Google search API, this plugin for WordPress offers an API implementation, with scope for phrase highlighting, all provided in an AJAX interface.

More from Google

URL: http://wordpress.org/extend/plugins/more-from-google/

Augments and extends the existing WordPress search interface with additional posts from others that Google have found on the topic.

Search API

URL: http://wordpress.org/extend/plugins/search/

If you are a developer looking to build upon the existing WordPress functionality, this plugin gives you a good framework on which to build your solution, with some ready-made function calls to enhance things. It supports advanced search capabilities such as boolean search, multiple content searches (posts, tags, pages, authors and any available metadata) and flags (finding posts with string A in category C).

Relevanssi

URL: http://wordpress.org/extend/plugins/relevanssi/

Relevanssi has both free and premium options available, and solves a number of the problems that I’ve highlighted above. There is support for both boolean searches and fuzzy logic out of the box, along with the following additional features:

  • Search results sorted in the order of relevance, not by date.
  • Fuzzy matching: match partial words, if complete words don’t match.
  • Find documents matching either just one search term (OR query) or require all words to appear (AND query).
  • Search for phrases with quotes, for example “search phrase”.
  • Create custom excerpts that show where the hit was made, with the search terms highlighted.
  • Highlight search terms in the documents when user clicks through search results.
  • Search comments, tags, categories and custom fields.
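A few of the ideas in the list above (AND versus OR queries, falling back to partial-word matches, and highlighting the search terms) can be sketched in a few lines of Python. This is a generic illustration and not Relevanssi's implementation; the function names and the choice of strong tags for highlighting are assumptions.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_matches(term, words):
    """Try a whole-word match first, then fall back to a partial-word match."""
    if term in words:
        return True
    return any(term in word for word in words)


def matches(query, text, mode="AND"):
    """AND requires every query term to match; OR requires at least one."""
    words = tokenize(text)
    terms = tokenize(query)
    if not terms:
        return False
    hits = [term_matches(term, words) for term in terms]
    return all(hits) if mode == "AND" else any(hits)


def highlight(query, text, before="<strong>", after="</strong>"):
    """Wrap every occurrence of a query term in the given markers."""
    terms = sorted(set(tokenize(query)), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    return pattern.sub(lambda m: before + m.group(0) + after, text)


if __name__ == "__main__":
    print(matches("index speed", "Indexing is fast", mode="OR"))
    print(highlight("search", "Search engines make searching easy"))
```

In the example, "index" only matches because of the partial-word fallback against "indexing", which is why the OR query succeeds where an AND query would not.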

Search Everything

URL: http://wordpress.org/extend/plugins/search-everything/

Search Everything is perfect for you if you want to search custom data stored within WordPress, such as post types or fields, and it also supports searching across attachments.

Search Unleashed

URL: http://wordpress.org/extend/plugins/search-unleashed/

Search Unleashed is an extensible plugin with support for a number of search engines – including Apache Lucene. It comes with the standard implementation, MySQL full-text and Lucene engines all ready to be deployed, and provides a neat “Priority based” search capability that determines relevance from where phrases occur inside the WordPress post. Incoming searches from third-party engines can also be given a CSS style to show the searcher how they found the page.

Notable Mentions

http://www.coveo.com/en/products/coveo-expresso – Free for up to 50 users and 100,000 documents. Might be useful for small enterprises.

http://sna-projects.com/zoie/ – Real-time search indexing built on top of Lucene.

http://xapian.org/ – A search library written in C++.

http://www.indextank.com/ – Powers many of the larger social sites (Reddit etc.).

http://www.kneobase.com – An open source solution that indexes zip files, Microsoft Office and more, before turning them into HTML representations and delivering results.

Google SiteSearch

URL: http://www.google.com/sitesearch/

Site Search is aimed primarily at websites and, unlike Google Mini, wouldn’t be appropriate for an intranet scenario. It is a fully hosted solution, and offers a number of cool features to site owners looking to enhance the existing search functionality on their site. Pricing is based on the number of queries per year. Starting at $100 for 20,000 queries a year, it’s an inexpensive option for those with lower traffic, but the irony is that you probably won’t need it until your content gets unmanageable. The more content you have, the more traffic you are likely to have, and that’s going to push the cost up. It is, however, worth considering, as the technology behind the scenes is, as you can imagine, pretty top notch. For those of you who would rather just tap into the technology, Google Custom Search offers an AdSense-supported option (on which you can receive a revenue share) that lets you use Google tech for free. Customisation from a look-and-feel perspective is, however, limited.

Google Mini

URL: http://www.google.com/enterprise/search/mini.html

Google Mini is a server-based solution that offers a way to deploy Google technology inside your website easily. Once deployed, Mini crawls your websites, file systems and internal databases, indexing and caching the contents as it goes, and finally delivers search results through a uniform interface that can be tweaked and designed how you want through its APIs. Costs start at $1,995 (direct) plus a $995 yearly fee after the first year for indexing of 50,000 documents, and scale upwards. For example, the initial cost of a 300,000-document license is $8,995. Google also has another step up from that again, the Google Search Appliance, which offers all the bells and whistles of Google technology with unlimited indexing for a cool $30,000, but it is in many cases outside the budget of small businesses.
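To compare the two Google options on cost, the quoted prices can be turned into a quick back-of-the-envelope calculation. The Python sketch below uses only the figures quoted in this post ($100 for 20,000 queries for Site Search; $1,995 up front plus $995 a year after the first year for the 50,000-document Mini). It assumes the yearly fee stays flat, which may not hold in practice.

```python
def site_search_cost_per_query(yearly_price=100, yearly_queries=20000):
    """Cost of a single query on the entry-level Site Search plan, in dollars."""
    return yearly_price / yearly_queries


def mini_total_cost(years, initial=1995, yearly_fee=995):
    """Total cost of a 50,000-document Google Mini over a number of years.

    Assumes the yearly fee is charged for every year after the first.
    """
    if years < 1:
        raise ValueError("years must be at least 1")
    return initial + yearly_fee * (years - 1)


if __name__ == "__main__":
    print(site_search_cost_per_query())  # dollars per query
    print(mini_total_cost(3))            # dollars over three years
```

On these figures the entry-level Site Search plan works out at half a cent per query, while a Mini kept for three years comes to $3,985.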

URL: http://www.dataparksearch.org/ - a full-featured open source web-based search engine released under the GNU General Public License and designed to organize search within a website, group of websites, intranet or local system.

URL: http://www.open-search-server.com/ - a modern and robust search engine with a suite of high-powered full-text search algorithms. Built using the best available open source technologies, OpenSearchServer is high-performance software that you can embed in all your applications for better information access.

URL: http://openfts.sourceforge.net/ - OpenFTS (Open Source Full Text Search engine) is an advanced PostgreSQL-based search engine that provides online indexing of data and relevance ranking for database searching. Close integration with the database allows the use of metadata to restrict search results.

URL: http://www.elasticsearch.org/ – Built with the cloud in mind, Elasticsearch has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through a JSON layer.
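As an illustration of what "expressed through a JSON layer" means for Elasticsearch, the Python sketch below builds the JSON body of a simple full-text query. It uses the match query from the current Elasticsearch Query DSL, which may differ from what was available when this post was written. The field name title is hypothetical, and the body is only built here, not sent.

```python
import json


def build_search_body(field, text, size=10):
    """Return the JSON body for a basic full-text search on one field."""
    body = {
        "query": {"match": {field: text}},  # full-text match on the given field
        "size": size,                       # maximum number of hits to return
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical field name, for illustration only.
    print(build_search_body("title", "open source"))
```

The resulting string would be sent as the body of a request to the search endpoint of an index, and the hits come back as JSON too.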

Filed in: Programming

About the Author

Paul is a regular 30 year old web bloke / programmer with a penchant for online marketing. This blog is a personal outlet, with an eclectic mix of articles.
