iVersant Site Search Service

Overview

The iVersant Site Search Service is a customized installation of MnogoSearch and can be used to provide a site search function for any site on the web. All of the technology is hosted by iVersant, so site search functionality can be added to any site easily and quickly. A customized search form, with your site's look and feel, is created for each site and hosted by iVersant. Although iVersant handles all of the hosting and maintenance, your visitors have a consistent experience and never feel as though they have left your site.

Features

  • Full-text indexing. Different priorities can be configured for the body, title, keywords, and description of a document.
  • Supports all widely used single- and multi-byte character sets, including UTF-8, as well as most popular East Asian languages.
  • Automatic character set and language detection for about 70 charset/language combinations.
  • HTTP/1.0 support.
  • HTTP proxy support.
  • Support for gzip, deflate, and compress content encodings.
  • MySQL backend.
  • Basic authorization support (to index password-protected areas).
  • Both HTML documents and plain-text files can be indexed.
  • Mirroring features.
  • Support for "keywords" and "description" META tags.
  • Continual indexing.
  • Indexing depth can be limited.
  • Robots exclusion standard support (both <META NAME="robots"> and robots.txt).
  • Easily customized search results.
  • Boolean query support.
  • Fuzzy search: different word forms, synonyms, and substrings.
  • Search within a subsection of a site.

Getting Set Up

Sites hosted by iVersant can make use of the Site Search Service for free. Other sites can use the Site Search Service for $20/mo. Once you contact me, your site will be indexed (each page of your site examined and added to our search database) and a customized search results page will be created for you with your site's look and feel.

The only thing for you to do is decide where to put a link to your new search page on your site. You may also, optionally, create a form on your site for searching. Both of these tasks are detailed below.

Linking To Your Search Page

Once set up with the Site Search Service, you will be given the address (a.k.a. "link", or "URL") of your search page. It will be something like this:

    http://search.iversant.com/{your site}.zhtml

Where {your site} is replaced by the name of your site. For example, the search page for iVersant is http://search.iversant.com/iversant.zhtml. You can place this link anywhere in your site with code similar to this:

    <a href="http://search.iversant.com/{your site}.zhtml">Search This Site</a>

Including a Search Form In Your Site

If you would like to have a text input box, or a form, that your site visitors can use to search your site, rather than just a link, you can create your own form using standard HTML similar to this:

    <FORM METHOD=GET ACTION="http://search.iversant.com/{your site}.zhtml">
        <INPUT TYPE="text" NAME="q" SIZE=30>
        <INPUT TYPE="submit" NAME="cmd" VALUE="Search!">
        <INPUT TYPE="hidden" NAME="ul" VALUE="http://{your site URL}">
    </FORM>

Advanced Search Form Features

Several advanced form parameters are available for building more powerful search forms on your site:

  • ps (page size): The number of search results displayed per page, 20 by default. The maximum page size is 100; larger values are rejected to avoid overloading the server (the limit is set by the MAX_PS definition in search.c).
  • m (search mode): Currently the values "all", "any", and "bool" are supported, meaning match all words, match any word, and boolean query, respectively.
  • wm (word match): Chooses the word match type. The values "wrd", "beg", "end", and "sub" respectively mean whole word, word beginning, word ending, and word substring match.
  • o (output format): The search result output format, 0 by default. Each format corresponds to a section of the results template, allowing you to choose, for example, "Long" or "Short" search result listings. Up to 100 different formats are allowed in the same template.
  • ul (URL limit): A URL substring used to limit the search to a subsection of the site.

These search parameters can be used in either a HIDDEN form field or a SELECT field. For example:

    No. Of Results To Return Per Page:
    <SELECT NAME="ps">
        <OPTION>10
        <OPTION>50
        <OPTION>100
    </SELECT>

or:

    Section:
    <SELECT NAME="ul">
        <OPTION VALUE="http://{your site}/">Entire Site
        <OPTION VALUE="http://{your site}/meetings/">Meeting Minutes
        <OPTION VALUE="http://{your site}/articles/">Articles
    </SELECT>
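
The search mode (m) and word match (wm) parameters can be offered the same way, using the values listed above. For example (the option labels here are only suggestions):

    Match:
    <SELECT NAME="m">
        <OPTION VALUE="all">All Words
        <OPTION VALUE="any">Any Word
        <OPTION VALUE="bool">Boolean Query
    </SELECT>

    Word Match:
    <SELECT NAME="wm">
        <OPTION VALUE="wrd">Whole Word
        <OPTION VALUE="beg">Word Beginning
        <OPTION VALUE="end">Word Ending
        <OPTION VALUE="sub">Substring
    </SELECT>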

Excluding Pages From Site Search

There are two ways you can exclude pages from being indexed by the Site Search Service: by using a META tag in a page, or by creating a robots.txt file in the root directory of your site. Both of these methods are described below; more detailed information can be obtained at www.robotstxt.org.

Using META Tags

If you place the following line in the <HEAD> section of any page on your site, it will not be indexed:

    <META NAME="ROBOTS" CONTENT="NOINDEX">
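
If you would also like to keep robots from following the links on a page, the robots exclusion standard (see www.robotstxt.org) lets you combine directives in a single tag:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">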

Using robots.txt

You can prevent the Site Search Service from indexing specific pages or entire sections of your website by creating a robots.txt file and placing it in the root directory of your website. If the Site Search Service finds this file in your website, it will read it to find out which pages or sections of your site you do not want it to index.

robots.txt details

The robots.txt file is a plain-text file (not HTML!) containing a section for each robot to be controlled. Each section begins with a user-agent line naming the robot, followed by a list of "disallow" and "allow" lines. Each disallow prevents any address that starts with the disallowed string from being accessed; similarly, each allow permits any address that starts with the allowed string to be accessed. The (dis)allows are scanned in order, and the last match encountered determines whether an address may be used. If there are no matches at all, the address will be used.

Here's an example:

    user-agent: mnogosearch
    disallow: /mysite/test/
    disallow: /mysite/cgi-bin/post.cgi?action=reply
    disallow: /a

In this example the following addresses would be ignored by the spider:

    http://adomain.com/mysite/test/index.html
    http://adomain.com/mysite/cgi-bin/post.cgi?action=reply&id=1
    http://adomain.com/mysite/cgi-bin/post.cgi?action=replytome
    http://adomain.com/abc.html

and the following ones would be allowed:

    http://adomain.com/mysite/test.html
    http://adomain.com/mysite/cgi-bin/post.cgi?action=edit
    http://adomain.com/mysite/cgi-bin/post.cgi
    http://adomain.com/bbc.html

It is also possible to use an "allow" in addition to disallows. For example:

    user-agent: mnogosearch
    disallow: /cgi-bin/
    allow: /cgi-bin/Ultimate.cgi
    allow: /cgi-bin/forumdisplay.cgi

This robots.txt file prevents the spider from accessing any cgi-bin address except Ultimate.cgi and forumdisplay.cgi. Using allows can often simplify your robots.txt file.

Here's another example showing a robots.txt file with two sections: one for "all" robots, and one for the mnogosearch spider:

    user-agent: *
    disallow: /cgi-bin/

    user-agent: mnogosearch
    disallow:

In this example all robots except the mnogosearch robot will be prevented from accessing files in the cgi-bin directory. mnogosearch will be able to access all files (a disallow with nothing after it means "allow everything").

Sites Using iVersant Site Search

The following sites currently utilize the iVersant Site Search Service: