Friday, May 25, 2007

Convera Search Engine

SEARCH ENGINE-CONVERA


We were asked to develop a search engine on Convera at a time when we had no knowledge of either search engines or Convera. We analyzed the problem and tried several approaches to resolve it. This white paper describes one of those approaches to working with Convera and getting results from the search engine.
This report assumes that the reader has a good knowledge of J2EE.

1. Search Overview
Search is the process of matching a query against indexes and classifications.

As the quantity of textual information available on the Web and on intranets continues to grow, one of the biggest challenges facing any organization is its ability to harness its own intellectual resources. The search engine is the application that searches the data and returns the results to the client, usually as an HTML page in a specified format. Most search engines search within an index. A few search the files themselves in real time, but that can get very slow.

To send search criteria to the search engine, most systems provide a form. The site visitor enters search terms in a text field and may select appropriate settings in the form. When the visitor clicks the Submit button, the server passes that data to the search engine application.
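To make the idea of searching within an index concrete, here is a minimal sketch of an inverted index in Java. It is only an illustration of the general technique, not how RetrievalWare builds or queries its indexes; the class name TinyIndex and its methods are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each lowercase term to the IDs of documents containing it. */
class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index time: split the text into terms and record the document ID under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Query time: return the IDs of documents that contain every query term. */
    List<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Enterprise search engines build an index");
        index.add(2, "Real-time file scanning gets slow");
        index.add(3, "The search index is queried by the engine");
        System.out.println(index.search("search index")); // prints [1, 3]
    }
}
```

The query never touches the document text: it only intersects the posting sets built at index time, which is why index-based search stays fast as the collection grows.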
Following are some vendors for search:
· Verity
· Convera
· Autonomy
· Fast





As one of the leading contenders in the enterprise space, Convera RetrievalWare has been chosen as the enterprise search solution. Convera positions RetrievalWare® as the industry's first and most advanced knowledge retrieval solution for indexing and searching a wide range of distributed information resources, all from a common user interface. RetrievalWare supports over 200 document types stored on file servers, in groupware systems, relational databases, document management systems, intranets, and the internet. RetrievalWare excels in distributed client/server environments and scales to large numbers of documents and users. Its multi-tiered architecture and configuration options support easy scalability and flexibility for implementation across enterprise networks, intranets, and the World Wide Web. The Convera SDK, the API for RetrievalWare, provides a robust set of sample code and documentation for RetrievalWare's high-level APIs for creating optimized and tailored solutions for your organization or business.







· Highly accurate, natural language concept searching based on Convera’s unique semantic network
· Range of search features, including concept and keyword searching, idiom recognition, fielded searching, query-by-example, and more
· Adaptive Pattern Recognition Processing (APRP™) technology used in pattern search allows search against error-prone text from OCR processes, misspelled words, and irregular names
· Custom Search Templates can be easily created for specific libraries for faster searching and field validation
· Document Summarization provides a fast and efficient way to determine a document’s subject and key themes
· Users can create and save Real Time Profiles and Public Categories that will automatically collect and organize incoming documents of interest



3.2. Intelligent Indexing
· Flexible Document Parser controls the way documents in a library are indexed and viewed
· Synchronizers detect changes to any indexed repository and automatically update the RetrievalWare indexes



3.3. Flexible and Extensible Architecture
· Web based administration wizards guide system setup and maintenance
· Rich Software Developer's Kit, product options, toolkits and other Convera products, such as Visual RetrievalWare and Screening Room, allow customization and integration into almost any environment with any range of multimedia sources



4. RetrievalWare architecture
The RetrievalWare architecture is a fully distributed client/server application running within a TCP/IP connected network of server machines. This architecture provides customers with a high degree of configuration flexibility and scalability, in addition to allowing the product to be extensible and easily integrated into other environments.
RetrievalWare is unique in that it can leverage additional server machines to maintain search performance over large data repositories and large numbers of users. RetrievalWare provides proven, unfailing access to all data types across multiple platforms, even when the number of concurrent users and the size of the collections increase significantly.



5. Search processes
RetrievalWare’s basic search architecture consists of four processes that work together to perform queries, as shown in the figure below.




RetrievalWare processes

· The Executive – manages and routes data among all of the other RetrievalWare processes on the server machine. The Executive routes data from multiple clients to server processes, and also manages the start-up, shutdown, restart, and monitoring of all server processes.
· Client Handler – handles all query requests related to a set of RetrievalWare libraries including dictionary lookups and requests for text of retrieved documents.
· Scheduler – assigns search tasks (requested by the client handler) to Search Servers, so that no Search Server is overloaded.
· Search Server – executes queries and returns the results to the client handler. The Search Server is truly multi-threaded and manages the network I/O and traffic. It contains the engine that performs the search against the RetrievalWare indexes.
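The Scheduler's role of spreading search tasks so that no Search Server is overloaded can be sketched as a least-loaded assignment policy. This is an assumption made for illustration: the source does not say which policy RetrievalWare's Scheduler actually uses, and the class LeastLoadedScheduler and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy scheduler: sends each new task to the search server with the fewest active tasks. */
class LeastLoadedScheduler {
    // Active task count per search server, kept in registration order.
    private final Map<String, Integer> activeTasks = new LinkedHashMap<>();

    /** Registers a search server with no active tasks. */
    void addServer(String name) {
        activeTasks.put(name, 0);
    }

    /** Picks the least-loaded server (first registered wins ties) and records the new task. */
    String assign() {
        if (activeTasks.isEmpty()) {
            throw new IllegalStateException("no search servers registered");
        }
        String best = null;
        for (Map.Entry<String, Integer> entry : activeTasks.entrySet()) {
            if (best == null || entry.getValue() < activeTasks.get(best)) {
                best = entry.getKey();
            }
        }
        activeTasks.merge(best, 1, Integer::sum);
        return best;
    }

    /** Called when a server finishes a task; the count never drops below zero. */
    void complete(String server) {
        activeTasks.computeIfPresent(server, (name, count) -> Math.max(0, count - 1));
    }
}
```

With two registered servers, successive calls to assign() alternate between them until one reports a completed task, at which point that server becomes the preferred target again.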



6. Indexing processes
RetrievalWare indexing processes consist of the following components:
· The Document Handler will receive the external customer document and queue the document for indexing, profiling, and/or RDB loading. A history of submitted documents will be kept to track which documents made it through the indexing processes, so in the event of a failure, only unfinished documents will be processed when the server is restarted.
· The Indexer gets the document from its repository, parses the document and performs the indexing function using the appropriate language plug-ins.
· The Cross Reference maintains a cross-reference mapping of the external customer document IDs with RetrievalWare document numbers (32-bit internal-id assigned at index time). The Cross-Reference caches external to internal ID mapping in disk-based cache files and memory based tables that allow for instant cross-referencing, typically to support efficient document access for operations like viewing and printing.
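The cross-reference idea (external customer document IDs mapped to internal 32-bit document numbers and back) can be sketched with two in-memory tables. This is a simplified illustration only: the sequential numbering starting at 1 and the class CrossReference are assumptions, and the disk-based cache files mentioned above are not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy cross-reference: maps external document IDs to sequential 32-bit internal numbers. */
class CrossReference {
    private final Map<String, Integer> externalToInternal = new HashMap<>();
    private final List<String> internalToExternal = new ArrayList<>();

    /** Returns the existing internal number for the ID, or assigns the next one (starting at 1). */
    int register(String externalId) {
        Integer existing = externalToInternal.get(externalId);
        if (existing != null) {
            return existing;
        }
        internalToExternal.add(externalId);
        int internal = internalToExternal.size();
        externalToInternal.put(externalId, internal);
        return internal;
    }

    /** Looks up the external ID for an internal number, or returns null if it is unknown. */
    String externalIdOf(int internal) {
        if (internal < 1 || internal > internalToExternal.size()) {
            return null;
        }
        return internalToExternal.get(internal - 1);
    }
}
```

Keeping both directions in memory is what makes lookups effectively instant: viewing or printing a hit only needs one table access to get from the internal number back to the customer's own document ID.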



7. Toolkit architecture
The RetrievalWare® Java™ Server Page (JSP) Toolkit allows you to build a web-based interface to the RetrievalWare servers. The Java server pages use RetrievalWare’s Java high-level classes, which in turn call the RetrievalWare C high-level functions. The following diagram illustrates RetrievalWare’s JSP architecture:



8. Toolkit components
The JSP toolkit components are installed in the following directories:





9.1. Initialization of RetrievalWare
The RWLoginSession class serves as an entry point to the RetrievalWare API. RetrievalWare is initialized when this class is instantiated. RWLoginSession can be embedded in Java Server Pages as a bean or used in any high-level Java customization code.
JSP Example :

Java Example: RWLoginSession rwLoginSession = new RWLoginSession();



9.2. Setting login options
Default login options can be set in the configuration file rwserver.cfg, located at <>/rware/config/rwserver.cfg. These options are available for the HLINIT handle.
In a Java program, you can set login options by calling the RWLoginSession.setOption() method. Depending on which option you are setting, you set it with one of three types of value (boolean, string, or integer). Each RWQuery object also has its own thread handle, on which options specific to that query object may be set.



9.3. Logging in to the RetrievalWare servers
The next step is to log in to RetrievalWare. This is handled by the login methods in RWLoginSession. After instantiating RWLoginSession, you must call a login method, whether or not your system is running the security servers. At login, you receive a list of the available libraries from the RetrievalWare name servers (cqns). Which login method you use depends on what security information is known. Use loginKey() when security is performed by an external source and a valid security ticket (key) exists. Otherwise, use login() or loginDomain(), which both require a user name and password.
Turning security on:
RWLoginSession.setOption(RWLoginSession.getInitHandle(),
RWLoginOption.HL_OPT_USE_SECURITY, true);
§ If security is on, use one of the following:
RWLoginSession.login(String username,String password)
RWLoginSession.loginDomain(String nameServer,String username,String password)
RWLoginSession.loginKey(String nameServer,String key)
§ If security is off, use the following, with null as the username and password:
RWLoginSession.login(String username,String password)
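The choice between the three login methods can be summarised as a small decision helper. This is a sketch of the rules described above, not SDK code: LoginChooser, its Method enum, and the rule that an explicit name server selects loginDomain() are assumptions made for illustration.

```java
/** Hypothetical helper (not part of the RetrievalWare SDK) that names the login call to use. */
class LoginChooser {
    enum Method { LOGIN, LOGIN_DOMAIN, LOGIN_KEY }

    /**
     * @param securityOn  whether HL_OPT_USE_SECURITY is enabled
     * @param nameServer  explicit name server, or null to use the configured one
     * @param securityKey ticket issued by an external security source, or null
     */
    static Method choose(boolean securityOn, String nameServer, String securityKey) {
        // An externally issued ticket takes priority when security is on.
        if (securityOn && securityKey != null) {
            return Method.LOGIN_KEY;
        }
        // An explicit name server overrides the configured one (assumed rule).
        if (securityOn && nameServer != null) {
            return Method.LOGIN_DOMAIN;
        }
        // Otherwise login(username, password); both arguments are null when security is off.
        return Method.LOGIN;
    }
}
```

For example, choose(true, null, "ticket") yields LOGIN_KEY, while choose(false, null, null) yields LOGIN, which corresponds to calling login(null, null).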



9.4. Logging in using a username and password
The method RWLoginSession.login() uses the name server specified by HL_OPT_NAME_SERVER to log into a RetrievalWare domain; or if none is specified, it broadcasts for the name server. It takes the username and password as arguments.
If security is not on, you may set the username and password to null.
Example:
try {
    if (rwLoginSession.getOption(rwLoginSession.getInitHandle(),
            RWLoginOption.HL_OPT_USE_SECURITY)) {
        // Security is on: get the user name and password from the request parameters.
        String userName = request.getParameter("USER_NAME");
        String passWord = request.getParameter("PASSWORD");
        // Log in using the given user name and password.
        rwLoginSession.login(userName, passWord);
    } else {
        // Security is not on; leave the user name and password null.
        rwLoginSession.login(null, null);
    }
} catch (RWLoginFailedException ex) {
    // Keep the original exception as the cause so the failure can be diagnosed.
    throw new ServletException("Failed to log in.", ex);
}



The method RWLoginSession.loginDomain() works just like RWLoginSession.login(), except the name server is passed in as an argument that overrides the HL_OPT_NAME_SERVER and CQKEY_KEY HL_DOMAIN settings. If the name server is null, it’s equivalent to calling RWLoginSession.login().



9.5. Logging out
Upon termination of a login session, the application should call the RWLoginSession.freeLogin() method to clean up all query objects and release memory.



9.6. Creating, Setting Up and Executing Queries
After logging into RetrievalWare, create a query object using RWLoginSession method createQuery(int queryMode), which returns an RWQuery object. The query mode is one of the following public static final integers:
OPEN_FOR_QUERY Library is readable
OPEN_FOR_UPDATE Library allows updates
OPEN_RDB Client will connect to rdbqry


Note:
1. If you use OPEN_RDB, the client will connect to rdbqry when a query-related method (such as get/set property) is called the very first time. Otherwise, the high-level program will wait to connect until an RDB function is called.


2. This document covers only OPEN_FOR_QUERY.



You may combine multiple query mode values with bitwise OR, subject to the following conditions:
§ If OPEN_RDB is present, an rdbqry server must be configured or the call will fail.
§ If OPEN_FOR_QUERY or OPEN_FOR_UPDATE is present, a client handler (cqquery) must be configured or the call will fail.
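Combining modes with bitwise OR, and checking which servers a combined mode requires, can be sketched as follows. The numeric flag values and the class QueryModes are assumptions for illustration; the real constants are defined by the SDK and their values are not given in the text.

```java
/** Illustrative flag handling; the values below are assumed, not taken from the SDK. */
class QueryModes {
    static final int OPEN_FOR_QUERY = 1;  // assumed value
    static final int OPEN_FOR_UPDATE = 2; // assumed value
    static final int OPEN_RDB = 4;        // assumed value

    /** True if the mode includes OPEN_RDB, so an rdbqry server must be configured. */
    static boolean needsRdbQry(int mode) {
        return (mode & OPEN_RDB) != 0;
    }

    /** True if the mode includes query or update access, so a client handler (cqquery) is needed. */
    static boolean needsClientHandler(int mode) {
        return (mode & (OPEN_FOR_QUERY | OPEN_FOR_UPDATE)) != 0;
    }

    public static void main(String[] args) {
        int mode = OPEN_FOR_QUERY | OPEN_RDB;
        System.out.println(needsRdbQry(mode));        // prints true
        System.out.println(needsClientHandler(mode)); // prints true
    }
}
```

Because each mode occupies its own bit, one integer can carry any combination, and each requirement can be tested independently with a bitwise AND.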


The RWLoginSession object maintains as many queries as a user might create, within the amount of system resources available. Each query object can then be used to perform multi-threaded query processing if the HL_OPT_MULTI_THREADED option is set. You’re responsible for ensuring thread safe processing in your application.


Creating the query object:
RWQuery rwQuery = null;
rwQuery = rwLoginSession.createQuery(RWQuery.OPEN_FOR_QUERY);

Setting libraries for searching
Before you do any searches, tell RetrievalWare what libraries you want to search. Use the following RWQuery methods to set the libraries for a query object:
setQueryLibrary(RWLibrary library)
setQueryLibrary(String libraryName)
setQueryLibs(String[] libraryNames)

To search a single library, pass its name to the method setQueryLibrary().
Example: rwQuery.setQueryLibrary("SampleLibrary"); // Using a single library for searching.



To search multiple libraries, pass an array of library names to setQueryLibs() or call setQueryLibrary() several times.
Example:
rwQuery.setQueryLibrary("NewsLibrary");
rwQuery.setQueryLibrary("SportsLibrary");



Setting the query type:
If you don’t choose a query type, RetrievalWare will default to Concept (statistical), with expert mode off. Select a query type for each query object using RWQuery method:
void setQueryType(int type) where the type can be any of the following, defined in RWQuery class:
DO_CONCEPT_QUERY Concept query
DO_PATTERN_QUERY Pattern query
DO_BOOLEAN_QUERY Boolean query
DO_EXAMPLE_QUERY Query by example (QBE) query

To set expert mode, use the RWQuery method setExpertQuery(boolean). Expert mode allows you to get the status of the query state. In particular, if you want access to the WORD_LIST_AVAILABLE state, expert mode must be turned on.
Example: rwQuery.setQueryType(RWQuery.DO_CONCEPT_QUERY); // Concept query



Setting up query property values:
RetrievalWare has a number of query properties that control the operation of searching and retrieval. Properties are defined in RWQueryProperty.



To set a property, use one of the following methods, depending upon the type of property being set:



void setProperty(RWIntegerProperty property, int value)
void setProperty(RWBooleanProperty property, boolean value)
void setProperty(RWStringProperty property, String value)
void setProperty(RWCharProperty property, char value)
void setProperty(RWFloatProperty property, double value)


Maximum documents to be retrieved:
rwQuery.setProperty(RWQueryProperty.MAX_DOCS_PROPERTY, 500);

Setting English language for searching:
rwQuery.setProperty(RWQueryProperty.LANGUAGE_PROPERTY, 1);

Expansion level property, word expansion limit:
These properties apply to concept queries. In semantic expansion, query terms are expanded to related terms via the semantic network. Set the level of links in the network to expand to (1–5) using EXPANSION_LEVEL_PROPERTY. Set the maximum number of expansion words for each query term with WORD_EXPANSION_LIMIT_PROPERTY.
// Expansion level set to most strongly related concepts and max expansion words set to 20.
rwQuery.setProperty(RWQueryProperty.EXPANSION_LEVEL_PROPERTY, 3);
rwQuery.setProperty(RWQueryProperty.WORD_EXPANSION_LIMIT_PROPERTY, 20);

Maximum time to execute a query before it is halted:
rwQuery.setProperty(RWQueryProperty.NUM_SECS_PROPERTY, 2000);

Maximum number of fuzzy spelling expansions added to the query per word in pattern mode (this property applies to pattern queries):
rwQuery.setProperty(RWQueryProperty.MAX_FUZZY_SPELL_PROPERTY, 10);


Setting the query string on a document body
Set the query string to search for by calling the setQueryString method.
rwQuery.setQueryString(queryString);
Parameters: queryString - the string to search for
Example: rwQuery.setQueryString("Business Information");



Setting the query string for fielded queries:
rwQuery.setQueryString(queryString, fieldName);
Sets a query string for a fielded query. This method can be called as many times as necessary.
Parameters: queryString - the string to search for
fieldName - the name of the library field to which the string applies
Example: rwQuery.setQueryString("Business Information", "Document_Title");



Setting the sort keys
You control how the documents will be sorted by specifying a sort key. The sort keys are ordered according to their sequence number. Each sort key has a sequence number associated with it, which indicates the sort priority for the key. The key with the lowest sequence number is the primary sort key, the next higher number is the secondary sort key, the next higher number is the tertiary sort key, and so on.



There are 2 steps involved in setting a sort key.
1. Instantiate an RWDocSortKey object
2. Call setSortKey method of RWQuery class.


RWDocSortKey(int keyType, String fieldName, int sequence, boolean isAscending)
Example:
// Sorting by relevance
RWDocSortKey key1 = new RWDocSortKey(RWDocSortKey.SORT_BY_FINE_RANK, "", 1, true);
rwQuery.setSortKey(key1);

// Sorting by field value
RWDocSortKey key2 = new RWDocSortKey(RWDocSortKey.SORT_BY_FIELD_VALUE,
    RWLibraryField.DOC_TITLE_FIELD, 1, true);
rwQuery.setSortKey(key2);
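How sequence numbers turn several sort keys into a primary, secondary, and tertiary ordering can be sketched with a chained comparator. This is a stand-alone illustration and not the SDK's implementation: SortKey and MultiKeySorter are hypothetical classes, and documents are modelled as plain field-name-to-value maps.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Stand-in sort key holding a field name, a sequence (priority) and a direction. */
class SortKey {
    final String field;
    final int sequence;
    final boolean ascending;

    SortKey(String field, int sequence, boolean ascending) {
        this.field = field;
        this.sequence = sequence;
        this.ascending = ascending;
    }
}

class MultiKeySorter {
    /** Builds one comparator from the keys, applying the lowest sequence number first. */
    static Comparator<Map<String, String>> comparatorFor(List<SortKey> keys) {
        List<SortKey> ordered = new ArrayList<>(keys);
        ordered.sort(Comparator.comparingInt((SortKey k) -> k.sequence));

        // Start with a comparator that treats all documents as equal, then add keys in order.
        Comparator<Map<String, String>> result = (a, b) -> 0;
        for (SortKey key : ordered) {
            Comparator<Map<String, String>> byField =
                Comparator.comparing((Map<String, String> doc) -> doc.getOrDefault(key.field, ""));
            result = result.thenComparing(key.ascending ? byField : byField.reversed());
        }
        return result;
    }
}
```

A later key only matters when every earlier key compares equal, so a key with sequence 2 breaks ties left by the key with sequence 1, regardless of the order in which the keys were supplied.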


Executing a query
To actually execute the search, use the following methods in RWQuery:


int execute() - Returns the query state (described below).
int executeToNextState() - Continues executing the query until the next state is reached.
int executeToCompletion() - Continues executing the query until the COMPLETE state is reached.
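The difference between stepping a query one state at a time and running it straight to completion can be sketched with a small stand-in class. StagedQuery is not the SDK's RWQuery: the STARTED state, the numeric values, and the order of the states are assumptions, and only the names WORD_LIST_AVAILABLE and COMPLETE come from the text.

```java
/** Stand-in for a staged query; this is not the SDK's RWQuery. */
class StagedQuery {
    // Assumed states and values: only WORD_LIST_AVAILABLE and COMPLETE are named in the text.
    static final int STARTED = 0;
    static final int WORD_LIST_AVAILABLE = 1;
    static final int COMPLETE = 2;

    private int state = STARTED;

    /** Advances to the next state (if there is one) and returns the state reached. */
    int executeToNextState() {
        if (state < COMPLETE) {
            state++;
        }
        return state;
    }

    /** Keeps advancing until the COMPLETE state is reached. */
    int executeToCompletion() {
        while (state != COMPLETE) {
            executeToNextState();
        }
        return state;
    }

    public static void main(String[] args) {
        StagedQuery query = new StagedQuery();
        // Stop at the intermediate state, for example to inspect the expanded word list.
        if (query.executeToNextState() == WORD_LIST_AVAILABLE) {
            System.out.println("word list available");
        }
        // Then let the query run through to the end.
        query.executeToCompletion();
    }
}
```

Stepping is useful when the caller wants to act between states; when no intermediate action is needed, a single call that runs to completion is simpler.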





Determining the number of returned documents
After executing the query, your query program will need to list and sort the matching documents. Also, it will have to return information about the documents, including their ranks, hits, fields, and field flags. The first step is to determine the number of documents that matched the query, using the RWQuery method getResultDocCount().



Example:
// After executing query
int docCount = rwQuery.getResultDocCount();


This returns the number of returned documents for the last query.



Re-sorting the document list
Once your query has completed and you have a document list, you can change the way this list is sorted. You can do this when you have done a BOOLEAN_QUERY, a CONCEPT_QUERY, or a PATTERN_QUERY.


To re-sort a document list:
1. Delete the sort keys using RWQuery method deleteAllSortKeys()
2. Set up new sort keys
3. Use RWQuery method resortDocList() to resort the list.
RetrievalWare will take the documents in the current list for the specified query object and will reorganize them based upon your sort setting. Note that RetrievalWare does not retrieve additional documents from the library (or libraries).



Example:
// Delete old sort keys
rwQuery.deleteAllSortKeys();
// Set the primary sort key of title field.
int keySequences = 1;
boolean ascending = true;
RWDocSortKey key = new RWDocSortKey(RWDocSortKey.SORT_BY_FIELD_VALUE,
RWLibraryField.DOC_TITLE_FIELD, keySequences,
ascending);
rwQuery.setSortKey(key);
// Resort the result list.
rwQuery.resortDocList();



Getting document information
Use the following RWQuery methods to get documents and document field values from the returned document list:
RWResultDoc getResultDoc(int docNum) : Returns an object that corresponds to a document from the query result set. Documents are numbered from 1 to document count.
String getResultDocField(int docNum, String fieldName) : Returns the value of the specified stored document field of a given document.
Once you have a result document object, use RWResultDoc methods such as those in the example below to get information about the document.



Example:
This example demonstrates getting document information for the first document in the list:
// After executing a query
// Get information in the first document in list.
RWResultDoc doc = rwQuery.getResultDoc(1);
// Display the document information.
out.println("doc ID: " + doc.getDocId() + ", Title: " + doc.getDocTitle() + ", Number of Hits: " + doc.getHitCount() + ", From library: " + doc.getLibraryName());
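Because documents are numbered from 1 to the document count, a loop over the result list starts at 1 and includes the count itself. The sketch below shows that loop against a minimal stand-in interface: ResultList and ResultCollector are hypothetical and only mirror the two accessor names described above, so the code can run without the SDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for the two result accessors described above; not the SDK class. */
interface ResultList {
    int getResultDocCount();

    /** docNum is 1-based: valid values run from 1 to getResultDocCount(). */
    String getResultDocField(int docNum, String fieldName);
}

class ResultCollector {
    /** Collects one field from the first maxDocs results, in result-list order. */
    static List<String> collect(ResultList results, String fieldName, int maxDocs) {
        int last = Math.min(maxDocs, results.getResultDocCount());
        List<String> values = new ArrayList<>();
        // Start at 1 and include 'last', because document numbers are 1-based.
        for (int docNum = 1; docNum <= last; docNum++) {
            values.add(results.getResultDocField(docNum, fieldName));
        }
        return values;
    }
}
```

Starting the loop at 0 or stopping before the count would skip or mis-address documents, which is the usual mistake when moving between 0-based Java collections and a 1-based result list.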

