How to Search a WAIS Database

The WAIS search engine is at the heart of the WAIS Server and Workstation products. It receives a user's question, searches its database for the documents most relevant to the question, and returns a relevance-ranked list of documents to the user. Each document is given a score from 1 to 1000, based on how well it matched the user's question (how many of the question's words it contained, their importance in the document, etc.). A question is an expression containing a combination of natural language, relevant documents, and boolean terms. Other key features of the WAIS search engine include fielded search, right truncation (wildcard searching), and relevance ranking.

Natural Language

The server can be queried using natural language questions. The server does not understand the question; rather, it takes the words and phrases in the question and finds documents that contain them. "Tell me about portable computers." is an example of a natural language question. In this example, the WAIS Server would search for documents containing the words 'portable' and 'computers'; the other words, 'tell', 'me', and 'about', are called "stop words" -- they are so common that they occur in almost every document, so they are not used in the search.
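The stop-word step described above can be sketched in a few lines of Python. This is only an illustration: the stop list and the function name search_terms are assumptions, since the actual WAIS stop list is not given here.

```python
import re

# Hypothetical stop list; the real WAIS stop list is not reproduced in this document.
STOP_WORDS = {"tell", "me", "about", "the", "a", "an", "of", "in", "is"}

def search_terms(question):
    """Lowercase the question, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

print(search_terms("Tell me about portable computers."))
# prints ['portable', 'computers']
```

Only the remaining words would then be looked up in the database.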

Boolean Operators

The boolean operators AND, OR, NOT, and ADJ help establish logical relationships between concepts expressed in natural language. These operators are especially useful for narrowing a search.

The AND operator is helpful in restricting a search when a particular pair of terms is known. For instance, when searching for documents on the weather in Boston, a question such as "weather AND Boston" would return only those documents that contain both the word "weather" and the word "Boston".
The OR operator is often used to join two different phrases of a Boolean search. A question such as "hurricane OR tornado" would search for all documents containing either the word "hurricane", or the word "tornado", or both. A natural language question is much like having an implicit OR between the words, except that the search engine does more work in a natural language query to determine the relevance of words and their relationships in a phrase.
The NOT operator is used to reject any documents that contain certain words. The question "basketball NOT college" would find all documents that contain the word "basketball" but do not contain the word "college". (Note, however, that this question would eliminate articles on any professional players that mention their alma maters!)
The adjacent operator, ADJ, is used to ensure that one word is followed by another in the returned document, with no other words in between. For example, "cordless ADJ telephone" returns only documents with exactly "cordless telephone" and not any documents that only contain the words "cordless" and "telephone" separately.

Mixed Natural Language And Boolean Operators

Unique to the WAIS Inc server is the ability for users to combine natural language and boolean operators to better target their searches. For example, suppose you were looking for documents specifically on portable laptop computers that are not made by Apple. The question could then be "Tell me about portable laptop computers NOT Apple.".
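The behavior of the four operators can be illustrated with simple tests over each document's word list. The following Python sketch is not the WAIS implementation; the helper names (tokens, has, adj) and the sample documents are invented for the example.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has(doc, word):
    """True if the document contains the word."""
    return word.lower() in tokens(doc)

def adj(doc, first, second):
    """True if `first` is immediately followed by `second` in the document."""
    t = tokens(doc)
    return any(a == first.lower() and b == second.lower() for a, b in zip(t, t[1:]))

# Hypothetical sample documents.
docs = {
    "d1": "Weather report for Boston: rain expected.",
    "d2": "A tornado hit Kansas; the weather was severe.",
    "d3": "New cordless telephone models announced.",
    "d4": "The telephone was not cordless, but the drill was cordless.",
}

and_hits = [k for k, d in docs.items() if has(d, "weather") and has(d, "boston")]
or_hits = [k for k, d in docs.items() if has(d, "hurricane") or has(d, "tornado")]
not_hits = [k for k, d in docs.items() if has(d, "weather") and not has(d, "boston")]
adj_hits = [k for k, d in docs.items() if adj(d, "cordless", "telephone")]

print(and_hits, or_hits, not_hits, adj_hits)
# prints ['d1'] ['d2'] ['d2'] ['d3']
```

Note that d4 contains both "cordless" and "telephone" but is rejected by ADJ because the words are not adjacent in that order.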

Fielded Search

For data collections whose documents are structured in a semi-regular format, the regular portions of the documents can be tagged by the WAIS parser as fields. A client can then ask a WAIS server to limit its search to those documents containing a user-specified value of a particular field. This is called a "Fielded Search".

The mail-or-rmail parse format is an example of a parse format in which fields are tagged. For this parse format, the WAIS parser detects the "to" and "cc" fields, the "from" and "sender" fields, the "subject" field, and the "date" field. An example of a question using natural language, a boolean operator, and fielded search is: "company picnic AND from=barbara". The WAIS server would then return documents containing messages about a company picnic that barbara sent.
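As a rough illustration of the question "company picnic AND from=barbara", the Python sketch below treats each message as a dictionary of tagged fields. The message data, the field layout, and the function name fielded_search are assumptions made for the example; the real parser and index format are not described here.

```python
import re

# Hypothetical messages, already split into tagged fields.
messages = [
    {"from": "barbara", "subject": "Company picnic", "body": "The company picnic is on Friday."},
    {"from": "ben", "subject": "Re: Company picnic", "body": "I will bring drinks to the picnic."},
    {"from": "barbara", "subject": "Budget", "body": "Quarterly budget attached."},
]

def fielded_search(msgs, words, field, value):
    """Return messages whose `field` equals `value` and whose text contains any of `words`."""
    hits = []
    for m in msgs:
        text_words = set(re.findall(r"[a-z0-9]+", (m["subject"] + " " + m["body"]).lower()))
        field_ok = m.get(field, "").lower() == value.lower()
        text_ok = any(w.lower() in text_words for w in words)
        if field_ok and text_ok:
            hits.append(m)
    return hits

result = fielded_search(messages, ["company", "picnic"], "from", "barbara")
print([m["subject"] for m in result])
# prints ['Company picnic']
```

The natural language part ("company picnic") is treated as an implicit OR over its words, and the field restriction is applied with AND.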

Right Truncation (Wildcards)

A user can specify right truncation by ending a word with the asterisk ('*') wild card character. This tells the search engine to search on words matching the base characters before the '*' and to ignore any trailing characters. For example, you might use right truncation in a question such as "geo*", which may retrieve documents containing the words: geographer, geography, geologist, geometry, geometrical, etc.
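Right truncation amounts to a prefix match against the indexed words. A minimal Python sketch, assuming a small hypothetical vocabulary and an invented function name expand_truncation:

```python
def expand_truncation(term, vocabulary):
    """Return the vocabulary words matched by `term`; a trailing '*' means prefix match."""
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return sorted(w for w in vocabulary if w.lower().startswith(prefix))
    return [w for w in vocabulary if w.lower() == term.lower()]

# Hypothetical vocabulary standing in for the words in a database index.
vocabulary = ["geography", "geologist", "geometry", "general", "gem", "ecology"]

print(expand_truncation("geo*", vocabulary))
# prints ['geography', 'geologist', 'geometry']
```

Words such as "general" and "gem" are not matched, because all characters before the '*' must match exactly.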

Grouping Search Terms

A user can group search terms and phrases together using parentheses. For example, if you wished to search for information about snowstorms, tornadoes, or hurricanes in New York City, you might search for "(snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)". You can also nest parentheses; for example, "from = ( (ben ADJ wais) OR (brewster ADJ think) )" searches for messages from either ben@wais.com or brewster@think.com.
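Grouping can be pictured as a tree of operators. The Python sketch below assumes the question has already been parsed into nested tuples (the parsing step itself is not shown), and evaluates the tree against one document. The tuple representation and the function name matches are illustrative assumptions.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(expr, toks):
    """Evaluate a nested (operator, operand, ...) tuple against a list of words."""
    if isinstance(expr, str):
        return expr in toks
    op, *args = expr
    if op == "AND":
        return all(matches(a, toks) for a in args)
    if op == "OR":
        return any(matches(a, toks) for a in args)
    if op == "ADJ":
        # In this sketch, ADJ operands are plain words that must appear consecutively.
        n = len(args)
        return any(toks[i:i + n] == list(args) for i in range(len(toks) - n + 1))
    raise ValueError(f"unknown operator: {op}")

# (snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)
query = ("AND",
         ("OR", "snowstorms", "tornadoes", "hurricanes"),
         ("ADJ", "new", "york", "city"))

print(matches(query, tokens("Snowstorms closed schools across New York City.")))
# prints True
print(matches(query, tokens("Hurricanes rarely reach the city of York.")))
# prints False
```

Each pair of parentheses in the question corresponds to one nested tuple, so deeper nesting needs no extra code.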

Relevance Ranking

Each document is scored based on its relevance to a user's question, where the most relevant document has the highest score, or rank -- 1000 being the highest, 1 being the lowest. A document receives a higher score if the words in the question are in the headline, or if the words appear many times, or if phrases occur as in the question. A document's score is derived using techniques such as word weighting, term weighting, proximity relationships, and word density. Note that questions made up of natural language, relevant documents, and boolean expressions are all weighted using these techniques.

Word Weight

If a word in a document is found to match a word in the user's question, the word is assigned a weight, and this weight adds to the overall score of the document. The exact weight that a word receives depends on the emphasis given to the word by the author, and on where in the document the word was found. For example, a word is weighted highest if it appears in the headline, lower if the word has all capital letters or if the first letter of the word is capitalized, and finally, lowest if it appears only in the text. The WAIS parser determines word weights as it reads through the original data collection.
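The ordering described above (headline, then capitalized, then plain text) can be sketched as follows in Python. The numeric weights are hypothetical; only their relative order is taken from the text, and the function name word_weight is invented for the example.

```python
import re

# Hypothetical weights; the actual values used by the WAIS parser are not given.
HEADLINE_WEIGHT = 3
CAPITALIZED_WEIGHT = 2
PLAIN_TEXT_WEIGHT = 1

def words(text):
    """Split text into words, preserving capitalization."""
    return re.findall(r"[A-Za-z0-9]+", text)

def word_weight(word, headline, body):
    """Return a weight for `word` based on where and how it appears."""
    target = word.lower()
    if target in (w.lower() for w in words(headline)):
        return HEADLINE_WEIGHT
    occurrences = [w for w in words(body) if w.lower() == target]
    if not occurrences:
        return 0
    # An all-capitals word also has a capitalized first letter, so one test covers both.
    if any(w[0].isupper() for w in occurrences):
        return CAPITALIZED_WEIGHT
    return PLAIN_TEXT_WEIGHT

headline = "Launch delayed"
body = "Officials at NASA confirmed the delay."
print(word_weight("launch", headline, body),
      word_weight("nasa", headline, body),
      word_weight("delay", headline, body))
# prints 3 2 1
```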

Term Weight

Each word used in a data collection is assigned a numerical value, called the term weight, based on the frequency of occurrence of that word over all documents in the data collection. Words that occur frequently are not weighted as highly as those that appear less frequently. Very common words are either ignored or diminished in the scoring. For example, since the term "animal" may occur frequently in many of the documents in a data collection, its term weight is small compared to that of a term such as "hippopotamus", which may occur only a few times.
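The exact WAIS formula for term weight is not given here. As an illustration of the idea only, the Python sketch below uses a common inverse-document-frequency style weight, log(N / df) + 1, where N is the number of documents and df is the number of documents containing the word. The formula, the sample collection, and the function name term_weights are all assumptions.

```python
import math
import re

def term_weights(documents):
    """Assign each word a weight that shrinks as more documents contain it."""
    n = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each word at most once per document.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}

# Hypothetical data collection.
collection = [
    "The animal slept.",
    "An animal ate.",
    "The hippopotamus is a large animal.",
]

weights = term_weights(collection)
print(weights["hippopotamus"] > weights["animal"])
# prints True
```

Here "animal" appears in every document and gets the minimum weight of 1.0, while "hippopotamus" appears in only one and is weighted higher.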

Proximity Relationships

Under proximity relationships, words from a natural language question that are located close together in a document are given a higher weight than those found farther apart. The idea behind a proximity relationship is that if a document contains a phrase similar to one in the user's question, that document is more likely to be relevant.
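One simple way to picture a proximity relationship is to measure the smallest distance, in words, between two question words in a document and to give a larger bonus for a smaller distance. The Python sketch below does this with a bonus of 1 / distance; both the bonus formula and the function names are assumptions, not the method WAIS actually uses.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(text, first, second):
    """Smallest number of word positions between `first` and `second`, or None if either is absent."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == first.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == second.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def proximity_bonus(text, first, second):
    """Hypothetical bonus: 1.0 for adjacent words, smaller as they move apart."""
    d = min_distance(text, first, second)
    if d is None or d == 0:
        return 0.0
    return 1.0 / d

print(proximity_bonus("Portable computers are popular.", "portable", "computers"))
# prints 1.0
print(proximity_bonus("Portable radios and desktop computers.", "portable", "computers"))
# prints 0.25
```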

Word Density

The ratio of the number of times a word appears in a document to the size of the document is called the word density. It is a measure of how important a word is to the overall content of the document. A higher word density results in a higher relevance ranking.
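The ratio can be computed directly. In the Python sketch below, the size of the document is measured in words; that choice of unit, and the function name word_density, are assumptions made for the example.

```python
import re

def word_density(text, word):
    """Occurrences of `word` divided by the total number of words in the text."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    if not toks:
        return 0.0
    return toks.count(word.lower()) / len(toks)

# "storm" appears 2 times in a 6-word document, giving a density of 2/6.
density = word_density("Storm warning: the storm is near.", "storm")
print(round(density, 3))
# prints 0.333
```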

Courtesy of WAIS Inc.

http://www.sec.gov/edgar/searchedgar/waishelp.htm