Added by Arjé Cahn, last edited by Jasha Joachimsthal on Jun 05, 2008

Hippo Repository is built on Apache Slide, which has basic DASL capabilities. Hippo uses a custom Lucene-based DASL implementation in Slide to optimize query performance on the repository. This page explains the query elements that make use of the Lucene index.

NOTE: The shorthand "S" stands for the slide namespace xmlns:S="http://jakarta.apache.org/slide/".

DASL basics

The actual DASL query is wrapped within a Webdav Transformer wrapper:

<request xmlns="http://hippo.nl/webdav/1.0"
    xmlns:d="DAV:"
    xmlns:hc="http://hippo.nl/cms/1.0"
    target="webdav://yourhost/yournamespace/some/path"
    method="SEARCH">
  <body>
    <!-- Your DASL -->
  </body>
</request>

where

  • method: the WebDAV method to use, here "SEARCH"
  • target: the URI the SEARCH method is applied to, which is also the base for the search

An example of a slightly more complex search query (it goes within the body of the wrapper above):

<d:searchrequest xmlns:d="DAV:" xmlns:slide="http://jakarta.apache.org/slide/" xmlns:h="http://hippo.nl/cms/1.0">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <h:caption/>
        <d:displayname/>
        <h:index/>
        <h:type/>
        <h:newsdate/>
        <h:title/>
      </d:prop>
    </d:select>
    <d:from>
      <d:scope>
        <d:href>content</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
      <d:scope>
        <d:href>binaries/pdf</d:href>
        <d:depth>1</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:and>
        <d:contains>amsterdam</d:contains>
        <d:not-is-collection/>
        <d:not>
          <d:eq>
            <d:prop>
              <h:type/>
            </d:prop>
            <d:literal>employees</d:literal>
          </d:eq>
        </d:not>
      </d:and>
    </d:where>
    <d:orderby>
      <d:order>
        <d:prop><h:newsdate/></d:prop>
        <d:descending/>
      </d:order>
    </d:orderby>
    <d:limit>
      <d:nresults>10</d:nresults>
    </d:limit>
  </d:basicsearch>
</d:searchrequest>

  • The DAV:select element contains one element per property that should be returned in the search result.
  • The DAV:from element contains one or more "search scopes" describing where and how deep to search in the tree.
  • DAV:where contains expressions that narrow the search. This element can be empty, which means everything matches. See below for the possible expressions.
  • DAV:orderby contains one or more DAV:order elements which define the sorting of the returned results. Each DAV:order element must define a property to sort by and a sort direction (either DAV:ascending or DAV:descending).
  • DAV:limit has the same purpose as LIMIT in MySQL: the DAV:nresults element defines the maximum number of result resources wanted.

New Elements/Behaviors in Hippo-Repository v1.2.0

  • DAV:limit now also works without the DAV:orderby element
  • DAV:limit may now also include a <S:offset xmlns:S="http://jakarta.apache.org/slide/"/> element, which defines an offset number (integer greater than or equal to 0) to start returning resources from. See the later section about <DAV:limit/> for details.
  • DAV:select can now select three more properties:
    • DAV:score, an integer from 0 to 1000 which tells the "relevance" of this resource given the complete query.
    • S:hitPosition, the position of this resource in the result of the lucene search in the background.
    • S:nrHits, the total hits the query returned. Note that this is the total number of hits, including resources the user might not be able to see.
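
A minimal sketch of a query that uses these additions, assuming the same request wrapper as in the basics section. The scope "content" and the search term are illustrative only.

```xml
<d:searchrequest xmlns:d="DAV:" xmlns:S="http://jakarta.apache.org/slide/">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:displayname/>
        <!-- Relevance of the resource, 0 to 1000 -->
        <d:score/>
        <!-- Position of the resource in the underlying Lucene result -->
        <S:hitPosition/>
        <!-- Total number of hits, including resources the user may not see -->
        <S:nrHits/>
      </d:prop>
    </d:select>
    <d:from>
      <d:scope>
        <d:href>content</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:contains>amsterdam</d:contains>
    </d:where>
    <!-- No d:orderby here: from 1.2.0 on, d:limit works without it -->
    <d:limit>
      <d:nresults>10</d:nresults>
    </d:limit>
  </d:basicsearch>
</d:searchrequest>
```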

DASL Scope definition

<DAV:scope/> allows you to specify where in the file hierarchy the search results should come from. All paths mentioned here are relative to the method's target.

Within the <DAV:from/> element, multiple <DAV:scope/> elements are allowed. This makes it possible to search multiple disjoint hierarchies at the same time. The example above searches both the "content" and "binaries/pdf" paths, but no other subfolders of "webdav://yourhost/yournamespace/some/path" (the target of the request).

Each <DAV:scope/> must have these child elements:

  1. <DAV:href/>, which may contain a relative path (relative to the target), or an absolute path on the server to search.
  2. <DAV:depth/>, which defines how deep to search the path given above. Admissible values are:
    1. 0 - search only the URI given (rarely useful)
    2. 1 - search only the URI and its direct children (like all resources in a collection)
    3. infinity - search everything under the URI, including the URI itself (i.e. recursively)

Each <DAV:scope/> may also have these child elements:

  1. <S:minimum-depth/>, which defines the depth from which results start to be returned. A value of "1" effectively skips the collection that <DAV:href/> references. Any positive integer is admissible.
  2. <S:exclude/> (multiple are allowed), which defines a path that should not be included in the search result. This way, it is possible to explicitly exclude files and whole directories from the result set.
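
A sketch of a scope that combines the required and optional child elements. The excluded paths are hypothetical, and whether <S:exclude/> paths are resolved relative to the target or to <DAV:href/> is an assumption you should verify against your repository.

```xml
<d:from xmlns:d="DAV:" xmlns:S="http://jakarta.apache.org/slide/">
  <d:scope>
    <d:href>content</d:href>
    <d:depth>infinity</d:depth>
    <!-- Skip the "content" collection itself; results start at its children -->
    <S:minimum-depth>1</S:minimum-depth>
    <!-- Hypothetical paths: leave out a whole directory and a single file -->
    <S:exclude>content/archive</S:exclude>
    <S:exclude>content/drafts/todo.xml</S:exclude>
  </d:scope>
</d:from>
```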

DASL Expressions

Merge expressions (Expressions which consist of others)

<DAV:or/>

Boolean expression, true if any of the sub-expressions are true.

<DAV:and/>

Boolean expression, true if all of the sub-expressions are true.

<DAV:not/>

Boolean expression, negates the child expressions.
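
The merge expressions can be nested. The sketch below, which uses the h:type property from the earlier example with made-up values "news" and "events", matches non-collection resources whose type is either of the two.

```xml
<d:where xmlns:d="DAV:" xmlns:h="http://hippo.nl/cms/1.0">
  <d:and>
    <!-- Either of the two (hypothetical) types is accepted -->
    <d:or>
      <d:eq>
        <d:prop><h:type/></d:prop>
        <d:literal>news</d:literal>
      </d:eq>
      <d:eq>
        <d:prop><h:type/></d:prop>
        <d:literal>events</d:literal>
      </d:eq>
    </d:or>
    <!-- Exclude folders -->
    <d:not>
      <d:is-collection/>
    </d:not>
  </d:and>
</d:where>
```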

All following expressions are also available in their negated form. For the negated form, just prefix the names with 'not-'. E.g.

<DAV:eq/> -> <DAV:not-eq/>

Comparison expressions

These are of the form:

<DAV:operator>
  <DAV:prop> YOUR PROPERTY TO CHECK </DAV:prop>
  <DAV:literal> THE VALUE TO CHECK AGAINST </DAV:literal>
</DAV:operator>

<DAV:eq/>

Checks if the value of the given property equals the literal value you specified.

<DAV:gt/>

Checks if the value of the given property is greater than the literal value you specified.

<DAV:gte/>

Checks if the value of the given property is greater than or equal to the literal value you specified.

<DAV:lt/>

Checks if the value of the given property is less than the literal value you specified.

<DAV:lte/>

Checks if the value of the given property is less than or equal to the literal value you specified.
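
As an illustration of the comparison form, the following sketch selects resources of at least 1 KB but smaller than 1 MB. It assumes DAV:getcontentlength is indexed with type "int", as in the indexer configuration shown further down this page.

```xml
<d:where xmlns:d="DAV:">
  <d:and>
    <!-- Lower bound, inclusive -->
    <d:gte>
      <d:prop><d:getcontentlength/></d:prop>
      <d:literal>1024</d:literal>
    </d:gte>
    <!-- Upper bound, exclusive -->
    <d:lt>
      <d:prop><d:getcontentlength/></d:prop>
      <d:literal>1048576</d:literal>
    </d:lt>
  </d:and>
</d:where>
```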

<DAV:like/>

Checks if the value of the given property matches the literal. Basically equal to <S:propcontains/>, but you may use the '%' and '?' characters like in SQL.
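
A small sketch of a <DAV:like/> expression. It uses only the '%' wildcard, assumed to match any sequence of characters as in SQL, and a made-up name prefix.

```xml
<d:where xmlns:d="DAV:">
  <!-- Matches display names that start with "report" (hypothetical prefix) -->
  <d:like>
    <d:prop><d:displayname/></d:prop>
    <d:literal>report%</d:literal>
  </d:like>
</d:where>
```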

Other DAV Expressions

<DAV:is-defined/>

<DAV:is-defined>
   <DAV:prop>YOUR PROPERTY </DAV:prop>
</DAV:is-defined>

Checks if the given property exists on a resource.

<DAV:is-collection/>

Checks if the resource is a collection or not.

<DAV:contains/>

<DAV:contains>
    YOUR Lucene QUERY
</DAV:contains>

Matches resources which have a content and match the given query. See http://lucene.apache.org/java/docs/queryparsersyntax.html for the query syntax. Use DAV:contains only if you want to search inside the whole document content.
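
For example, a content search that uses a few features of the Lucene query syntax might look like this. The search terms are illustrative only.

```xml
<d:where xmlns:d="DAV:">
  <!-- "amsterdam" is required, "rotterdam" is prohibited,
       and at least one of "museum" or "gallery" must occur -->
  <d:contains>+amsterdam +(museum gallery) -rotterdam</d:contains>
</d:where>
```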

Extra Slide Expressions, NON-STANDARD

The namespace is:
xmlns:S="http://jakarta.apache.org/slide/"

<S:between/>

<S:between>
    <DAV:prop>YOUR PROPERTY</DAV:prop>
    <DAV:literal>LOWER BOUND</DAV:literal>
    <DAV:literal>UPPER BOUND</DAV:literal>
</S:between>

Checks if the value is between the two literals given.

<S:between-inclusive/>

Like between, but the range includes the literals.
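
A sketch of the inclusive variant, assuming it takes the same children as <S:between/> and that DAV:getcontentlength is indexed with type "int".

```xml
<d:where xmlns:d="DAV:" xmlns:S="http://jakarta.apache.org/slide/">
  <!-- Matches sizes from 1024 up to and including 4096 bytes -->
  <S:between-inclusive>
    <d:prop><d:getcontentlength/></d:prop>
    <d:literal>1024</d:literal>
    <d:literal>4096</d:literal>
  </S:between-inclusive>
</d:where>
```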

<S:is-principal/>

Checks if the resource is a principal (user/role/action node) or not.

<S:is-version-history/>

Checks if the resource is a version-history or not.

Some old and new friends of the form:

<S:operator>
    <DAV:prop>YOUR PROPERTY TO CHECK</DAV:prop>
    <DAV:literal>THE VALUE TO CHECK AGAINST</DAV:literal>
</S:operator>

<S:(not-)property-contains/>

The given property has to be of type text. This operator will search for a whole word in the property and its best use is for the "search for entries in a comma-separated list" case in conjunction with the "LowercaseCommaSeparatedAnalyzer" mentioned below. Just like with the <DAV:contains/> you can use the Lucene query syntax.
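
A sketch of the comma-separated list case, assuming the h:keyword property is indexed with type "text" and the LowercaseCommaSeparatedAnalyzer as in the indexer configuration further down this page. The keyword itself is made up.

```xml
<d:where xmlns:d="DAV:"
         xmlns:S="http://jakarta.apache.org/slide/"
         xmlns:h="http://hippo.nl/cms/1.0">
  <!-- Matches resources whose keyword list, e.g. "amsterdam,museum,art",
       contains the entry "museum" -->
  <S:property-contains>
    <d:prop><h:keyword/></d:prop>
    <d:literal>museum</d:literal>
  </S:property-contains>
</d:where>
```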

<S:(not-)strict-property-contains/> (1.2.8 and higher ONLY)

Like <S:property-contains/>, but uses the analyzer given in the indexer configuration to parse the value given as the literal. This is most useful for comma-separated lists in which items contain spaces, like "zero,number one,two". You can use this expression to search for "number one" and find the resource with that list. The query parser in <S:property-contains/> would treat the space as a separator, although it is not one here, so a <S:property-contains/> query would not find that list: it searches for "number" and "one" instead of "number one".
If the analyzer finds multiple words, like items in a comma-separated list, it requires that all items are present, i.e. it ANDs them.
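
The "number one" case described above could be written as follows. This is a sketch that assumes the list is stored in the h:keyword property and indexed with a comma-separated analyzer.

```xml
<d:where xmlns:d="DAV:"
         xmlns:S="http://jakarta.apache.org/slide/"
         xmlns:h="http://hippo.nl/cms/1.0">
  <!-- The literal is parsed by the configured analyzer, so the space
       in "number one" is kept as part of a single list entry -->
  <S:strict-property-contains>
    <d:prop><h:keyword/></d:prop>
    <d:literal>number one</d:literal>
  </S:strict-property-contains>
</d:where>
```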

<S:(not-)propcontains/>

Like the old friend from pre-1.2 repositories, this one searches for substrings in a property. The type of the property has to be either string or text. This is rather slow and resource-intensive, so prefer <S:property-contains/> or, even better, <S:propsearch/>.

<S:propsearch/> (1.2.12 and higher ONLY)

From repository 1.2.12 onwards, WebDAV properties on a document that are indexed as UN_TOKENIZED strings are also indexed as TOKENIZED text. This means you can now search properties the same way you search content with <DAV:contains/>, as a normal text search. In <S:propsearch/> you can use ordinary Lucene query syntax, for example boosting or fuzzy searches.
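
A sketch of a property search with a fuzzy term and a boosted term. It assumes <S:propsearch/> takes the same prop/literal children as the other Slide operators above; the property and terms are illustrative.

```xml
<d:where xmlns:d="DAV:"
         xmlns:S="http://jakarta.apache.org/slide/"
         xmlns:h="http://hippo.nl/cms/1.0">
  <S:propsearch>
    <d:prop><h:title/></d:prop>
    <!-- "amsterdm~" is a fuzzy match, "museum^2" boosts that term -->
    <d:literal>amsterdm~ OR museum^2</d:literal>
  </S:propsearch>
</d:where>
```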

The Limit expression

With <DAV:limit/> you can express limitations on the result set in terms of how many results are returned. There are two possible child elements:

  1. <DAV:nresults/>, which contains an integer defining the maximum number of results to generate. There may be fewer if the search matches fewer resources.
  2. <S:offset xmlns:S="http://jakarta.apache.org/slide/"/>, which tells the search to start generating results from a certain offset on, very similar to the (nonstandard) SQL OFFSET clause.

Important: the value of <S:offset xmlns:S="http://jakarta.apache.org/slide/"/> is not a position in the sequence of returned results, but a position in the sequence of hits returned from Lucene, which is the backend used within Hippo Repository. Some of the hits might not be visible to the searching user due to permissions and would thus be skipped. To use this offset properly, keep track of the Lucene hits by selecting the <S:hitPosition/> property, and use the last value you received in the previous search plus one as the offset for the next.
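
The paging scheme described above can be sketched as follows. The numbers are made up: assume the last <S:hitPosition/> value received in the previous search was 13, so the next request uses 14 as its offset. Writing the offset as the text content of the element is an assumption based on the description above.

```xml
<d:searchrequest xmlns:d="DAV:" xmlns:S="http://jakarta.apache.org/slide/">
  <d:basicsearch>
    <d:select>
      <d:prop>
        <d:displayname/>
        <!-- Needed to compute the offset for the following page -->
        <S:hitPosition/>
      </d:prop>
    </d:select>
    <d:from>
      <d:scope>
        <d:href>content</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:contains>amsterdam</d:contains>
    </d:where>
    <d:limit>
      <d:nresults>10</d:nresults>
      <!-- Last hitPosition of the previous page (13, hypothetical) plus one -->
      <S:offset>14</S:offset>
    </d:limit>
  </d:basicsearch>
</d:searchrequest>
```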

General Performance Tips:

  • The number of results is the bottleneck, nothing else, so always use <DAV:limit/>. If you need pagination, you can keep track of the position of your results using the <S:hitPosition/> property described above.
  • Since evaluating the DASL itself is so cheap, make it a point to have one DASL per job. With previous versions, it was sometimes necessary to use two or more aggregated DASLs. If possible, use ONE so you don't have to do any post processing of the results.

Slide Namespace Configuration

The Namespace configuration in domain.xml:

<namespace name="space">
  <definition xsrc="config/slide/space/definition.xml"/>
  <configuration xsrc="config/slide/configuration.xml"/>
  <data xsrc="config/slide/data.xml"/>
  <extractors xsrc="config/slide/space/extractors.xml"/>
  <indexer class="nl.hippo.slide.index.LuceneIndexerDASLImpl">
    <cron>0 * * * * ? *</cron>
    <indexpath>../slide_indices/space</indexpath>
    <analyzer class="nl.hippo.slide.index.analysis.SimpleStandardAnalyzer"/>
    <default-property-analyzer class="nl.hippo.slide.index.analysis.SimpleStandardAnalyzer"/>
    <!-- This is false by default, meaning all searches are case-insensitive with respect to the values
           of the properties. If you set this to true, searches become case-sensitive, meaning
          "HiPPo" and "hIpPo" will be considered to be different values. -->
   <case-sensitive>false</case-sensitive>
   <index-all/>
   <properties>
     <property namespace="DAV:" name="getcontenttype" type="string" support-defined="true"/>
     <property namespace="DAV:" name="getlastmodified" type="date"/>
     <property namespace="DAV:" name="creationdate" type="date"/>
     <property namespace="DAV:" name="getcontentlength" type="int"/>
     <property namespace="http://hippo.nl/cms/1.0" name="keyword" type="text"
     analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"/>
   </properties>
   <!-- These are slide internal types:
   collection: this resource is a folder.
   principal: this resource is a user, role or action node.
   version-history: this resource describes some previous state in the history.

   Since the values of the resourcetype property are XML, they have to be converted into plain text,
   which is why the configuration below exists. We check if the value is one of these known types and then
   index the plain text name instead of an XML string.
   -->
   <resource-types>
     <resource-type name="collection" namespace="DAV:"/>
     <resource-type name="principal" namespace="DAV:"/>
     <resource-type name="version-history" namespace="DAV:"/>
   </resource-types>

   <!-- MEMORY OPTIMIZATION SETTINGS, these are all optional -->
   <!--
     buffered-docs gives the number of documents to be buffered in memory.
     The higher, the more RAM used, the faster is the indexing and optimization.
     Default: 100
     see http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxBufferedDocs(int)
   -->
   <buffered-docs>1000</buffered-docs>
   <!--
     merge-docs gives the number of documents merged in one file, set it very high!
     Default: Integer.MAX_VALUE
     Since the default value is already so large, you should not need to override it!
     see http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxMergeDocs(int)
   -->
   <!--merge-docs>10000000</merge-docs-->
   <!--
     Determines how often segment indices are merged
     10 means: each time a tenth segment needs to be made on disk, merge it with another
     >10 -> more RAM used, slower searches on an unoptimized index (more files), faster indexing
     <10 -> less RAM used, quicker searches (fewer files), slower indexing
     Default: 10
     see http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
   -->
   <merge-factor>10</merge-factor>
   <!--
     Optimize-docs gives the number of documents (files and folders in slide) after which to run the optimization
     on the index. This is only used for optimizing the index, not for updating it.
     Basically, this number of documents have to change until lucene optimizes the index again for searching. Good
     for high-volume editing (many users of the cms).
     Search speed degrades a little with larger values as more files are present on average. Adjust this value to
     website size and average rate of editing in the cms.
     Default: 100
   -->
   <optimize-docs>50</optimize-docs>
 </indexer>
</namespace>

The above example shows a configuration for the indexer, which is needed for the Lucene-based DASLs to work. Notice that you can assign types to properties: 'int', 'date', 'text', or 'string'. The default type is 'string'. If you specify the <index-all/> element (as above), all properties will be indexed.

Note that you can also keep the indexer configuration in an external file, as is done above for the extractors and the other elements. In that case, write:

<indexer xsrc="config/slide/space/indexer.xml"/>

The types mean:

'int': The value present in the property is treated as an integer and stored in a way that makes range queries and numeric sorting possible.

'date': The property is parsed as a date. This works with the standard DAV getlastmodified and creationdate properties. For the Hippo types, this shouldn't be necessary since the format is already string-friendly.

'text': The property is a text which needs to be tokenized (broken up into words) by a Lucene analyzer. The class name of the analyzer needs to be in the configuration. If no analyzer is given, the default-property-analyzer is used. This type is extremely useful for comma-separated lists where you want to search for a specific entry in the list. Use this type in conjunction with the <S:property-contains/> expression.

'string': The default type, which only indexes the string present in the property.
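
As a sketch, a <properties/> block that declares one property of each type could look like this. The Hippo property names "priority" and "summary" are hypothetical and serve only as illustration; the attributes are the ones used in the configuration above.

```xml
<properties>
  <!-- int: enables range queries and numeric sorting ("priority" is hypothetical) -->
  <property namespace="http://hippo.nl/cms/1.0" name="priority" type="int"/>
  <!-- date: parsed as a date -->
  <property namespace="DAV:" name="getlastmodified" type="date"/>
  <!-- text with an explicit analyzer: suited to comma-separated lists -->
  <property namespace="http://hippo.nl/cms/1.0" name="keyword" type="text"
            analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"/>
  <!-- text without an analyzer: the default-property-analyzer is used
       ("summary" is hypothetical) -->
  <property namespace="http://hippo.nl/cms/1.0" name="summary" type="text"/>
  <!-- string: the value is indexed as a whole -->
  <property namespace="http://hippo.nl/cms/1.0" name="type" type="string"/>
</properties>
```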

You also need two more parameters in your definition.xml:

<parameter name="basicQueryClass">org.apache.slide.search.basic.LuceneBasicQuery</parameter>
<parameter name="basicQueryEnvelopeClass">org.apache.slide.search.basic.LuceneBasicQueryEnvelope</parameter>

These belong just before the nodestore element in your main store (usually the SQL store, which maps to "/"). Without these, the DASLs will still be very slow!

Notes about specific DASL Tags

D:eq (equal)

With this constraint, you will only find resources which actually have this property set, never ones which do not have this property at all. So, if you search for resources with

<D:not-eq>

or

<D:not><D:eq>

you will only find resources with this property present but having a different value than you specified.
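
If resources without the property should match as well, one possible approach is to combine the negated comparison with a check for the missing property. This is a sketch that assumes the negated form <D:not-is-defined/> is available, following the 'not-' prefix rule described earlier.

```xml
<d:where xmlns:d="DAV:" xmlns:h="http://hippo.nl/cms/1.0">
  <d:or>
    <!-- Property present, but with a different value -->
    <d:not-eq>
      <d:prop><h:type/></d:prop>
      <d:literal>employees</d:literal>
    </d:not-eq>
    <!-- Property not present at all -->
    <d:not-is-defined>
      <d:prop><h:type/></d:prop>
    </d:not-is-defined>
  </d:or>
</d:where>
```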

D:gt and D:lt (greater-than and less-than)

Lucene seems to optimize the query or maybe it needs to do some preprocessing to be able to evaluate the query afterwards. It seems to remove "impossible" criteria, like having a gt with a larger argument than a lt. It also collapses an lt and a gt for the same field into one range query.

That's why something like

<D:and>
  <D:lt>
    <D:prop><someintegerprop/></D:prop>
    <D:literal>123</D:literal>
  </D:lt>
  <D:gt>
    <D:prop><someintegerprop/></D:prop>
    <D:literal>124</D:literal>
  </D:gt>
</D:and>

will not work, because the two conditions contradict each other and are therefore dropped.

<D:and>
  <D:lt>
    <D:prop><someintegerprop/></D:prop>
    <D:literal>123</D:literal>
  </D:lt>
  <D:gt>
    <D:prop><someintegerprop/></D:prop>
    <D:literal>122</D:literal>
  </D:gt>
</D:and>

will become a range query from 122 to 123 (not two separate less-than and greater-than).

S:between

Lucene automatically orders the two bounds you specify in the DASL so that the smaller one becomes the lower bound and the larger one becomes the upper bound. This way, you don't get results completely outside the range you meant.

References

See http://www.webdav.org/dasl for the DASL specifications
Some DASL notes are available on the Slide Wiki