Added by Dennis Dam, last edited by Dirkjan van Diepen on Jan 10, 2008

Purpose

The purpose of the link checker is to check links to external websites as well as internal links (links to repository documents) which occur in the content of XML documents.

Current version

The information below is about the rewritten broken link checker that was added to Hippo CMS in version 6.05.02.

Hippo CMS version v6.05.02 and up

Reasons for the rewrite

The broken link checker was rewritten for the following reasons:

  • the DASL limitation of a maximum of 1000 results was not taken into account, so not all documents were checked for broken links
  • checking the links took a long time because only a single thread was used. This thread spent most of its time waiting for a response from the servers the links point to. The current version uses multiple threads to check a number of links simultaneously
  • the previous implementation depended heavily on the CMS and Avalon. The current implementation can also be run outside the CMS and does not require an Avalon container
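
The multi-threaded approach described above can be illustrated with a minimal sketch. This is not the actual Hippo implementation: the class name ParallelLinkCheckSketch, the method checkAll and the isReachable predicate are hypothetical names introduced here, and the predicate stands in for whatever HTTP check the real checker performs. The sketch only shows the general idea of a fixed pool of worker threads, so that slow servers do not block the checking of other links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Illustrative sketch only; not part of the Hippo CMS code base.
public class ParallelLinkCheckSketch {

    /**
     * Checks all links using a fixed pool of worker threads.
     * 'isReachable' is a hypothetical stand-in for the real link check.
     * Returns a map from link to true (ok) or false (broken, failed or timed out).
     */
    public static Map<String, Boolean> checkAll(List<String> links,
                                                Predicate<String> isReachable,
                                                int threads,
                                                long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per link; the pool runs up to 'threads' of them at once.
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final String link : links) {
                Callable<Boolean> task = () -> isReachable.test(link);
                futures.add(pool.submit(task));
            }
            // Collect the results in the original order.
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (int i = 0; i < links.size(); i++) {
                boolean ok;
                try {
                    // The timeout counts from the moment we start waiting for this result.
                    ok = futures.get(i).get(timeoutSeconds, TimeUnit.SECONDS);
                } catch (ExecutionException | TimeoutException e) {
                    futures.get(i).cancel(true);
                    ok = false;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    ok = false;
                }
                results.put(links.get(i), ok);
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("http://a.example/ok", "http://b.example/broken");
        // Stand-in check: treat links ending in "/ok" as reachable.
        System.out.println(checkAll(links, link -> link.endsWith("/ok"), 10, 10));
    }
}
```

The values 10 and 10 in main mirror the example settings for the number of link checking threads and the link check timeout used elsewhere on this page.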

Running inside the CMS

To be able to run the broken link checker inside the CMS, a wrapper which allows the broken link checker to be managed by Avalon has been written: nl.hippo.cms.brokenlinkchecker.avalon.BrokenLinkCheckerAvalonWrapper. To use this wrapper, the following three things must be done:

  • declare the component so Avalon will instantiate the wrapper
  • add the configuration declaration for the wrapper
  • set the configuration in the build properties

Component declaration

Here is how the component can be declared in user.xroles:

<role
    default-class="nl.hippo.cms.brokenlinkchecker.avalon.BrokenLinkCheckerAvalonWrapper"
    name="BrokenLinkChecker"
    shorthand="broken-link-checker"/>

This declaration is added during the build by the xpatch script stored in src/config/broken-link-checker.xroles.

Component configuration

Just declaring the component is not enough. The component must be configured too. This is done by adding the following to cocoon.xconf:

<broken-link-checker>
    <parameter name="enabled" value="true"/>
    <parameter name="role-of-this-component" value="BrokenLinkChecker"/>
    <parameter name="job-name" value="BrokenLinksCheckerJob"/>
    <parameter name="cron-expression" value="0 0 4 * * ?"/>
    <parameter name="document-tree-to-check-root-url" value="http://localhost:60000/default/files/default.www/content/bulk"/>
    <parameter name="documents-base-url" value="/default/files/default.www"/>
    <parameter name="internal-url-prefixes-to-ignore" value="/assets/binaries/ /binaries/"/>
    <parameter name="internal-links-base-url" value="http://localhost:60000/default/files/default.www"/>
    <parameter name="repository-username" value="root"/>
    <parameter name="repository-password" value="password"/>
    <parameter name="result-document-url" value="http://localhost:60000/default/files/default.www/broken-links.xml"/>
    <parameter name="document-batch-size" value="100"/>
    <parameter name="number-of-link-checking-threads" value="10"/>
    <parameter name="link-check-timeout-seconds" value="10"/>
</broken-link-checker>

This configuration is added during the build by the xpatch script stored in src/config/broken-link-checker.xconf.
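
The internal-url-prefixes-to-ignore parameter in the example above holds several prefixes separated by spaces. As a rough sketch of how such a value could be interpreted, assuming plain prefix matching on the internal URL (the page does not describe the actual matching rules), consider the following. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; the real checker may match prefixes differently.
public class IgnoredPrefixSketch {

    /** Splits a whitespace-separated parameter value into individual prefixes. */
    public static List<String> parsePrefixes(String parameterValue) {
        String trimmed = parameterValue.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Returns true if the internal URL starts with one of the prefixes. */
    public static boolean isIgnored(String internalUrl, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (internalUrl.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = parsePrefixes("/assets/binaries/ /binaries/");
        System.out.println(isIgnored("/binaries/logo.png", prefixes));
        System.out.println(isIgnored("/content/bulk/page.xml", prefixes));
    }
}
```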

Build properties

The configuration of the Avalon wrapper is set using build properties. The following defaults are included in project.properties; override them in build.properties if needed. The properties for the new version use the prefix cms.brokenLinkChecker instead of cms.linkchecker to prevent confusion with the deprecated version:

# Configuration for the broken link checker. See the
# Javadoc of class 'BrokenLinkCheckerAvalonWrapper' for an
# explanation of these properties.
#
# Setting an optional property here will not cause it to
# be used. You also have to enable the corresponding XML
# element in 'broken-link-checker.xconf'.
#
cms.brokenLinkChecker.enabled=false
cms.brokenLinkChecker.role=BrokenLinkChecker
cms.brokenLinkChecker.jobName=BrokenLinksCheckerJob
# Example value: daily at 4:00 a.m.
cms.brokenLinkChecker.cronExpression=0 0 4 * * ?
cms.brokenLinkChecker.documentTreeToCheck=http://localhost:60000/default/files/default.www/content/bulk
cms.brokenLinkChecker.documentsBase=/default/files/default.www
cms.brokenLinkChecker.internalUrlPrefixesToIgnore=/assets/binaries/ /binaries/
cms.brokenLinkChecker.internalLinksBase=http://localhost:60000/default/files/default.www
cms.brokenLinkChecker.repositoryUsername=root
cms.brokenLinkChecker.repositoryPassword=password
cms.brokenLinkChecker.resultDocumentUrl=http://localhost:60000/default/files/default.www/broken-links.xml
cms.brokenLinkChecker.documentBatchSize=100
cms.brokenLinkChecker.numberOfLinkCheckingThreads=10
cms.brokenLinkChecker.linkCheckTimeoutSeconds=10

NOTE: The directory that contains the broken-links.xml file must already exist, otherwise the broken link checker will not work. You can point the result document to any directory you like, as long as that directory exists.
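
To find out which directory must exist for a given result document URL, the parent can be derived from the URL itself. The sketch below uses only java.net.URI. The class and method names are hypothetical, and the sketch does not check whether the directory actually exists in the repository.

```java
import java.net.URI;

// Illustrative sketch only: derives the directory of the result document.
public class ResultDirectorySketch {

    /** Returns the URL of the directory that contains the result document. */
    public static URI parentDirectory(String resultDocumentUrl) {
        // Resolving "." against a file URL yields its containing directory.
        return URI.create(resultDocumentUrl).resolve(".");
    }

    public static void main(String[] args) {
        System.out.println(parentDirectory(
                "http://localhost:60000/default/files/default.www/broken-links.xml"));
    }
}
```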

Running outside the CMS

The broken link checker can also be run in an environment other than the CMS. First create an instance of nl.hippo.cms.brokenlinkchecker.BrokenLinkCheckerRunConfigurationBean and set the appropriate values. See the Javadoc of the constructor of the configuration bean and of method assertConfigurationIsValid(...) of BrokenLinkCheckerRun for information about required, optional and default values. Then create an instance of nl.hippo.cms.brokenlinkchecker.BrokenLinkCheckerRun, passing the configuration just created. To start checking for broken links, invoke execute():

BrokenLinkCheckerRunConfigurationBean config = new BrokenLinkCheckerRunConfigurationBean();
// Set values on 'config'
BrokenLinkCheckerRun run = new BrokenLinkCheckerRun(config);
run.execute();

Deprecated version

The rest of the information on this page is about the deprecated version.

Hippo CMS version v6.05.02-dev and up

Requirements

The following things have to be configured:

  • The links in a document should be extracted by Slide into the property http://hippo.nl/cms/1.0:links. The links property should be a space-separated list of valid URLs. This can be achieved by using the UrlListXMLPropertyExtractor property extractor.
    Configure the following CMS property:
    <extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor" uri="/files/default.preview" content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
      <configuration>
        <instruction property="links" namespace="http://hippo.nl/cms/1.0" xpath="//@href|//@src"/>
      </configuration>
    </extractor>

    In addition, configure the indexer as follows:

    <property namespace="http://hippo.nl/cms/1.0" name="links" type="text" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  • Configure the system user CMS properties, e.g.:
    maven.cocoon.repository.systemcredentials.username=root
    maven.cocoon.repository.systemcredentials.password=password

    These properties are used by the linkchecker to crawl the repository.
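
To illustrate what the XPath expression //@href|//@src in the extractor configuration above selects, here is a minimal sketch using the standard javax.xml.xpath API. It is not the UrlListXMLPropertyExtractor itself. The class name, method name and sample document are hypothetical, and joining the values with spaces is an assumption based on the description of the links property.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; not the Slide property extractor.
public class LinkExtractionSketch {

    /** Returns the values of all href and src attributes, separated by spaces. */
    public static String extractLinks(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//@href|//@src",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue());
            }
            return String.join(" ", links);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><a href=\"http://www.example.com/\">site</a>"
                + "<img src=\"/binaries/logo.png\"/></doc>";
        System.out.println(extractLinks(xml));
    }
}
```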

Enabling the link checker in the CMS

This will activate a background task inside the CMS, which will periodically check links in XML documents. The result of a linkchecker run is written to an XML file at the following location:

repository://configuration/brokenlinks/brokenlinks.xml

Make sure the directory of the file, set in the property 'cms.brokenLinkChecker.resultDocumentUrl', exists in the repository.

The linkchecker background task is disabled by default. When enabled, it runs every day at 3:30 AM unless configured otherwise.
There are two ways of configuring when the linkchecker task should run:

  • using a cron expression, which is the default (and most flexible) method to schedule the task. The default configuration is:
    cms.linkchecker.cron=0 30 3 ? * *

    See this page for a detailed description of the cron expression syntax.

  • using an interval specified in seconds using the following CMS build property:
    cms.linkchecker.interval=86400

    This is a simpler configuration. The time and date at which the task runs are not specified; they depend on when the CMS was started.
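
To make the two scheduling options above easier to read, the following sketch labels the fields of a cron expression and converts an interval in seconds to hours. It assumes the Quartz-style field order (seconds, minutes, hours, day-of-month, month, day-of-week, optional year), which matches the six-field expressions shown on this page but is not stated here explicitly. The class and method names are hypothetical, and the sketch does not validate the field values.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; assumes Quartz-style cron field order.
public class ScheduleSketch {

    private static final String[] FIELD_NAMES = {
        "seconds", "minutes", "hours", "day-of-month", "month", "day-of-week", "year"
    };

    /** Maps each field of a 6- or 7-field cron expression to its assumed name. */
    public static Map<String, String> labelFields(String cronExpression) {
        String[] parts = cronExpression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields: " + cronExpression);
        }
        Map<String, String> labelled = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            labelled.put(FIELD_NAMES[i], parts[i]);
        }
        return labelled;
    }

    /** Converts an interval in seconds to whole hours. */
    public static long intervalInHours(long seconds) {
        return Duration.ofSeconds(seconds).toHours();
    }

    public static void main(String[] args) {
        System.out.println(labelFields("0 30 3 ? * *"));
        System.out.println(intervalInHours(86400) + " hours");
    }
}
```

Under this assumption, 0 30 3 ? * * reads as second 0, minute 30, hour 3 on every day, and an interval of 86400 seconds corresponds to one run every 24 hours.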