Purpose
The link checker checks the links that occur in the content of XML documents: links to external websites as well as internal links (links to repository documents).
| Current version The information below is about the rewritten broken links checker that was added to Hippo CMS in version 6.05.02. |
| Hippo CMS version | v6.05.02 and up |
Reasons for the rewrite
The broken link checker was rewritten for the following reasons:
- The DASL limit of at most 1000 results was not taken into account, so not all documents were checked for broken links.
- Checking links took a long time because only a single thread was used, and that thread spent most of its time waiting for responses from the servers the links point to. The current version uses multiple threads to check several links simultaneously.
- The previous implementation depended heavily on the CMS and Avalon. The current implementation can also run outside the CMS and does not require an Avalon container.
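The multi-threaded approach described above can be sketched as follows. This is a minimal illustration and not the actual Hippo CMS implementation: the class name ParallelLinkCheckSketch and its methods are hypothetical, and treating HTTP status codes from 200 up to 399 as "reachable" is an assumption.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Illustrative sketch only; class and method names are hypothetical.
final class ParallelLinkCheckSketch {

    // Sends a HEAD request with a timeout. Assumption: 2xx and 3xx count as reachable.
    static boolean isReachable(String url, int timeoutSeconds) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(timeoutSeconds * 1000);
            connection.setReadTimeout(timeoutSeconds * 1000);
            try {
                int code = connection.getResponseCode();
                return code >= 200 && code < 400;
            } finally {
                connection.disconnect();
            }
        } catch (Exception e) {
            // Malformed URL, non-HTTP scheme, timeout or I/O error: report as broken.
            return false;
        }
    }

    // Checks all links on a fixed pool of worker threads and returns the
    // links for which the check failed, in input order.
    static List<String> findBroken(List<String> links, Predicate<String> check, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String link : links) {
                Callable<Boolean> task = () -> check.test(link);
                results.add(pool.submit(task));
            }
            List<String> broken = new ArrayList<>();
            for (int i = 0; i < links.size(); i++) {
                boolean ok;
                try {
                    ok = results.get(i).get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    ok = false;
                } catch (ExecutionException e) {
                    ok = false;
                }
                if (!ok) {
                    broken.add(links.get(i));
                }
            }
            return broken;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Links to check are taken from the command line; 10 threads, 10 second timeout.
        List<String> broken = findBroken(Arrays.asList(args), link -> isReachable(link, 10), 10);
        System.out.println(broken);
    }
}
```

Because each worker spends most of its time waiting for a remote server, running several checks at once shortens the total run roughly in proportion to the number of threads, up to the point where the network or the remote servers become the bottleneck.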
Running inside the CMS
To be able to run the broken link checker inside the CMS, a wrapper which allows the broken link checker to be managed by Avalon has been written: nl.hippo.cms.brokenlinkchecker.avalon.BrokenLinkCheckerAvalonWrapper. To use this wrapper, the following three things must be done:
- declare the component so Avalon will instantiate the wrapper
- add the configuration declaration for the wrapper
- set the configuration in the build properties
Component declaration
Here is how the component can be declared in user.xroles:
<role
default-class="nl.hippo.cms.brokenlinkchecker.avalon.BrokenLinkCheckerAvalonWrapper"
name="BrokenLinkChecker"
shorthand="broken-link-checker"/>
This declaration is added during the build by the xpatch script stored in src/config/broken-link-checker.xroles.
Component configuration
Just declaring the component is not enough. The component must be configured too. This is done by adding the following to cocoon.xconf:
<broken-link-checker>
  <parameter name="enabled" value="true"/>
  <parameter name="role-of-this-component" value="BrokenLinkChecker"/>
  <parameter name="job-name" value="BrokenLinksCheckerJob"/>
  <parameter name="cron-expression" value="0 0 4 * * ?"/>
  <parameter name="document-tree-to-check-root-url" value="http://localhost:60000/default/files/default.www/content/bulk"/>
  <parameter name="documents-base-url" value="/default/files/default.www"/>
  <parameter name="internal-url-prefixes-to-ignore" value="/assets/binaries/ /binaries/"/>
  <parameter name="internal-links-base-url" value="http://localhost:60000/default/files/default.www"/>
  <parameter name="repository-username" value="root"/>
  <parameter name="repository-password" value="password"/>
  <parameter name="result-document-url" value="http://localhost:60000/default/files/default.www/broken-links.xml"/>
  <parameter name="document-batch-size" value="100"/>
  <parameter name="number-of-link-checking-threads" value="10"/>
  <parameter name="link-check-timeout-seconds" value="10"/>
</broken-link-checker>
This configuration is added during the build by the xpatch script stored in src/config/broken-link-checker.xconf.
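To illustrate how the parameters internal-url-prefixes-to-ignore and internal-links-base-url might be applied to a single link, here is a small sketch. The semantics shown are assumptions, not taken from the checker's source: links starting with http:// or https:// are treated as external, the ignore list is read as space-separated prefixes, and an internal link is resolved by appending it to the base URL. The class LinkClassificationSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; the classification rules below are assumptions.
final class LinkClassificationSketch {

    enum Kind { EXTERNAL, INTERNAL_IGNORED, INTERNAL }

    // Splits a space-separated parameter value such as "/assets/binaries/ /binaries/".
    static List<String> parsePrefixes(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Assumption: absolute http(s) links are external, everything else is internal.
    static Kind classify(String link, List<String> ignoredPrefixes) {
        if (link.startsWith("http://") || link.startsWith("https://")) {
            return Kind.EXTERNAL;
        }
        for (String prefix : ignoredPrefixes) {
            if (link.startsWith(prefix)) {
                return Kind.INTERNAL_IGNORED;
            }
        }
        return Kind.INTERNAL;
    }

    // Assumption: an internal link is checked at base URL + link.
    static String resolveInternal(String link, String internalLinksBaseUrl) {
        return internalLinksBaseUrl + link;
    }

    public static void main(String[] args) {
        List<String> prefixes = parsePrefixes("/assets/binaries/ /binaries/");
        String base = "http://localhost:60000/default/files/default.www";
        for (String link : args) {
            Kind kind = classify(link, prefixes);
            String target = kind == Kind.INTERNAL ? resolveInternal(link, base) : link;
            System.out.println(kind + " " + target);
        }
    }
}
```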
Build properties
The configuration of the Avalon wrapper can be set using build properties. The properties for the new version use the prefix cms.brokenLinkChecker instead of cms.linkchecker to prevent confusion. The following properties are included in project.properties; if needed, override them in build.properties:
# Configuration for the broken link checker. See the
# Javadoc of class 'BrokenLinkCheckerAvalonWrapper' for an
# explanation of these properties.
#
# Setting an optional property here will not cause it to
# be used. You also have to enable the corresponding XML
# element in 'broken-link-checker.xconf'.
#
cms.brokenLinkChecker.enabled=false
cms.brokenLinkChecker.role=BrokenLinkChecker
cms.brokenLinkChecker.jobName=BrokenLinksCheckerJob
# Example value: daily at 4:00 a.m.
cms.brokenLinkChecker.cronExpression=0 0 4 * * ?
cms.brokenLinkChecker.documentTreeToCheck=http://localhost:60000/default/files/default.www/content/bulk
cms.brokenLinkChecker.documentsBase=/default/files/default.www
cms.brokenLinkChecker.internalUrlPrefixesToIgnore=/assets/binaries/ /binaries/
cms.brokenLinkChecker.internalLinksBase=http://localhost:60000/default/files/default.www
cms.brokenLinkChecker.repositoryUsername=root
cms.brokenLinkChecker.repositoryPassword=password
cms.brokenLinkChecker.resultDocumentUrl=http://localhost:60000/default/files/default.www/broken-links.xml
cms.brokenLinkChecker.documentBatchSize=100
cms.brokenLinkChecker.numberOfLinkCheckingThreads=10
cms.brokenLinkChecker.linkCheckTimeoutSeconds=10
NOTE: The directory that will contain the broken-links.xml file must already exist, otherwise the broken link checker will not work. You can point the result document to any directory you like, as long as that directory exists.
Running outside the CMS
The broken link checker can also be run in an environment other than the CMS. First create an instance of nl.hippo.cms.brokenlinkchecker.BrokenLinkCheckerRunConfigurationBean and set the appropriate values. See the Javadoc of the constructor of the configuration bean and of method assertConfigurationIsValid(...) of BrokenLinkCheckerRun for information about required, optional and default values. Then create an instance of nl.hippo.cms.brokenlinkchecker.BrokenLinkCheckerRun, passing the configuration just created. To start checking for broken links, invoke execute():
BrokenLinkCheckerRunConfigurationBean config = new BrokenLinkCheckerRunConfigurationBean();
// Set values on 'config'
BrokenLinkCheckerRun run = new BrokenLinkCheckerRun(config);
run.execute();
| Deprecated version The rest of the information on this page is about the deprecated version. |
| Hippo CMS version | v6.05.02-dev and up |
Requirements
The following things have to be configured:
- The links in a document should be extracted by Slide into the property http://hippo.nl/cms/1.0:links. The links property should be a space-separated list of valid URLs. This can be achieved by using the UrlListXMLPropertyExtractor property extractor.
Configure the following CMS property:
<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor" uri="/files/default.preview" content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
  <configuration>
    <instruction property="links" namespace="http://hippo.nl/cms/1.0" xpath="//@href|//@src"/>
  </configuration>
</extractor>
Moreover please configure the indexer as follows:
<property namespace="http://hippo.nl/cms/1.0" name="links" type="text" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
- Configure the system user CMS properties, e.g.:
maven.cocoon.repository.systemcredentials.username=root
maven.cocoon.repository.systemcredentials.password=password
These properties are used by the linkchecker to crawl the repository.
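The space-separated links property described in the requirements above could be turned into a list of individual URLs as sketched below. This is only an illustration: the class LinksPropertySketch is hypothetical, and removing duplicate links is an added assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not part of the link checker itself.
final class LinksPropertySketch {

    // Splits the value of the 'links' property on whitespace and removes
    // duplicates while keeping the order of first occurrence.
    static List<String> parseLinks(String propertyValue) {
        Set<String> unique = new LinkedHashSet<>();
        if (propertyValue != null) {
            for (String token : propertyValue.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    unique.add(token);
                }
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        String value = "http://www.example.org/ /content/news/item.xml http://www.example.org/";
        for (String link : parseLinks(value)) {
            System.out.println(link);
        }
    }
}
```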
Enabling the link checker in the CMS
This will activate a background task inside the CMS, which will periodically check links in XML documents. The result of a linkchecker run is written to an XML file at the following location:
repository://configuration/brokenlinks/brokenlinks.xml
Make sure the directory of the file, set in the property 'cms.brokenLinkChecker.resultDocumentUrl', exists in the repository.
The linkchecker background task is disabled by default. Once enabled, it runs every day at 3:30 AM unless configured otherwise.
There are two ways of configuring when the linkchecker task should run:
- using a cron expression, which is the default (and most flexible) method to schedule the task. The default configuration is:
cms.linkchecker.cron=0 30 3 ? * *
See this page for a detailed description of the cron expression syntax.
- using an interval specified in seconds using the following CMS build property:
cms.linkchecker.interval=86400
This is a simple configuration. The time and date at which the task runs are not specified; they depend on when the CMS was started.
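The difference between the two scheduling methods can be made concrete with a small calculation. The sketch below does not use the scheduler of the CMS; it only computes the next run time with java.time. The class ScheduleSketch is hypothetical, and it assumes that with an interval the first run happens one full interval after the CMS has started.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Illustrative sketch only; the real scheduling is done by the CMS.
final class ScheduleSketch {

    // Next run for a fixed time of day, which is what the cron
    // expression "0 30 3 ? * *" (daily at 3:30) describes.
    static LocalDateTime nextDailyRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime today = now.toLocalDate().atTime(runAt);
        return today.isAfter(now) ? today : today.plusDays(1);
    }

    // Next run for a fixed interval counted from the moment the CMS started.
    // Assumption: the first run is one full interval after start-up.
    static LocalDateTime nextIntervalRun(LocalDateTime cmsStart, long intervalSeconds, LocalDateTime now) {
        long elapsed = Math.max(0, Duration.between(cmsStart, now).getSeconds());
        long completedIntervals = elapsed / intervalSeconds;
        return cmsStart.plusSeconds((completedIntervals + 1) * intervalSeconds);
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        System.out.println("cron (daily 3:30): " + nextDailyRun(now, LocalTime.of(3, 30)));
        // Pretend the CMS was started five hours ago, with cms.linkchecker.interval=86400.
        System.out.println("interval: " + nextIntervalRun(now.minusHours(5), 86400, now));
    }
}
```

With the cron expression the run time stays at 3:30 regardless of restarts, whereas with the interval every restart of the CMS shifts the time of day at which the task runs.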