DictionaryAnnotator

The DictionaryAnnotator is an annotator for word matching. It loads a dictionary in one of the supported formats, from either a file or an HTTP resource, and annotates all occurrences of the patterns in that dictionary. Fonto Content Quality is case sensitive when searching for dictionary occurrences.

Patterns are strings consisting of one or more characters. Note that wildcards and regular expressions are NOT supported by this annotator. Matching is done on whole words, where a word consists of the characters [A-Za-z0-9], or more specifically the characters accepted by the .NET Char.IsLetterOrDigit() method. Any other character acts as a word boundary.
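The matching rules above can be illustrated with a minimal sketch. This is not the Content Quality implementation: the function name find_occurrences is hypothetical, and Python's str.isalnum() is used here as an approximation of .NET's Char.IsLetterOrDigit().

```python
def find_occurrences(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start offset, pattern) for each whole-word, case-sensitive match."""
    matches = []
    for pattern in patterns:
        if not pattern:
            continue  # patterns consist of one or more characters
        start = text.find(pattern)  # plain substring search, no wildcards or regex
        while start != -1:
            end = start + len(pattern)
            # A match only counts when it is not glued to a letter or digit.
            before_ok = start == 0 or not text[start - 1].isalnum()
            after_ok = end == len(text) or not text[end].isalnum()
            if before_ok and after_ok:
                matches.append((start, pattern))
            start = text.find(pattern, start + 1)
    return sorted(matches)
```

With the pattern foo, this sketch matches "foo", "foo-bar", and "(foo)", but not "Foo" (different case) or "foobar" (not a whole word).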

Configuration

The DictionaryAnnotator must be configured with a source and format.

Source

The dictionary source can be one of the following. Each source has its own configuration attributes.

fileSource

The fileSource loads the dictionary from a file on disk. The file will be automatically monitored for changes.

relativePath

The relative path to the dictionary file, relative to the /Configuration folder.

Yes

N/A

Note that when using FDT to start Fonto Content Quality, your dictionary file is copied to the Docker image, and that copy is watched instead of your local dictionary file. Local changes are therefore not detected while running.

httpSource

The httpSource loads the dictionary using an HTTP request. It will periodically check for changes.

url

The URL from which to request the dictionary. See the endpoint API definition below.

This attribute also supports environment variables.

Yes

N/A

Format

The dictionary format can be one of the following. Each format has its own configuration attributes. The dictionaries are case sensitive.

wordListFormat

The word list format uses a plain text file which consists of one pattern per line. A line is defined as one or more characters followed by a line end ("\n", "\r", or "\r\n").

annotationTypeId

The type identifier to set on the added annotations.

Yes

N/A

Word list file example:

foo
bar
baz

Annotation metadata format:

JavaScript

{
	"pattern": "The pattern as found in the dictionary file, e.g. foo"
}
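As a rough illustration of how a word list maps to annotation metadata, the sketch below splits a file into lines on "\n", "\r", or "\r\n" and produces one metadata object per pattern. The function name parse_word_list is hypothetical, and whether Content Quality trims whitespace around patterns is not stated here, so lines are kept as they are.

```python
import re


def parse_word_list(content: str) -> list[dict[str, str]]:
    """Turn word list content into one metadata object per non-empty line."""
    # Split only on the documented line ends: "\r\n", "\r", or "\n".
    lines = re.split(r"\r\n|\r|\n", content)
    # A line needs at least one character, so empty lines are skipped.
    return [{"pattern": line} for line in lines if line]
```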

solrSynonymFormat

The Solr synonym format uses a Solr synonyms file which contains one or more synonyms per line. A line is defined as one or more characters followed by a line end ("\n", "\r", or "\r\n").

Only the explicit mapping syntax is supported, which can be recognized by the "=>" on a line. All other lines are ignored.

The comma can be escaped with a backslash if it is part of your matching pattern or replacement.

annotationTypeId

The type identifier to set on the added annotations.

Yes

N/A

Solr synonym file example:

# All occurrences of foo should be replaced by bar.
foo => bar

# All occurrences of foo should be replaced by bar or baz.
foo => bar
foo => baz

# All occurrences of foo and bar should be replaced by baz.
foo, bar => baz

# All occurrences of "foo, bar" should be replaced by baz.
foo\, bar => baz

# All occurrences of foo and bar should be replaced by baz or foobar.
foo, bar => baz, foobar

# All occurrences of foo and bar should be replaced by "baz, foobar".
foo, bar => baz\, foobar

# Not supported: inexplicit mappings
foo, bar, baz

Annotation metadata format:

JavaScript

{
	"synonyms": [
		"The synonym(s) as found in the dictionary on the right hand side of the =>, e.g. bar"
	],
	"pattern": "The pattern as found in the Solr synonym file, e.g. foo"
}
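The parsing rules for the Solr synonym format can be sketched as follows. This is an illustration under assumptions, not the actual parser: the function names are hypothetical, lines starting with # are treated as comments, and repeated mappings for the same pattern are assumed to be merged, as the "bar or baz" example above suggests.

```python
import re


def split_terms(side: str) -> list[str]:
    """Split one side of a mapping on commas that are not escaped by a backslash."""
    parts = re.split(r"(?<!\\),", side)
    # Unescape "\," after splitting so the comma becomes part of the term.
    return [p.strip().replace("\\,", ",") for p in parts if p.strip()]


def parse_solr_synonyms(content: str) -> dict[str, list[str]]:
    """Map each pattern to its synonyms, using explicit "=>" mappings only."""
    mappings: dict[str, list[str]] = {}
    for line in re.split(r"\r\n|\r|\n", content):
        line = line.strip()
        # Comments and lines without an explicit mapping are ignored.
        if not line or line.startswith("#") or "=>" not in line:
            continue
        left, right = line.split("=>", 1)
        synonyms = split_terms(right)
        for pattern in split_terms(left):
            known = mappings.setdefault(pattern, [])
            for synonym in synonyms:
                if synonym not in known:
                    known.append(synonym)
    return mappings
```

Each key of the returned dictionary corresponds to the "pattern" metadata property, and its value to the "synonyms" list.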

xmlFormat

Use this format to create a dictionary based on any given XML file using XPath expressions.

Fonto Content Quality supports W3C XPath 1.0 expressions. See https://www.w3.org/TR/1999/REC-xpath-19991116/ for details.

itemsQuery

The XPath expression to select the items from which to create the dictionary.

Yes

N/A

patternQuery

The XPath expression to select the pattern for each item.

Yes

N/A

annotationTypeId

The type identifier to set on the added annotations.

Yes

N/A

Mapping metadata:

Within the <xmlFormat> configuration, you can use <mapping> elements to add data to the annotation metadata:

valueQuery

The XPath expression to select the metadata value for each item.

Yes

N/A

metadataName

The metadata property name.

Yes

N/A

XML file example:

<xml>
	<products>
		<product name="Fonto Content Quality" url="https://www.fontoxml.com/fonto-content-quality/">
			<owner>Fonto</owner>
		</product>
		<product name="Fonto Review" url="https://www.fontoxml.com/fonto-review/">
			<owner>Fonto</owner>
		</product>
	</products>
</xml>
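To show what the queries select from this file, here is a minimal sketch using the values from the example configuration (itemsQuery "//products/product", patternQuery "@name", and mappings for "./owner" and "@url"). It relies on Python's xml.etree.ElementTree, which supports only a subset of XPath, so the attribute queries are emulated with .get(). The function name and the shape of the returned entries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<xml>
  <products>
    <product name="Fonto Content Quality" url="https://www.fontoxml.com/fonto-content-quality/">
      <owner>Fonto</owner>
    </product>
    <product name="Fonto Review" url="https://www.fontoxml.com/fonto-review/">
      <owner>Fonto</owner>
    </product>
  </products>
</xml>"""


def build_xml_dictionary(xml_text: str) -> list[dict]:
    """Collect one entry (pattern plus mapped metadata) per selected item."""
    root = ET.fromstring(xml_text)
    entries = []
    # itemsQuery "//products/product"; ElementTree needs the relative form.
    for item in root.findall(".//products/product"):
        entries.append({
            "pattern": item.get("name"),            # patternQuery "@name"
            "metadata": {
                "owner": item.findtext("./owner"),  # valueQuery "./owner"
                "website": item.get("url"),         # valueQuery "@url"
            },
        })
    return entries
```

In this sketch, occurrences of "Fonto Content Quality" and "Fonto Review" would be annotated, each carrying the owner and website metadata of its product.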

jsonFormat

Use this format to create a dictionary based on any given JSON file using JSONPath expressions.

See https://goessner.net/articles/JsonPath/ for details on JSONPath.

itemsQuery

The JSONPath expression to select the items from which to create the dictionary.

Yes

N/A

patternQuery

The JSONPath expression to select the pattern for each item.

Yes

N/A

annotationTypeId

The type identifier to set on the added annotations.

Yes

N/A

Mapping metadata:

Within the <jsonFormat> configuration, you can use <mapping> elements to add data to the annotation metadata:

valueQuery

The JSONPath expression to select the metadata value for each item.

Yes

N/A

metadataName

The metadata property name.

Yes

N/A

JSON file example:

{
	"abbreviations": [
		{
			"pattern": "API",
			"shortened-for": "Application programming interface",
			"description": "In computer programming, an application programming interface (API) is a set of subroutine definitions, communication protocols, and tools for building software."
		}
	]
}
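The same idea applies to the JSON format. The sketch below uses the values from the example configuration (itemsQuery "$.abbreviations[*]", patternQuery "pattern", and mappings from "shortened-for" and "description"). Python's standard library has no JSONPath support, so the queries are emulated with plain dictionary access. The function name, the returned shape, and the shortened description text are assumptions made for illustration.

```python
import json

# Abbreviated copy of the example file above (the description is shortened).
JSON_TEXT = """{
  "abbreviations": [
    {
      "pattern": "API",
      "shortened-for": "Application programming interface",
      "description": "A set of subroutine definitions, protocols, and tools."
    }
  ]
}"""


def build_json_dictionary(json_text: str) -> list[dict]:
    """Collect one entry (pattern plus mapped metadata) per selected item."""
    data = json.loads(json_text)
    entries = []
    for item in data["abbreviations"]:        # itemsQuery "$.abbreviations[*]"
        entries.append({
            "pattern": item["pattern"],        # patternQuery "pattern"
            "metadata": {
                "fully-written": item["shortened-for"],  # valueQuery "shortened-for"
                "explanation": item["description"],      # valueQuery "description"
            },
        })
    return entries
```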

Example configuration

XML

<!-- Dictionary sequence -->
<sequential>
	<parallel>
		<dictionaryAnnotator>
			<httpSource url="http://my-server/dictionaries/solr/synonym.txt"/>
			<solrSynonymFormat annotationTypeId="demo:part" />
		</dictionaryAnnotator>

		<dictionaryAnnotator>
			<fileSource relativePath="wordlistdictionary.txt"/>
			<wordListFormat annotationTypeId="demo:product" />
		</dictionaryAnnotator>
	</parallel>

	<dictionaryAnnotator>
		<fileSource relativePath="dictionary.xml" />
		<xmlFormat itemsQuery="//products/product" patternQuery="@name" annotationTypeId="dictionary-item">
			<mapping valueQuery="./owner" metadataName="owner"/>
			<mapping valueQuery="@url" metadataName="website"/>
		</xmlFormat>
	</dictionaryAnnotator>

	<dictionaryAnnotator>
		<fileSource relativePath="dictionary.json" />
		<jsonFormat itemsQuery="$.abbreviations[*]" patternQuery="pattern" annotationTypeId="dictionary-item">
			<mapping valueQuery="shortened-for" metadataName="fully-written"/>
			<mapping valueQuery="description" metadataName="explanation"/>
		</jsonFormat>
	</dictionaryAnnotator>
</sequential>

Editor

Make sure to register the returned annotation types as custom annotations inside the editor.

HTTP source API

When using the httpSource, your endpoint must handle the following request and respond accordingly.

GET {endpoint}

Serves a dictionary file. This request is made by Content Quality, not by the Fonto editor.

Parameters

Request

Headers

If-None-Match

Optional

Contains the entity-tag value as received from a previous request to this endpoint, if any. This header will not be set on the first request after starting Content Quality.

Response

Status

Reason and model

200

The dictionary file is returned as is, in the encoding and content type as understood by the configured dictionary annotator format.

Headers

Cache-Control

Optional

The max-age part, in seconds, is used to determine after how much time Content Quality will make a new request to this endpoint to check for file changes. If this value is not supplied, the default of 300 seconds is used.

Set max-age to a sensible value. How often does your dictionary change? How quickly should dictionary changes be propagated to Content Quality?

ETag

Optional

A strong entity-tag for the dictionary file. Note that this must change if the dictionary changes.

304

The dictionary has not been changed, based on the If-None-Match request header and the current entity-tag value of the dictionary file.

400

Bad Request.

500

Any error in the 500 range indicates a problem with the dictionary endpoint.

When the request fails or times out, a new request is made after the interval determined by max-age (or its default value). Between the failed request and the next successful request, the annotator will not work and annotating will fail.

See the documentation for If-None-Match, ETag, and Cache-Control max-age. Most notably, depending on your webserver framework, you must be aware that the entity-tag value in the ETag and If-None-Match headers is surrounded by double quotes as per specification.
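A dictionary endpoint that follows the contract above could look roughly like the sketch below, written with Python's standard http.server module. It is a minimal illustration, not a production server: the in-memory DICTIONARY, the hash-based entity-tag, and the names entity_tag, DictionaryHandler, and make_server are all assumptions. It also compares If-None-Match as a single exact value, which is a simplification.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory dictionary; a real endpoint would read it from storage.
DICTIONARY = b"foo\nbar\nbaz\n"


def entity_tag(content: bytes) -> str:
    """Return a strong entity-tag, including the surrounding double quotes."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


class DictionaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        etag = entity_tag(DICTIONARY)
        # The header value arrives quoted, so compare against the quoted tag.
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)  # dictionary unchanged, no body
            self.send_header("ETag", etag)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(DICTIONARY)))
        # Ask the client to check for changes again after 300 seconds.
        self.send_header("Cache-Control", "max-age=300")
        self.send_header("ETag", etag)
        self.end_headers()
        self.wfile.write(DICTIONARY)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 0) -> HTTPServer:
    """Create the server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), DictionaryHandler)
```

To try it locally, call make_server(8080).serve_forever() and point the url attribute of the httpSource at that address. Because the entity-tag is derived from the content, it changes whenever the dictionary changes.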

Examples

Request

Headers:

If-None-Match: "x234dff"

Response

Headers:

Cache-Control: max-age=300
ETag: "x234dff"

Body:

foo
bar
baz