AnnotationsRegexAnnotator

The AnnotationsRegexAnnotator is an annotator that can be used to define patterns over annotations. Rather than working at the character level, as standard regular expressions do, it matches on annotations and can map those to new semantic annotations. Because it only reasons over annotations, it must be used in combination with at least one other annotator that creates annotations.

This annotator is experimental. While this annotator is very likely to land as a feature of Fonto Content Quality, the API surface might still change before it becomes a stable feature. Your input is greatly appreciated.

You will need to add the following namespace declaration to your analysis configuration file:

Other

xmlns:experimental="http://schemas.fontoxml.com/content-quality/1.0/experimental-analysis-configuration.xsd"

Configuration

Attribute

Description

Required

Default

experimental:annotationTypeId

The type identifier to set on the added annotations.

The attribute is not required because you have the option to map capture groups to annotations instead of annotating the complete match.

No

N/A

experimental:annotationTypesToTokenize

A list of annotation type identifiers used as input to reason over. The order in this attribute is not important. These can be any type of annotation, set by any other annotator, including annotations created by another AnnotationsRegexAnnotator. Any annotation type you use in the pattern must be listed in this attribute.

You can also list annotation types in this attribute that are not explicitly used in the pattern. For example, take the text "A B C" with annotations on "A", "B", and "C", where you want to match "A C" but not "A B C". You can use annotationTypesToTokenize="A B C" and pattern="[A] [C]". This will match on the text "A C", but not on the text "A B C", even though B is not in the pattern. This is because annotation type B is in the annotation sequence based on the specified annotationTypesToTokenize value, and thus interrupts the pattern of "A" followed by "C". If you were to remove "B" from annotationTypesToTokenize, the text "A B C" would match for pattern "[A] [C]", because "B" is no longer included in the annotation sequence even though an annotation for "B" exists. See the Example configuration chapter for a use case where this is used to limit matching.

The annotations MUST NOT overlap with each other. Take special care when creating the annotations used by this annotator. Use more specific annotator rules and/or filters to make sure the input annotation types have no overlap. Overlap can be detected in the Fonto Content Quality console output, which will show an InvalidOperationException starting with the text "Overlap detected in the annotation sequence:".

For example, the text "A B C" with separate annotations on "A", "B", and "C" will work. But when the annotations cover "A B" and "B C", they overlap on "B" and this annotator will not work.

Yes

N/A

experimental:pattern

The AnnotationTokensRegex pattern that will be used as the search pattern. Any annotation type used in the pattern must be listed in the annotationTypesToTokenize attribute. See the Pattern syntax chapter below.

Yes

N/A

Make sure to register the given annotationTypeId as a custom annotation inside the editor.
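
The interrupting effect of annotationTypesToTokenize described in the table above can be sketched as follows. This is a minimal, illustrative fragment rather than a recommended setup: the annotation type identifiers example-a, example-b, example-c, and example-a-then-c are hypothetical, and the fragment assumes the experimental namespace prefix has been declared as shown earlier.

XML

<sequential>
	<!-- Hypothetical input annotators: one annotation type per letter. -->
	<regexAnnotator annotationTypeId="example-a" pattern="A" />
	<regexAnnotator annotationTypeId="example-b" pattern="B" />
	<regexAnnotator annotationTypeId="example-c" pattern="C" />

	<!-- example-b is tokenized but not used in the pattern. For the text "A B C" it sits between example-a and example-c in the annotation sequence, so the pattern does not match. For the text "A C" it does. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypeId="example-a-then-c"
		experimental:annotationTypesToTokenize="example-a example-b example-c"
		experimental:pattern="[example-a] [example-c]"
	/>
</sequential>

Removing example-b from experimental:annotationTypesToTokenize would take it out of the annotation sequence, so the text "A B C" would then match as well.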

Map a subexpression to its own annotation using mapCaptureGroup

You have the ability to map subexpressions to their own annotations using the configuration option mapCaptureGroup inside the annotationsRegexAnnotator.

Attribute

Description

Required

Default

name

The capture group name.

Yes

N/A

annotationTypeId

The type identifier to set on the added annotations.

Yes

N/A
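
As a minimal sketch of how these attributes fit together, the fragment below annotates the complete match and also maps a named capture group to its own annotation. It assumes both options can be combined on one annotator, and that the experimental namespace prefix has been declared as shown earlier. The identifier fontoxml-section-number-only and the group name number are illustrative assumptions, not names defined by the product.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />
	<regexAnnotator annotationTypeId="fontoxml-section" pattern="section" ignoreCase="true" />

	<!-- The complete match gets fontoxml-section-with-number. The named group "number" is additionally mapped to its own (hypothetical) annotation type. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypeId="fontoxml-section-with-number"
		experimental:annotationTypesToTokenize="fontoxml-number fontoxml-section"
		experimental:pattern="[fontoxml-section] (?&lt;number&gt; [fontoxml-number])">
		<experimental:mapCaptureGroup experimental:name="number" experimental:annotationTypeId="fontoxml-section-number-only" />
	</experimental:annotationsRegexAnnotator>
</sequential>

Note that the angle brackets of the named group are escaped as &lt; and &gt; because the pattern is written inside an XML attribute.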

Produces

Annotation types

The AnnotationsRegexAnnotator produces annotation types as configured in the annotationTypeId attribute and/or in the mapCaptureGroup's annotationTypeId attribute. These annotations can also be used as input for another AnnotationsRegexAnnotator.

Metadata

JSON

{
	"annotation": [
		{
			"metadata": {
				"contains": "annotation type specific metadata"
			},
			"type": {
				"localName": "annotation-type",
				"namespaceURI": "urn:annotation:namespace"
			},
			"text": "The text of a matched annotation"
		}
	]
}

Pattern syntax

The AnnotationTokensRegex pattern language is designed to be similar to standard regular expressions over character strings. Many of the concepts from standard regular expressions for strings, such as wildcards and capturing groups, are supported by AnnotationTokensRegex and use a similar syntax. The specifics of the language are described below. The main difference is in the syntax for matching individual tokens. Whitespace is insignificant in patterns unless stated otherwise.

If you want to experiment with writing a pattern without using a complete Fonto Content Quality analysis configuration, you can rewrite a pattern to a regex.

Convert all tokens to letters and remove whitespace. For example, "[X] ([Y] | [Z]){2,3}" becomes "X(Y|Z){2,3}". Then use an input string containing these letters instead of the token sequence, e.g. "XYYZABX", in which the pattern should match "XYYZ". You can test patterns this way with an online regular expression tool, for example regex101.com.

Tokens

Symbol

Meaning

[.]

Matches any single token.

[annotationTypeId]

Matches an annotation with the given type id.

[^annotationTypeId]

Matches any annotation which is not of the given type id.
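
The negated token can be illustrated with a small sketch. The fragment below looks for a fontoxml-section annotation whose next token is something other than a fontoxml-number, and maps only the section part to a new annotation. The identifier fontoxml-unnumbered-section and the group name section are hypothetical. The fragment assumes the experimental namespace prefix has been declared as shown earlier.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />
	<regexAnnotator annotationTypeId="fontoxml-section" pattern="section" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-chapter" pattern="chapter" ignoreCase="true" />

	<!-- [^fontoxml-number] still needs a token to match. A section that is the last token in the sequence is therefore not matched by this pattern. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypesToTokenize="fontoxml-chapter fontoxml-number fontoxml-section"
		experimental:pattern="(?&lt;section&gt; [fontoxml-section]) [^fontoxml-number]">
		<experimental:mapCaptureGroup experimental:name="section" experimental:annotationTypeId="fontoxml-unnumbered-section" />
	</experimental:annotationsRegexAnnotator>
</sequential>

With this configuration, the text "In section TODO, chapter 42 the author stated..." would be expected to yield an annotation on "section", because the token after it is a fontoxml-chapter annotation.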

Basic combinators

Symbol

Meaning

X Y

Expression X followed by expression Y.

X | Y

Expression X or expression Y.
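
A minimal sketch of both combinators used together: the fragment below matches either a fontoxml-section or a fontoxml-chapter annotation, followed by a fontoxml-number annotation. The identifier fontoxml-numbered-reference is hypothetical. The alternation is wrapped in a non-capturing group (see Groups below) so that the example does not depend on any assumption about operator precedence. The fragment assumes the experimental namespace prefix has been declared as shown earlier.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />
	<regexAnnotator annotationTypeId="fontoxml-section" pattern="section" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-chapter" pattern="chapter" ignoreCase="true" />

	<!-- Either a section or a chapter token, followed by a number token. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypeId="fontoxml-numbered-reference"
		experimental:annotationTypesToTokenize="fontoxml-chapter fontoxml-number fontoxml-section"
		experimental:pattern="(?: [fontoxml-section] | [fontoxml-chapter]) [fontoxml-number]"
	/>
</sequential>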

Position indicators

Symbol

Meaning

^

At the beginning of the input. Only allowed at the start of the expression.

$

At the end of the input. Only allowed at the end of the expression.

Groups

Symbol

Meaning

(X)

Capture expression X in a capture group.

(?<name> X)

Capture expression X in a named capture group with name name.

(?: X)

Expression X in a non-capturing group.

X (?<name>) Y

Capture the range between expression X and Y in a named capture group with name name.

^ (?<name>) Y

Capture the range between the beginning of the input and expression Y in a named capture group with name name.

X (?<name>) $

Capture the range between expression X and the end of the input in a named capture group with name name.
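
The empty named group can be sketched as follows. This fragment assumes, as the table describes, that an empty group placed between two expressions captures the range between them. The identifier fontoxml-between-markers and the group name between are hypothetical, and the fragment assumes the experimental namespace prefix has been declared as shown earlier.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-start" pattern="some start" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-end" pattern="some end" ignoreCase="true" />

	<!-- Only start and end are tokenized, so they are neighbours in the annotation sequence. The empty group "between" captures the range that lies between them. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypesToTokenize="fontoxml-end fontoxml-start"
		experimental:pattern="[fontoxml-start] (?&lt;between&gt;) [fontoxml-end]">
		<experimental:mapCaptureGroup experimental:name="between" experimental:annotationTypeId="fontoxml-between-markers" />
	</experimental:annotationsRegexAnnotator>
</sequential>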

Greedy quantifiers

Symbol

Meaning

X?

Expression X, once or not at all.

X*

Expression X, zero or more times.

X+

Expression X, one or more times.

X{n}

Expression X, exactly n times.

X{n,}

Expression X, at least n times.

X{n,m}

Expression X, at least n times but no more than m times.

Lazy quantifiers

Symbol

Meaning

X??

Expression X, once or not at all.

X*?

Expression X, zero or more times.

X+?

Expression X, one or more times.

X{n}?

Expression X, exactly n times.

X{n,}?

Expression X, at least n times.

X{n,m}?

Expression X, at least n times but no more than m times.
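
The difference between the two quantifier families can be sketched with a single pattern. The fragment assumes that greedy and lazy quantifiers behave as they do in standard regular expressions: a greedy quantifier takes as many repetitions as possible, a lazy one as few as possible. The identifier fontoxml-section-with-numbers is hypothetical, and the fragment assumes the experimental namespace prefix has been declared as shown earlier.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />
	<regexAnnotator annotationTypeId="fontoxml-section" pattern="section" ignoreCase="true" />

	<!-- Greedy: for the text "section 2 3 4" the match is expected to extend over all three consecutive number tokens. Writing [fontoxml-number]+? instead would be expected to end the match after the first number. -->
	<experimental:annotationsRegexAnnotator
		experimental:annotationTypeId="fontoxml-section-with-numbers"
		experimental:annotationTypesToTokenize="fontoxml-number fontoxml-section"
		experimental:pattern="[fontoxml-section] [fontoxml-number]+"
	/>
</sequential>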

Example configuration

Group annotations

The following example shows the AnnotationsRegexAnnotator being configured to group two annotations as one when a fontoxml-section annotation is followed by a fontoxml-number annotation and the two are not interrupted by a fontoxml-chapter annotation.

This example will match "In section 2, chapter 42 the author stated..." with an annotation on "section 2". But it will not match "In section TODO, chapter 42 the author stated...". This is because fontoxml-chapter is used as an input annotation, but not in the pattern and thus breaks the fontoxml-section followed by fontoxml-number pattern.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />

	<regexAnnotator annotationTypeId="fontoxml-section" pattern="section" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-chapter" pattern="chapter" ignoreCase="true" />

	<experimental:annotationsRegexAnnotator
		experimental:annotationTypeId="fontoxml-section-with-number"
		experimental:annotationTypesToTokenize="fontoxml-chapter fontoxml-number fontoxml-section"
		experimental:pattern="[fontoxml-section] [fontoxml-number]"
	/>
</sequential>

Capture groups

The following example shows AnnotationsRegexAnnotator being configured to only annotate a part of the pattern.

This example will match "Content containing some start followed by chapter 42 followed by some end will match." with an annotation on "chapter 42". But it will not match "Content referencing chapter 42 without start and/or end.". This is because fontoxml-start and fontoxml-end are not found and thus the expression does not yield matches even though the expression in the named capture group itself matches.

XML

<sequential>
	<regexAnnotator annotationTypeId="fontoxml-number" pattern="\d+" />

	<regexAnnotator annotationTypeId="fontoxml-start" pattern="some start" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-end" pattern="some end" ignoreCase="true" />
	<regexAnnotator annotationTypeId="fontoxml-chapter" pattern="chapter" ignoreCase="true" />

	<experimental:annotationsRegexAnnotator
		experimental:annotationTypesToTokenize="fontoxml-chapter fontoxml-end fontoxml-number fontoxml-start"
		experimental:pattern="[fontoxml-start] (?&lt;capture&gt; [fontoxml-chapter] [fontoxml-number]) [fontoxml-end]">
		<experimental:mapCaptureGroup experimental:name="capture" experimental:annotationTypeId="fontoxml-captured" />
	</experimental:annotationsRegexAnnotator>
</sequential>