SpellCheckAnnotator

The SpellCheckAnnotator provides spell checking capability in the Fonto Content Quality analysis pipeline. This spell checker can be configured per language and supports configuring the dictionary used for spell checking. The dictionary can be augmented with various word lists to ease configuration.

The SpellCheckAnnotator is developed on top of the Hunspell engine, which is the same engine used in LibreOffice, OpenOffice.org, Mozilla Firefox and Google Chrome. Hunspell is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding and character encoding. Hunspell itself is based on MySpell and is backward-compatible with MySpell dictionaries.

The SpellCheckAnnotator needs to be configured with a so-called dictionary. This dictionary describes the words and the morphological rules to determine whether a word is spelled correctly and, if not, generate appropriate suggestions. The Fonto Content Quality SpellCheckAnnotator also includes features to ignore and prohibit specific words as well as a way to specify replacements. These features make it easier to extend Hunspell dictionaries without modifying them directly.

The SpellCheckAnnotator supports configuration for multiple languages. For example, if your content contains both English and Dutch text, you can configure a separate SpellCheckAnnotator for each language. Fonto Content Quality detects the language based on language indications in your content (e.g. the xml:lang attribute). A language is identified by its BCP-47 language tag, for example en-US for American English. If the SpellCheckAnnotator encounters content in a language it does not support, it does not scan that content, which avoids unnecessary false positives.

The Fonto Content Quality editor add-on provides an out-of-the-box visualization for spelling errors detected by the SpellCheckAnnotator. This means you don't have to do any configuration work on the Fonto Editor side to configure a specific UI.

Configuration

The SpellCheckAnnotator can be configured by adding a <spellCheckAnnotator /> element to the analysis configuration file. The <spellCheckAnnotator /> element takes the following attributes:

languages

Description: A space-separated list of at least one BCP-47 language tag.

This list specifies the languages supported by this annotator. Content in these languages is checked for spelling errors.

The language of the content is matched against the list of languages identified by this attribute. The language of the content is controlled via the baseIETFLanguageQuery and nodeIETFLanguageQuery configuration options of the fontoxml-content-quality add-on.

Required: Yes

Default: N/A

hunspellDictionary

Description: A relative path to the Hunspell dictionary file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

See the Linux Man Pages for Hunspell for details on both the Hunspell dictionary and affix file formats. See also Where do I get the Hunspell dictionary and affix files? in the FAQ section.

This attribute also supports environment variables.

Required: No*

Even though this attribute is marked optional, it is recommended in a typical configuration because it provides the core functionality.

Default: N/A

hunspellAffix

Description: A relative path to the Hunspell affix file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

See the Linux Man Pages for Hunspell for details on both the Hunspell dictionary and affix file formats. See also Where do I get the Hunspell dictionary and affix files? in the FAQ section.

This attribute also supports environment variables.

Required: No*

Even though this attribute is marked optional, it is recommended in a typical configuration because it provides the core functionality.

Default: N/A

ignore

Description: A relative path to the ignore text file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

This specifies a list of individual words which are always considered valid. No suggestions are generated for words on this list.

This file must be in the Word list text format.

This attribute also supports environment variables.

Required: No*

Default: N/A

prohibit

Description: A relative path to the prohibit text file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

This specifies a list of individual words which are always considered invalid. No suggestions are generated for words on this list.

This file must be in the Word list text format.

This attribute also supports environment variables.

Required: No*

Default: N/A

spelling

Description: A relative path to the spelling text file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

This specifies a list of individual words which are considered to be spelled correctly. These words also appear as suggestions for similar, but incorrectly spelled, words.

This feature is useful if you want to extend the Hunspell dictionaries without modifying the Hunspell dictionary and affix files directly. It does not allow you to specify any morphological flags.

This file must be in the Word list text format.

This attribute also supports environment variables.

Required: No*

Default: N/A

replacements

Description: A relative path to the replacements text file. This file must exist on disk; Fonto Content Quality does not provide any files out of the box.

This specifies a list of individual words for which specific replacements are given.

This is useful if you want complete control over the suggestions for a given word, because it overrides any and all suggestions generated by Hunspell.

This file must be in the Solr explicit mapping text format.

This attribute also supports environment variables.

Required: No*

Default: N/A

* At least one of the following attributes must be specified for a valid configuration: hunspellDictionary, hunspellAffix, ignore, prohibit, spelling or replacements. It is recommended to at least configure hunspellDictionary and hunspellAffix because these attributes provide the core functionality of a spell checker.

The configuration options for hunspellDictionary, hunspellAffix, ignore, prohibit, spelling or replacements are evaluated in the following order (think of them as priorities):

  1. ignore, meaning all words in this file are considered valid regardless of what lower priority options say;

  2. prohibit, meaning all words in this file are considered invalid regardless of what lower priority options say;

  3. replacements, meaning all words in this file are considered invalid regardless of what lower priority options say;

  4. spelling, meaning all words in this file are considered valid regardless of what lower priority options say;

  5. hunspellDictionary and hunspellAffix, meaning all remaining words are checked for validity against the Hunspell dictionary.
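The priority order above can be read as a chain of lookups that stops at the first match. The following Python sketch illustrates that reading only; it is not the implementation used by Fonto Content Quality. The function check_word, its parameters and the stub fake_hunspell are hypothetical names introduced for this example, and the word sets stand in for the contents of the configured files.

```python
from typing import Callable, Dict, List, Set, Tuple

# (is_valid, suggestions) for a single word.
Result = Tuple[bool, List[str]]


def check_word(
    word: str,
    ignore: Set[str],
    prohibit: Set[str],
    replacements: Dict[str, List[str]],
    spelling: Set[str],
    hunspell_check: Callable[[str], Result],
) -> Result:
    """Evaluate one word in the documented priority order (hypothetical sketch)."""
    if word in ignore:
        # 1. ignore: always valid, no suggestions.
        return True, []
    if word in prohibit:
        # 2. prohibit: always invalid, no suggestions.
        return False, []
    if word in replacements:
        # 3. replacements: invalid, and only the configured replacements are offered.
        return False, list(replacements[word])
    if word in spelling:
        # 4. spelling: valid.
        return True, []
    # 5. Everything else is left to the Hunspell dictionary and affix rules.
    return hunspell_check(word)


def fake_hunspell(word: str) -> Result:
    """Stand-in for a real Hunspell lookup, used only to make the sketch runnable."""
    known = {"hello", "world"}
    return (True, []) if word in known else (False, ["hello"])


if __name__ == "__main__":
    for candidate in ["FontoXML", "hello", "helo"]:
        print(candidate, check_word(
            candidate,
            ignore={"XPath"},
            prohibit={"foobar"},
            replacements={"FontoXML": ["Fonto"]},
            spelling={"Fonto"},
            hunspell_check=fake_hunspell,
        ))
```

Because the lookups are exact set and dictionary membership tests, the sketch is case sensitive. It does not model how words from the spelling list are offered as suggestions for similar misspellings.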

Formats

The files referenced by the ignore, prohibit, spelling and replacements attributes use one of the following formats. The dictionaries are case sensitive.

Word list text format

The word list text format is a plain-text format in which each word is placed on its own line. This format specifies:

  1. White space at the beginning and ending of a line is insignificant and thus trimmed;

  2. Empty lines are not considered valid words and thus ignored;

  3. Lines starting with a ‘#’ are comments and thus ignored;

  4. Remaining lines are considered words.

Example

# This is a comment
 
# Newlines are ignored
word-a
  word-b  

A word list containing the words: “word-a” and “word-b”.
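The four rules of the word list text format can be sketched in a few lines of Python. This is an illustration of the format only, not the parser used by Fonto Content Quality; the function name parse_word_list is hypothetical.

```python
from typing import List


def parse_word_list(text: str) -> List[str]:
    """Parse the word list text format into a list of words (illustrative sketch)."""
    words: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()        # rule 1: surrounding white space is trimmed
        if not stripped:               # rule 2: empty lines are ignored
            continue
        if stripped.startswith("#"):   # rule 3: comment lines are ignored
            continue
        words.append(stripped)         # rule 4: everything else is a word
    return words


if __name__ == "__main__":
    example = (
        "# This is a comment\n"
        " \n"
        "# Newlines are ignored\n"
        "word-a\n"
        "  word-b  \n"
    )
    print(parse_word_list(example))
```

Run against the example above, the sketch yields the two words word-a and word-b.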

Solr explicit mapping text format

The Solr explicit mapping text format is a plain-text format in which each mapping is placed on its own line. This format specifies:

  1. White space at the beginning and ending of a line is insignificant and thus trimmed;

  2. Empty lines are not considered valid mappings and thus ignored;

  3. Lines starting with a # are comments and thus ignored;

  4. Each remaining line is split on => into a left-hand side (lhs) and a right-hand side (rhs);

  5. The lhs and the rhs are each split on commas (,);

  6. Each combination of an lhs entry and an rhs entry is considered a mapping.

Example

# This is a comment
 
# Newlines are ignored
nineties => 1990s
FontoXML,Liones => Fonto

A Solr explicit mapping file containing:

  1. a one-to-one mapping from "nineties" to "1990s";

  2. a many-to-one mapping from "FontoXML" and "Liones" to "Fonto".
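The rules of the Solr explicit mapping text format can be sketched in Python as follows. This is an illustration only, not the parser used by Fonto Content Quality. The function name parse_explicit_mappings is hypothetical, and three behaviors are assumptions that the format description does not state: white space around each comma-separated entry is trimmed, duplicate targets are collapsed, and a line without => raises an error.

```python
from typing import Dict, List


def parse_explicit_mappings(text: str) -> Dict[str, List[str]]:
    """Parse explicit mappings into {word: [replacements]} (illustrative sketch)."""
    mappings: Dict[str, List[str]] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty lines and comments are ignored
        lhs, separator, rhs = stripped.partition("=>")
        if not separator:
            # Assumption: a line without "=>" is treated as an error.
            raise ValueError(f"missing '=>' in mapping: {stripped!r}")
        # Assumption: entries are trimmed and empty entries are dropped.
        sources = [entry.strip() for entry in lhs.split(",") if entry.strip()]
        targets = [entry.strip() for entry in rhs.split(",") if entry.strip()]
        # Every combination of a source and a target is a mapping.
        for source in sources:
            bucket = mappings.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mappings


if __name__ == "__main__":
    example = (
        "# This is a comment\n"
        " \n"
        "# Newlines are ignored\n"
        "nineties => 1990s\n"
        "FontoXML,Liones => Fonto\n"
    )
    print(parse_explicit_mappings(example))
```

For the example above, the sketch maps nineties to 1990s, and both FontoXML and Liones to Fonto.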

Produces

The SpellCheckAnnotator produces annotations for each misspelled word. The qualified name of the produced annotation is Q{urn:fontoxml:content-quality:spelling:1.0.0}spelling-error.

The spell check annotation contains the following metadata:

JSON

{
	"replacements": ["a", "list", "of", "suggested", "replacements"]
}

Example configuration

Please note that the following examples only provide the analysis configuration; they do not include the referenced files. As a developer, you need to create those files or download them from relevant sources. See Where do I get the Hunspell dictionary and affix files? for instructions.

The following example configures a SpellCheckAnnotator for the en (English) and en-US (American English) languages. It uses all configuration options.

XML

<?xml version="1.0" encoding="utf-8"?>
<analysis
	xmlns="http://schemas.fontoxml.com/content-quality/1.0/analysis-configuration.xsd">

	<spellCheckAnnotator
		languages="en en-US"
		hunspellAffix="en/dictionary.aff"
		hunspellDictionary="en/dictionary.dic"
		ignore="en/ignore.txt"
		prohibit="en/prohibit.txt"
		replacements="en/replacements.txt"
		spelling="en/spelling.txt" />

</analysis>

Full-featured spell check configuration example

The following example shows a typical Hunspell-based configuration for the en (English) and en-US (American English) languages. It is the recommended starting point for a spell check configuration.

XML

<?xml version="1.0" encoding="utf-8"?>
<analysis
	xmlns="http://schemas.fontoxml.com/content-quality/1.0/analysis-configuration.xsd">

	<spellCheckAnnotator
		languages="en en-US"
		hunspellAffix="en/dictionary.aff"
		hunspellDictionary="en/dictionary.dic" />

</analysis>

Typical Hunspell-based configuration example

The following example uses a simple word list to provide minimal spell checking capabilities for the en (English) and en-US (American English) languages. It is easy to set up, but it does not leverage any morphological rules. As such, it is recommended only for languages without inflections or compound words.

XML

<?xml version="1.0" encoding="utf-8"?>
<analysis
	xmlns="http://schemas.fontoxml.com/content-quality/1.0/analysis-configuration.xsd">

	<spellCheckAnnotator
		languages="en en-US"
		spelling="spelling.txt" />
	
	<!-- spelling.txt contains a list of all valid English words -->

</analysis>

Minimal spell check without any morphological analysis

FAQ

Which languages are supported?

Fonto Content Quality does not come with any pre-configured dictionaries out-of-the-box. You can, however, configure a spell check annotator for any language you like, as long as you can find a dictionary for that language. Fortunately, dictionaries are available for most languages; see Where do I get the Hunspell dictionary and affix files? for details.

Where do I get the Hunspell dictionary and affix files?

Fonto Content Quality does not come with any pre-configured dictionaries out-of-the-box. You can find Hunspell dictionaries for practically any language you need. You can download dictionaries from the OpenOffice extensions, LibreOffice extensions or Firefox add-ons pages. The extension file is a ZIP archive which you should unzip. The unzipped archive should include the desired *.dic and *.aff files. Alternatively, you can download the *.dic and *.aff files directly from the LibreOffice repository (or one of its mirrors) and the Chromium repository. You can also create your own; consult the Linux Man Pages for Hunspell for details on both the Hunspell dictionary and affix file formats.

Please make sure you read the license agreement for each dictionary to check under which conditions you're allowed to use it.
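Unzipping an extension archive and picking out the *.dic and *.aff files can also be scripted. The following Python sketch uses only the standard library. The function name extract_hunspell_files and the idea of copying the files into a single flat target directory are assumptions made for this example, not something Fonto Content Quality prescribes.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_hunspell_files(archive_path: str, target_dir: str) -> List[str]:
    """Copy every *.dic and *.aff file from a ZIP archive into target_dir.

    Returns the sorted file names that were written. Files are stored under
    their bare file name, so the directory layout inside the archive is
    dropped and files with the same name overwrite each other.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted: List[str] = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            name = Path(info.filename).name
            if info.is_dir() or not name.lower().endswith((".dic", ".aff")):
                continue  # skip directories and unrelated files
            # Writing under the bare file name keeps entries inside target_dir.
            (target / name).write_bytes(archive.read(info))
            extracted.append(name)
    return sorted(extracted)
```

For example, calling extract_hunspell_files("downloads/dictionary.oxt", "en") with a hypothetical downloaded extension would place the dictionary and affix files in the en directory, from where they can be referenced by the hunspellDictionary and hunspellAffix attributes.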

Where do I get the ignore, prohibit, replacements and spelling files?

You can create them manually in a text editor, using the Word list text format or the Solr explicit mapping text format as appropriate. Alternatively, you can search the internet. Many open-source Natural Language Processing tools, such as LanguageTool, have these files in their repositories. There are also communities centered around specific languages, and these often publish word lists as well, for example WordNet and OpenTaal.

Please make sure you read the license agreement for each resource to check under which conditions you're allowed to use it.

Can I add words to the dictionary directly from the user interface?

In short, no. We do not offer this feature out-of-the-box. Typically, such words need to be approved by someone before they are added to the dictionary, in order to ensure consistency across the organization. You can override the built-in configuration in Fonto Editor. This would allow you to plug in your own UI and workflow for approval. From a SpellCheckAnnotator perspective, you probably want to add those words to the ignore or spelling lists. Note that changes to these files require a redeploy or restart of Fonto Content Quality.

How can I optimize the performance of the spell check annotator?

In general, the more words you can add to the ignore, prohibit and replacements lists, the better. The Hunspell dictionary and the words in the spelling list are subject to morphological analysis which, compared to ignore, prohibit and replacements, is relatively expensive. However, even under heavy load, we rarely see a single logical CPU core spike above 30%.

Does the spell check annotator support checking phrases containing multiple words?

No, the spell check works on individual words, not phrases. Consider using LanguageTool and the LanguageToolAnnotator.

Does the spell check annotator support grammar checking?

No, the spell check does not provide grammar checking capabilities. Consider using LanguageTool and the LanguageToolAnnotator.

What is the difference between the SpellCheckAnnotator and the DictionaryAnnotator?

There is indeed some overlap between the features offered by the SpellCheckAnnotator and the DictionaryAnnotator. The SpellCheckAnnotator is designed for spell checking only. The built-in UI is centered around spelling mistakes and suggestions to fix them. The DictionaryAnnotator, on the other hand, is designed to attach arbitrary metadata to the text it annotates. It does not come with a built-in UI. To help you decide which one is applicable, the rule of thumb is: if it is considered a spelling mistake, use a SpellCheckAnnotator; otherwise, use a DictionaryAnnotator. The two annotators are designed to work together, so you can freely combine them in an analysis configuration.

How is the language of the content determined?

By default, the xml:lang attribute is used to determine the language of the content. The value of this attribute must match one of the languages specified in the languages attribute in order for the SpellCheckAnnotator to scan the content for spelling mistakes. If your content does not have an xml:lang attribute, or you want to control its value, you can use the baseIETFLanguageQuery and nodeIETFLanguageQuery configuration options of the fontoxml-content-quality add-on.
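The matching rule can be pictured with a small Python sketch. It assumes the simplest possible reading of "must match": the content's language tag has to equal one of the space-separated tags in the languages attribute, compared without regard to case. Whether Fonto Content Quality compares tags exactly this way is an assumption, and the function name should_scan is hypothetical.

```python
def should_scan(content_language: str, languages_attribute: str) -> bool:
    """Return True if content_language is listed in the languages attribute.

    Assumption: whole tags are compared, ignoring case, with no fallback
    from a regional tag such as en-GB to its base language en.
    """
    supported = {tag.lower() for tag in languages_attribute.split()}
    return content_language.strip().lower() in supported


if __name__ == "__main__":
    for language in ["en-US", "en-GB", "nl"]:
        print(language, should_scan(language, "en en-US"))
```

Under this reading, content marked en-GB would not be scanned by an annotator configured with languages="en en-US", which is consistent with the example configurations listing both en and en-US explicitly.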