Static Public Member Functions
_.Library.Status	AddWord (_.Library.String pLanguage, _.Library.String pWord, _.Library.Integer pFrequency, _.Library.Boolean pClean, _.Library.Boolean pVerbose)

_.Library.Status	AppendTrainingDataFromDomain (_.Library.String pDomainName, _.Library.String pLanguage, _.Library.Integer pEntType, _.Library.Boolean pClean, _.Library.Boolean pVerbose)

_.Library.Status	AppendTrainingDataFromFiles (_.Library.String pDirectory, _.Library.String pLanguage, _.Library.Boolean pClean, _.Library.Boolean pVerbose)

_.Library.Status	AppendTrainingDataFromQuery (_.Library.ResultSet pResultSet, _.Library.String pLanguage, _.Library.Boolean pClean, _.Library.Boolean pVerbose)

_.Library.Status	CleanWordList (_.Library.String pLanguage, _.Library.Boolean pVerbose, _.Library.String pOutputFile, _.Library.String pFilter)
	Clears any identifiable compounds from the current decompound dictionary for pLanguage.

_.Library.Status	ClearTrainingData (_.Library.String pLanguage)
	Drops ALL training data for a given language.

_.Library.Status	NeverSeparate (_.Library.String pLanguage, _.Library.String pString)
	Marks pString as a character sequence that should never be split off and therefore never be returned as a compound element of its own.

_.Library.Status	RemoveWord (_.Library.String pLanguage, _.Library.String pWord)
	Removes a word from the compound dictionary for the supplied language.

Static Private Member Functions
_.Library.Status	__GenerateWords (_.Library.String pText, pLangProps, _.Library.Boolean pTrackCaps, _.Library.Integer pFreq)
	Queues words in a PPG ^||IK.Words, to be saved by <method>SaveWords</method>

_.Library.Status	__SaveWords (_.Library.Integer pLangId, _.Library.Boolean pVerbose)
	Saves the words queued in ^||IK.Words by <method>GenerateWords</method>

Detailed Description

This class contains utility methods to manage the word list used by the decompounding algorithm. Decompounding means identifying the words that make up a compound term, such as the words "thunder" and "storm" in the compound term "thunderstorms". It is used primarily for search purposes, allowing you to also find records that contain compounds of the search terms. Languages like German, where compounding happens often, require decompounding support for a good search experience.
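To make the idea concrete, the following toy Python sketch looks for known words inside a compound term. It is only an illustration of dictionary-based decompounding under simple assumptions (greedy longest match, unmatched characters skipped); it is not the algorithm this class implements, and the function name find_components is hypothetical.

```python
def find_components(term, words):
    """Return the dictionary words found in `term`, scanning left to right."""
    term = term.lower()
    found = []
    i = 0
    while i < len(term):
        # Try the longest candidate starting at position i first.
        for j in range(len(term), i, -1):
            if term[i:j] in words:
                found.append(term[i:j])
                i = j
                break
        else:
            # No candidate word starts here; skip one character.
            i += 1
    return found


print(find_components("thunderstorms", {"thunder", "storm"}))
```

With the two candidate words "thunder" and "storm", the trailing "s" is skipped and both words are reported as components.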

Training the decompounder

The decompounding algorithm supplied here requires a list of candidate words it will try to recognize in to-be-decompounded terms. These candidate words can be added through training the algorithm using any of the following methods, which accept free text that will be cut into candidate terms and then stripped of any recognizable compounds:

<method>AppendTrainingDataFromQuery</method> loads candidate words from a query result set
<method>AppendTrainingDataFromFiles</method> loads candidate words from plaintext files
<method>AppendTrainingDataFromDomain</method> loads candidate words from an iKnow domain

Alternatively, individual words can be added and removed through the <method>AddWord</method> and <method>RemoveWord</method> methods. Words that should never be separated (returned as a single word) can be registered through the <method>NeverSeparate</method> method.
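The training flow described above (cut free text into candidate terms, then strip recognizable compounds) can be sketched in Python as follows. This is a rough approximation for illustration only: the tokenization rule, the minimum part length of 3, and the names candidate_words, is_compound and clean are all assumptions, not the behavior of the Append* or CleanWordList methods.

```python
import re
from collections import Counter


def candidate_words(text):
    """Cut free text into lowercase candidate words with their frequencies."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def is_compound(word, words, min_len=3):
    """True if `word` can be assembled entirely from other candidate words."""
    others = {w for w in words if w != word and len(w) >= min_len}

    def can_build(rest):
        if not rest:
            return True
        return any(rest.startswith(w) and can_build(rest[len(w):]) for w in others)

    return can_build(word)


def clean(words):
    """Drop candidates that are recognizable compounds of other candidates."""
    return {w: n for w, n in words.items() if not is_compound(w, words)}


raw = candidate_words("Thunder and storm: a thunderstorm.")
print(sorted(clean(raw)))
```

Here "thunderstorm" is dropped from the cleaned list because it can be built from "thunder" and "storm", which remain as candidates.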

Invoking the decompounder

Decompounding is used by iFind indices that have their INDEXOPTION set to 2 (see also <class>iFind.Index.Basic</class>). When records are subsequently added to a table indexed this way, all words are checked for compounding and additional index structures are populated so that records can be retrieved based on the compounding words.

The algorithm can also be invoked directly through an <class>iKnow.Stemmer</class> object, should there be any requirement to find the compounding words of a given term (e.g. for debugging purposes).

    // simple training
    do ##class(iKnow.Stemming.DecompoundUtils).AddWord("en", "thunder")
    do ##class(iKnow.Stemming.DecompoundUtils).AddWord("en", "storm")

    // invoke decompounder
    write ##class(iKnow.Stemmer).GetDefault("en", .tStemmer)
    write tStemmer.Decompound("thunderstorms", .tWords)
    zwrite tWords

Member Function Documentation

◆ AddWord()

_.Library.Status AddWord	(	_.Library.String	pLanguage,
		_.Library.String	pWord,
		_.Library.Integer	pFrequency,
		_.Library.Boolean	pClean,
		_.Library.Boolean	pVerbose
	)

static

Adds a word to the compound dictionary for the supplied language. The supplied word will be treated as a valid compound element that the algorithm will no longer try to split into smaller elements. Optionally supply a positive integer frequency value to increase its weight when multiple options are available.

If pWord is also present in the list of strings never to split off through a call to <method>NeverSeparate</method>, it will be removed from that list.

When performing a lot of manual updates, it is recommended to set pClean=0 and run the <method>CleanWordList</method> method only once after all additions, to check whether the new additions indicate that particular existing words should be removed because they turn out to be compounds themselves.
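How a frequency value might influence the choice between several possible splits can be pictured with the Python sketch below. The scoring rule (sum of the part frequencies) and the names all_splits and best_split are assumptions made purely for illustration; how the decompounder actually weighs pFrequency is not described here.

```python
def all_splits(term, freqs):
    """Enumerate every way to cut `term` entirely into known words."""
    if not term:
        return [[]]
    results = []
    for end in range(1, len(term) + 1):
        head = term[:end]
        if head in freqs:
            for tail in all_splits(term[end:], freqs):
                results.append([head] + tail)
    return results


def best_split(term, freqs):
    """Pick the split whose parts have the highest summed frequency (assumed rule)."""
    splits = all_splits(term, freqs)
    if not splits:
        return None
    return max(splits, key=lambda parts: sum(freqs[p] for p in parts))


# Hypothetical word list with frequencies.
freqs = {"star": 5, "tup": 1, "start": 10, "up": 8}
print(all_splits("startup", freqs))
print(best_split("startup", freqs))
```

In this toy word list "startup" can be cut as "star" + "tup" or as "start" + "up"; the higher frequencies of "start" and "up" make the second split win.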

◆ AppendTrainingDataFromDomain()

_.Library.Status AppendTrainingDataFromDomain	(	_.Library.String	pDomainName,
		_.Library.String	pLanguage,
		_.Library.Integer	pEntType,
		_.Library.Boolean	pClean,
		_.Library.Boolean	pVerbose
	)

static

Appends word frequency information drawn from an existing iKnow domain to the word dictionary for decompounding in this namespace. When pEntType=$$$ENTTYPEANY (default), the full sentence values (with literal info) will be used to derive words. To restrict this to concepts or relations only, use $$$ENTTYPECONCEPT or $$$ENTTYPERELATION, respectively.

Multiple calls to this method (for different domains) will append to the existing info. Use <method>ClearTrainingData</method> if you want to drop all existing data.

When pClean=1, the generated word list will automatically be cleaned after loading the new data through a call to <method>CleanWordList</method>. You may use pClean=0 and only call <method>CleanWordList</method> after appending training data from multiple sources, but it should be called once before decompounding any new words through the <class>iKnow.Stemmer</class> object.

◆ AppendTrainingDataFromFiles()

_.Library.Status AppendTrainingDataFromFiles	(	_.Library.String	pDirectory,
		_.Library.String	pLanguage,
		_.Library.Boolean	pClean,
		_.Library.Boolean	pVerbose
	)

static

Appends word frequency information drawn from the *.txt files in pDirectory to the word dictionary for decompounding in this namespace. Multiple calls to this method (for different directories) will append to the existing info. Use <method>ClearTrainingData</method> if you want to drop all existing data.

When pClean=1, the generated word list will automatically be cleaned after loading the new data through a call to <method>CleanWordList</method>. You may use pClean=0 and only call <method>CleanWordList</method> after appending training data from multiple sources, but it should be called once before decompounding any new words through the <class>iKnow.Stemmer</class> object.

◆ AppendTrainingDataFromQuery()

_.Library.Status AppendTrainingDataFromQuery	(	_.Library.ResultSet	pResultSet,
		_.Library.String	pLanguage,
		_.Library.Boolean	pClean,
		_.Library.Boolean	pVerbose
	)

static

Appends word frequency information drawn from the first column of the supplied ResultSet to the word dictionary for decompounding in this namespace. Multiple calls to this method (for different resultsets) will append to the existing info. Use <method>ClearTrainingData</method> if you want to drop all existing data.

When pClean=1, the generated word list will automatically be cleaned after loading the new data through a call to <method>CleanWordList</method>. You may use pClean=0 and only call <method>CleanWordList</method> after appending training data from multiple sources, but it should be called once before decompounding any new words through the <class>iKnow.Stemmer</class> object.

◆ CleanWordList()

_.Library.Status CleanWordList	(	_.Library.String	pLanguage,
		_.Library.Boolean	pVerbose,
		_.Library.String	pOutputFile,
		_.Library.String	pFilter
	)

static

Clears any identifiable compounds from the current decompound dictionary for pLanguage.

This method should be run at least once between appending data to the training set through any of the Append* methods in this class and using the Decompound() method of an <class>iKnow.Stemmer</class> object.

◆ ClearTrainingData()

_.Library.Status ClearTrainingData ( _.Library.String pLanguage )

static

Drops ALL training data for a given language.

Use with care.

◆ NeverSeparate()

_.Library.Status NeverSeparate	(	_.Library.String	pLanguage,
		_.Library.String	pString
	)

static

Marks pString as a character sequence that should never be split off and therefore never be returned as a compound element of its own. If this string was also part of the compound dictionary as a candidate, it will be removed automatically, as if <method>RemoveWord</method> had been called.

◆ RemoveWord()

_.Library.Status RemoveWord	(	_.Library.String	pLanguage,
		_.Library.String	pWord
	)

static

Removes a word from the compound dictionary for the supplied language.

This word will no longer be treated as a valid compound element. Use this to remove any composite words that were added to the list previously.
