IRISLIB database
Builder Class Reference

Public Member Functions

_.Library.Status OnCreateExportTable (_.Dictionary.ClassDefinition pClassDef, _.Library.Boolean pVerbose)
 Callback invoked by <method>ExportDataTable</method> when creating the export table definition.
 
_.Library.Status OnExportTable (_.Library.String pClassName, _.Library.Boolean pVerbose, _.Library.Boolean pTracking)
 Callback invoked by <method>ExportDataTable</method> to load the data into export table <class>pClassName</class>.
 
_.Library.Status OnGenerateClassifier (_.iKnow.Classification.Definition.Classifier pDefinition, _.Library.Boolean pVerbose, _.Library.Boolean pIncludeBuilderInfo)
 Appends the ClassificationMethod element for this type of classifier.
 
- Public Member Functions inherited from RegisteredObject
_.Library.Status OnAddToSaveSet (_.Library.Integer depth, _.Library.Integer insert, _.Library.Integer callcount)
 This callback method is invoked when the current object is added to the SaveSet.
 
_.Library.Status OnClose ()
 This callback method is invoked by the <METHOD>Close</METHOD> method.
 
_.Library.Status OnConstructClone (_.Library.RegisteredObject object, _.Library.Boolean deep, _.Library.String cloned)
 This callback method is invoked by the <METHOD>ConstructClone</METHOD> method.
 
_.Library.Status OnNew ()
 This callback method is invoked by the <METHOD>New</METHOD> method.
 
_.Library.Status OnValidateObject ()
 This callback method is invoked by the <METHOD>ValidateObject</METHOD> method.
 

Public Attributes

 ClassificationMethod
 The general method used for classification.
 
 Description
 Optional description for the Classifier.
 
 DocumentVectorLocalWeights
 Local Term Weights for the document vector to register in the ClassificationMethod element.
 
 DocumentVectorNormalization
 Document vector normalization method to register in the Classification element.
 
 MinimumSpread
 The minimum number of records in the training set that should contain a term before it can get selected by <method>PopulateTerms</method>.
 
 MinimumSpreadPercent
 The minimum fraction of records in the training set that should contain a term before it can get selected by <method>PopulateTerms</method>.
 

Private Member Functions

_.Library.Status AddCRC (_.Library.List pCRC, _.Library.String pNegation, _.Library.String pCount, _.Library.Integer pIndex)
 
_.Library.Status AddCategory (_.Library.String pName, _.Library.String pSpec, _.Library.Integer pRecordCount, _.Library.String pDescription)
 Adds an optional category named pName for the classifier being built by this class.
 
_.Library.Status AddCooccurrence (_.Library.List pValue, _.Library.String pNegation, _.Library.String pCount, _.Library.Integer pIndex)
 
_.Library.Status AddEntity (_.Library.String pValue, _.Library.String pNegation, _.Library.String pCount, _.Library.Integer pIndex)
 
_.Library.Status AddTermsFromSQL (_.Library.String pSQL, _.Library.String pType, _.Library.String pNegationContext, _.Library.String pCount)
 
_.Library.Status CreateClassifierClass (_.Library.String pClassName, _.Library.Boolean pVerbose, _.Library.Boolean pIncludeBuilderInfo, _.Library.Boolean pOverwrite, _.Library.Boolean pCompile)
 
 DispatchGetProperty (_.Library.String Property)
 Dispatch unknown property getters to <property>MethodBuilder</property>
 
 DispatchMethod (_.Library.String Method, Args)
 Dispatch unknown method calls to <property>MethodBuilder</property>
 
 DispatchSetProperty (_.Library.String Property, Val)
 Dispatch unknown property setters to <property>MethodBuilder</property>
 
_.Library.Status ExportDataTable (_.Library.String pClassName, _.Library.Boolean pOverwrite, _.Library.Boolean pVerbose, _.Library.Boolean pTracking)
 Exports the data in the training set to a new table pClassName, with columns containing the weighted score for each term.
 
_.Library.Status GenerateClassifier (_.iKnow.Classification.Definition.Classifier pDefinition, _.Library.Boolean pIncludeBuilderInfo, _.Library.Boolean pVerbose)
 
_.Library.Status GetCategoryInfo (pCategories)
 Returns all categories added so far.
 
_.Library.Status GetTerms (pTerms)
 Returns all terms added so far.
 
_.Library.Status PopulateTerms (_.Library.Integer pCount, _.Library.String pType, _.Library.String pMetric, _.Library.Boolean pPerCategory)
 
_.Library.Status RemoveTerm (_.Library.String pValue, _.Library.String pType, _.Library.String pNegation, _.Library.String pCount)
 Removes pValue from the first term that contains it and meets the pType, pNegation and pCount criteria.
 
_.Library.Status RemoveTermAtIndex (_.Library.Integer pIndex)
 Removes the term at index pIndex.
 
_.Library.Status RemoveTermEntryAtIndex (_.Library.String pValue, _.Library.Integer pIndex, _.Library.Boolean pRemovedTerm)
 Removes a specific entry pValue from the term at index pIndex.
 
_.Library.Status Reset ()
 Resets the term and category lists for this classifier.
 
_.Library.Status TestClassifier (_.Library.RawString pTestSet, pResult, _.Library.Double pAccuracy, _.Library.String pCategorySpec, _.Library.Boolean pVerbose)
 

Static Private Member Functions

_.Library.Status LoadFromDefinition (_.Library.String pClassName, _.iKnow.Classification.Builder pBuilder, _.Library.Boolean pValidateFirst)
 Loads the categories and terms from an existing Classifier class pClassName.
 

Additional Inherited Members

- Static Public Attributes inherited from RegisteredObject
 CAPTION = None
 Optional name used by the Form Wizard for a class when generating forms.
 
 JAVATYPE = None
 The Java type to be used when exported.
 
 PROPERTYVALIDATION = None
 This parameter controls the default validation behavior for the object.
 

Detailed Description

The InterSystems IRIS NLP iKnow technology is now deprecated. Please see the product documentation for more detail.

This is the framework class for building Text Categorization models, generating valid <class>iKnow.Classification.Classifier</class> subclasses.
Here's an example using the <class>iKnow.Classification.IKnowBuilder</class>:

   // Note: system class and method names carry a leading "%", which this page omits elsewhere

   // first initialize training and test sets
   set tDomainId = $system.iKnow.GetDomainId("Standalone Aviation demo")
   set tTrainingSet = ##class(%iKnow.Filters.SimpleMetadataFilter).%New(tDomainId, "Year", "<", 2007)
   set tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(tDomainId, "AND", 1) // NOT filter
   do tTestSet.AddSubFilter(tTrainingSet)

   // initialize the Builder instance with the domain name and training set
   set tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("Standalone Aviation demo", tTrainingSet)

   // configure it to use a Naive Bayes classifier
   set tBuilder.ClassificationMethod = "naiveBayes"

   // load category info from metadata field "AircraftCategory"
   write tBuilder.%LoadMetadataCategories("AircraftCategory")

   // manually add a few terms
   write tBuilder.%AddEntity("ultralight vehicle")
   set tData(1) = "helicopter", tData(2) = "helicopters"
   write tBuilder.%AddEntity(.tData)
   write tBuilder.%AddEntity("balloon",, "partialCount")
   write tBuilder.%AddCooccurrence($lb("landed", "helicopter pad"))

   // or add them in bulk by letting the Builder instance decide
   write tBuilder.%PopulateTerms(50)

   // after populating the term dictionary, let the Builder generate a classifier class
   write tBuilder.%CreateClassifierClass("User.MyClassifier")

Member Function Documentation

◆ AddCRC()

_.Library.Status AddCRC ( _.Library.List  pCRC,
_.Library.String  pNegation,
_.Library.String  pCount,
_.Library.Integer  pIndex 
)
private

Adds one or more CRCs as a single term to the Text Categorization model's term dictionary.

The term is counted only if it appears in the negation context defined by pNegation. pCount determines how the term's base score, which is fed into the categorization algorithm, is calculated. With "exactCount", only exact occurrences of this CRC are counted. With "partialCount", both exact and partial matches are counted. With "partialScore", the scores of all exact and partial matches are summed.

Multiple CRCs can be supplied as a one-dimensional array of 3-element Lists.

◆ AddCategory()

_.Library.Status AddCategory ( _.Library.String  pName,
_.Library.String  pSpec,
_.Library.Integer  pRecordCount,
_.Library.String  pDescription 
)
private

Adds an optional category named pName for the classifier being built by this class.

The meaning of pSpec depends on the actual builder implementation, but should allow the builder implementation to identify the records in the training set belonging to this category.

◆ AddCooccurrence()

_.Library.Status AddCooccurrence ( _.Library.List  pValue,
_.Library.String  pNegation,
_.Library.String  pCount,
_.Library.Integer  pIndex 
)
private

Adds one or more Cooccurrences as a single term to the Text Categorization model's term dictionary.

The term is counted only if it appears in the negation context defined by pNegation. pCount determines how the term's base score, which is fed into the categorization algorithm, is calculated. With "exactCount", only exact occurrences of this cooccurrence's entities are counted. With "partialCount", both exact and partial matches are counted. With "partialScore", the scores of all exact and partial matches are summed.

A single cooccurrence can be supplied as a one-dimensional array of strings or as a List. Multiple cooccurrences can be supplied either as a one-dimensional array of Lists or as a two-dimensional array of strings.

◆ AddEntity()

_.Library.Status AddEntity ( _.Library.String  pValue,
_.Library.String  pNegation,
_.Library.String  pCount,
_.Library.Integer  pIndex 
)
private

Adds one or more entities as a single term to the Text Categorization model's term dictionary.

The term is counted only if it appears in the negation context defined by pNegation. pCount determines how the term's base score, which is fed into the categorization algorithm, is calculated. With "exactCount", only exact occurrences of this entity are counted. With "partialCount", both exact and partial matches are counted. With "partialScore", the scores of all exact and partial matches are summed.

Multiple entities can be supplied either as a one-dimensional array or as a List.
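
The three count policies can be illustrated with a small Python sketch. This is not the IRIS implementation: the `Match` structure and the idea that a partial match carries a score between 0 and 1 are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One occurrence of a term in a record (hypothetical structure)."""
    exact: bool    # True for a full match, False for a partial match
    score: float   # assumed match quality in (0, 1]; 1.0 for an exact match


def base_score(matches, count_policy="exactCount"):
    """Return a term's base score under the given count policy."""
    if count_policy == "exactCount":
        # only exact occurrences are counted
        return float(sum(1 for m in matches if m.exact))
    if count_policy == "partialCount":
        # exact and partial matches each count as one
        return float(len(matches))
    if count_policy == "partialScore":
        # the scores of all exact and partial matches are summed
        return float(sum(m.score for m in matches))
    raise ValueError(f"unknown count policy: {count_policy}")
```

With one exact match and two partial matches scored 0.5 and 0.25, the sketch yields 1.0, 3.0 and 1.75 for the three policies respectively.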

◆ AddTermsFromSQL()

_.Library.Status AddTermsFromSQL ( _.Library.String  pSQL,
_.Library.String  pType,
_.Library.String  pNegationContext,
_.Library.String  pCount 
)
private

Adds all terms selected by pSQL as pType, taking the string value from the column named "term" with negation context pNegationContext and count policy pCount. If there are columns named "type", "negation" or "count" selected by the query, any values in these columns will be used instead of the defaults supplied through the respective parameters.

When adding CRC or Cooccurrence terms, use colons to separate the composing entities.
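
The column-override and colon-splitting rules can be sketched in Python as follows. This models the described behavior on plain dictionaries rather than on a real SQL result set; the function name, the default parameter values and the type names "crc" and "cooccurrence" are assumptions for illustration.

```python
def terms_from_rows(rows, p_type="entity", p_negation="undefined",
                    p_count="exactCount"):
    """Turn query rows into (value, type, negation, count) tuples.

    rows: iterable of dicts keyed by column name (hypothetical stand-in
    for the result set of pSQL). The "term" column is required; "type",
    "negation" and "count" are optional per-row overrides.
    """
    terms = []
    for row in rows:
        # per-row column values take precedence over the parameter defaults
        term_type = row.get("type") or p_type
        negation = row.get("negation") or p_negation
        count = row.get("count") or p_count
        value = row["term"]
        if term_type.lower() in ("crc", "cooccurrence"):
            # composite terms list their entities separated by colons
            value = tuple(part.strip() for part in value.split(":"))
        terms.append((value, term_type, negation, count))
    return terms
```

A row such as `{"term": "landed:helicopter pad", "type": "cooccurrence"}` would then produce a two-entity cooccurrence term, while a row with only a "term" column falls back to the defaults.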

◆ CreateClassifierClass()

_.Library.Status CreateClassifierClass ( _.Library.String  pClassName,
_.Library.Boolean  pVerbose,
_.Library.Boolean  pIncludeBuilderInfo,
_.Library.Boolean  pOverwrite,
_.Library.Boolean  pCompile 
)
private

Generates a classifier definition and saves it to a <class>iKnow.Classification.Classifier</class> subclass named pClassName. This will overwrite any existing class with that name if pOverwrite is 1. See also <method>GenerateClassifier</method>.

◆ ExportDataTable()

_.Library.Status ExportDataTable ( _.Library.String  pClassName,
_.Library.Boolean  pOverwrite,
_.Library.Boolean  pVerbose,
_.Library.Boolean  pTracking 
)
private

Exports the data in the training set to a new table pClassName, with columns containing the weighted score for each term.

◆ GenerateClassifier()

_.Library.Status GenerateClassifier ( _.iKnow.Classification.Definition.Classifier  pDefinition,
_.Library.Boolean  pIncludeBuilderInfo,
_.Library.Boolean  pVerbose 
)
private

Generates a <class>iKnow.Classification.Definition.Classifier</class> XML tree based on the current set of categories and terms, with the appropriate weights and parameters calculated by the builder implementation (see <method>OnGenerateClassifier</method>).

Use pIncludeBuilderInfo to include specifications of how this classifier was built so it can be "reloaded" from the classifier XML to retrain the model.

◆ GetCategoryInfo()

_.Library.Status GetCategoryInfo (   pCategories)
private

Returns all categories added so far:

   pCategories(n) = $lb([name], [record count])

Reimplemented in IKnowBuilder, and IFindBuilder.

◆ GetTerms()

_.Library.Status GetTerms (   pTerms)
private

Returns all terms added so far:

   pTerms(n) = $lb([string value], [type], [negation policy], [count policy])

◆ LoadFromDefinition()

_.Library.Status LoadFromDefinition ( _.Library.String  pClassName,
_.iKnow.Classification.Builder  pBuilder,
_.Library.Boolean  pValidateFirst 
)
static private

Loads the categories and terms from an existing Classifier class pClassName.


Note: this does not load any (custom) weight information from the definition.

◆ PopulateTerms()

_.Library.Status PopulateTerms ( _.Library.Integer  pCount,
_.Library.String  pType,
_.Library.String  pMetric,
_.Library.Boolean  pPerCategory 
)
private

Adds pCount terms of type pType to this classifier's set of terms, selecting those terms that have a high relevance for the categorization task based on metric pMetric and/or the specifics of this builder implementation.

If pPerCategory is 1, (pCount \ [number of categories]) terms are selected using the specified metric as calculated within each category. This often gives better results, but might not be supported for every metric or builder.

Builder implementations should ensure these terms meet the conditions set by <property>MinimumSpread</property> and <property>MinimumSpreadPercent</property>. <property>MinimumSpreadPercent</property> can be ignored if pPerCategory = 1.

This method implements a populate method for pMetric = "NaiveBayes", selecting terms based on their highest average per-category probability. In this case, the value of pPerCategory is ignored (automatically treated as 1). Implementations for other metrics can be provided by subclasses.
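
The per-category selection described above can be sketched in Python. The sketch assumes the metric values have already been calculated per category and are passed in as a dictionary; how duplicates across categories are handled is also an assumption, not documented behavior.

```python
def select_per_category(scores, count):
    """Pick the top terms within each category.

    scores: {category: {term: metric value}} (hypothetical input; the real
    builder computes the metric itself).
    count:  total number of terms requested, as in pCount.
    """
    # integer division, mirroring (pCount \ [number of categories])
    quota = count // len(scores)
    selected = []
    for category in sorted(scores):
        # rank by descending metric value, breaking ties alphabetically
        ranked = sorted(scores[category].items(),
                        key=lambda item: (-item[1], item[0]))
        for term, _value in ranked[:quota]:
            if term not in selected:  # assumed: a term is added only once
                selected.append(term)
    return selected
```

Note that integer division means a request for 5 terms over 2 categories selects 2 terms per category, so fewer than pCount terms may be returned in this sketch.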

Reimplemented in IKnowBuilder.

◆ RemoveTerm()

_.Library.Status RemoveTerm ( _.Library.String  pValue,
_.Library.String  pType,
_.Library.String  pNegation,
_.Library.String  pCount 
)
private

Removes pValue from the first term that contains it and meets the pType, pNegation and pCount criteria. If this is the last entry for that term, the whole term is removed.
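
A minimal Python sketch of this removal logic, assuming the term dictionary is modeled as a list of dictionaries with an "entries" list (a hypothetical structure, not the builder's internal storage):

```python
def remove_term(terms, value, term_type="entity", negation="undefined",
                count="exactCount"):
    """Remove value from the first matching term; return True if removed.

    terms: list of dicts with keys "entries", "type", "negation", "count"
    (hypothetical model of the term dictionary). The list is modified in
    place. Default parameter values are assumptions.
    """
    for index, term in enumerate(terms):
        criteria = (term["type"], term["negation"], term["count"])
        if criteria != (term_type, negation, count):
            continue
        if value in term["entries"]:
            term["entries"].remove(value)
            if not term["entries"]:
                # last entry gone: drop the whole term
                del terms[index]
            return True
    return False
```

Removing "helicopter" from a composite term holding "helicopter" and "helicopters" leaves a single-entry term, whereas removing the only entry of a term deletes that term from the list.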

◆ RemoveTermAtIndex()

_.Library.Status RemoveTermAtIndex ( _.Library.Integer  pIndex)
private

Removes the term at index pIndex.

If the term at this position is a composite one, all its entries are dropped along with it.

◆ TestClassifier()

_.Library.Status TestClassifier ( _.Library.RawString  pTestSet,
  pResult,
_.Library.Double  pAccuracy,
_.Library.String  pCategorySpec,
_.Library.Boolean  pVerbose 
)
private

Utility method to batch-test the classifier against a test set pTestSet.

Per-record results are returned through pResult:
pResult(n) = $lb([record ID], [actual category], [predicted category])

pAccuracy will contain the raw accuracy (# of records predicted correctly) of the current model. Use <class>iKnow.Classification.Utils</class> for more advanced model testing.

If the current model's category options were added through <method>AddCategory</method> without an appropriate category specification, use pCategorySpec to refer to the actual category values to test against.
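
To show how the per-record results relate to the accuracy figure, here is a small Python sketch that models pResult as a list of tuples. Whether pAccuracy is reported as a count or as a fraction is not stated above, so the sketch returns both; the function name is hypothetical.

```python
def raw_accuracy(results):
    """Summarize batch-test results.

    results: list of (record_id, actual_category, predicted_category)
    tuples, mirroring the pResult(n) layout.
    Returns (number of correct predictions, fraction correct).
    """
    correct = sum(1 for _rec, actual, predicted in results
                  if actual == predicted)
    fraction = correct / len(results) if results else 0.0
    return correct, fraction
```

For four test records of which three are predicted correctly, this gives a count of 3 and a fraction of 0.75.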

Reimplemented in IKnowBuilder, and IFindBuilder.

Member Data Documentation

◆ ClassificationMethod

ClassificationMethod

The general method used for classification:

  • "naiveBayes" uses a probabilistic approach based on Bayes' theorem (Naive Bayes).
  • "rules" runs through a set of straightforward decision rules based on boolean expressions, each contributing to a single category's score if it fires. The category with the highest score wins.
  • "euclideanDistance" treats the per-category term weights as a vector in the same vector space as the document term vector and calculates the Euclidean distance between these vectors and the query vector.
  • "cosineSimilarity" also treats the per-category term weights as a vector in the same vector space as the document term vector and looks at the (cosine of the) angle between these vectors.
  • "linearRegression" considers the per-category term weights to be coefficients in a linear regression formula for calculating a category score, with the highest value winning.
  • "pmml" delegates the mathematical work to a predictive model defined in PMML. See also <class>iKnow.Classification.Methods.pmml</class>.
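
The three vector-based methods can be compared with a short Python sketch. It is a simplified illustration, not the classifier's actual code: the function names are hypothetical, and the rule that the smallest Euclidean distance wins is an assumption.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def linear_score(weights, doc, intercept=0.0):
    """Category score with term weights used as regression coefficients."""
    return intercept + sum(w * x for w, x in zip(weights, doc))


def classify(doc, category_vectors, method):
    """Return the winning category for a document term vector."""
    if method == "euclideanDistance":
        # assumed: the closest category vector wins
        return min(category_vectors,
                   key=lambda c: euclidean_distance(category_vectors[c], doc))
    if method == "cosineSimilarity":
        return max(category_vectors,
                   key=lambda c: cosine_similarity(category_vectors[c], doc))
    if method == "linearRegression":
        return max(category_vectors,
                   key=lambda c: linear_score(category_vectors[c], doc))
    raise ValueError(f"method not covered by this sketch: {method}")
```

Each category is represented by one weight per term, in the same order as the entries of the document vector.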

 

◆ Description

Description

Optional description for the Classifier.

 

◆ DocumentVectorLocalWeights

DocumentVectorLocalWeights

Local Term Weights for the document vector to register in the ClassificationMethod element.

This might be overruled for some classification methods (e.g. Naive Bayes, which always uses "binary").

◆ DocumentVectorNormalization

DocumentVectorNormalization

Document vector normalization method to register in the Classification element.

This might be overruled for some classification methods (e.g. Naive Bayes, which always uses "none").

◆ MinimumSpread

MinimumSpread

The minimum number of records in the training set that should contain a term before it can get selected by <method>PopulateTerms</method>. (Can be bypassed for specific terms by adding them through <method>AddTerm</method>.)

◆ MinimumSpreadPercent

MinimumSpreadPercent

The minimum fraction of records in the training set that should contain a term before it can get selected by <method>PopulateTerms</method>, EXCEPT if it occurs in more than 50% of the records in at least one category. (Can be bypassed for specific terms by adding them through <method>AddTerm</method>.)
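
The interplay of the two spread thresholds can be sketched in Python. The sketch assumes both thresholds must hold together and that the 50% exception waives only the fraction requirement; neither point is spelled out above. The record layout and function name are hypothetical.

```python
def meets_spread(term, records, minimum_spread, minimum_spread_percent):
    """Check whether a term is eligible for automatic selection.

    records: list of (category, set of terms) pairs, a hypothetical model
    of the training set.
    minimum_spread_percent: a fraction between 0 and 1.
    """
    total = len(records)
    containing = sum(1 for _cat, terms in records if term in terms)
    # absolute threshold (assumed to apply in all cases)
    if containing < minimum_spread:
        return False
    # relative threshold over the whole training set
    if total and containing / total >= minimum_spread_percent:
        return True
    # exception: the term covers more than 50% of at least one category
    per_category = {}
    for category, terms in records:
        hits, size = per_category.get(category, (0, 0))
        per_category[category] = (hits + (term in terms), size + 1)
    return any(hits / size > 0.5 for hits, size in per_category.values())
```

A term that is rare overall but present in most records of one small category therefore remains eligible, which keeps strong indicators for small categories from being filtered out.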