IRISLIB database
Matcher Class Reference

The Class <CLASS>Regex.Matcher</CLASS> creates an object that does pattern matching using regular expressions.


Public Member Functions

_.Library.Integer EndGet (_.Library.Integer group)
 The EndGet method implements the <property>End</property> property.
 
_.Library.Integer GroupCountGet ()
 The GroupCountGet method implements the <property>GroupCount</property> property.
 
_.Library.String GroupGet (_.Library.Integer group)
 The GroupGet method implements the <property>Group</property> property.
 
_.Library.Boolean HitEndGet ()
 The HitEndGet method implements the <property>HitEnd</property> property.
 
_.Library.Boolean Locate (_.Library.Integer position)
 The method Locate finds a match for the regular expression <property>Pattern</property> in the text string <property>Text</property>.
 
_.Library.Boolean LookingAt (_.Library.Integer position)
 The method LookingAt attempts to find a match in the property <property>Text</property> that must start at a particular character position.
 
_.Library.Boolean Match (_.Library.String text)
 The method Match returns true if the entire string <property>Text</property> is matched by <property>Pattern</property>.
 
_.Library.Status OperationLimitSet (limit)
 The OperationLimitSet method implements the side effects of doing a Set assignment to change the value of the <property>OperationLimit</property> property.
 
_.Library.Status PatternSet (_.Library.String pattern)
 The PatternSet method implements Set assignments to the <property>Pattern</property> property.
 
_.Library.String ReplaceAll (_.Library.String replacement)
 The method ReplaceAll returns a modified copy of the property <property>Text</property> in which every match is replaced.
 
_.Library.String ReplaceFirst (_.Library.String replacement)
 The method ReplaceFirst returns a modified copy of the property <property>Text</property> in which the first match is replaced.
 
_.Library.String RequiredPrefixGet ()
 The RequiredPrefixGet method implements the <property>RequiredPrefix</property> property.
 
 ResetPosition (_.Library.Integer position)
 The method ResetPosition resets any saved state from the previous match.
 
_.Library.Integer StartGet (_.Library.Integer group)
 The StartGet method implements the <property>Start</property> property.
 
_.Library.String SubstituteIn (_.Library.String text)
 The method SubstituteIn returns the string that results from substituting capturing groups from the most recent match into the argument text.
 
_.Library.Status TextSet (_.Library.String text)
 The TextSet method implements Set assignments to the <property>Text</property> property.
 
- Public Member Functions inherited from RegisteredObject
_.Library.Status OnAddToSaveSet (_.Library.Integer depth, _.Library.Integer insert, _.Library.Integer callcount)
 This callback method is invoked when the current object is added to the SaveSet.
 
_.Library.Status OnClose ()
 This callback method is invoked by the <METHOD>Close</METHOD> method.
 
_.Library.Status OnConstructClone (_.Library.RegisteredObject object, _.Library.Boolean deep, _.Library.String cloned)
 This callback method is invoked by the <METHOD>ConstructClone</METHOD> method.
 
_.Library.Status OnNew ()
 This callback method is invoked by the <METHOD>New</METHOD> method.
 
_.Library.Status OnValidateObject ()
 This callback method is invoked by the <METHOD>ValidateObject</METHOD> method.
 

Static Public Member Functions

_.Library.Status LastStatus ()
 The class method LastStatus returns the <class>Status</class> value containing additional details about the most recent <REGULAR EXPRESSION> system error.
 
- Static Public Member Functions inherited from Help
_.Library.String Help (_.Library.String method)
 This is a helper class that is used by the various SYSTEM classes to provide a Help method.
 

Public Attributes

 End
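 The property End without a subscript contains the character position in property <property>Text</property> one beyond the final character of the string found by the last match.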
 
 Group
 The property Group without a subscript contains the string found by the last match.
 
 GroupCount
 The property GroupCount contains the number of capturing groups in the regular expression <property>Pattern</property>.
 
 HitEnd
 The property HitEnd is true if the most recent matching operation touched the end of property <property>Text</property> at any point during its processing.
 
 OperationLimit
 The property OperationLimit provides a way to limit the time taken by a regular expression match.
 
 Pattern
 The property Pattern is the string representation of the regular expression of the Matcher.
 
 RequiredPrefix
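 The property RequiredPrefix contains a string which, if nonempty, is a sequence of characters which must occur at the start of any string which matches the <property>Pattern</property>.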
 
 Start
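 The property Start without a subscript contains the character position in property <property>Text</property> of the first character of the string found by the last match.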
 
 Status
 The property Status contains a <class>Status</class> value which may provide more information about the last system exception thrown by this object.
 
 Text
 The property Text is the string to which the regular expression will be applied.
 

Private Attributes

 __PreviousMatchEnd
 PreviousMatchEnd is the End value of the previous match.
 

Additional Inherited Members

- Static Public Attributes inherited from RegisteredObject
 CAPTION = None
 Optional name used by the Form Wizard for a class when generating forms.
 
 JAVATYPE = None
 The Java type to be used when exported.
 
 PROPERTYVALIDATION = None
 This parameter controls the default validation behavior for the object.
 

Detailed Description

The Class <CLASS>Regex.Matcher</CLASS> creates an object that does pattern matching using regular expressions.

The regular expressions come from the International Components for Unicode (ICU). The ICU maintains web pages at https://icu.unicode.org.

The definition and features of the ICU regular expression package can be found in https://unicode-org.github.io/icu/userguide/strings/regexp.html.

On most platforms, installing InterSystems IRIS will also install an appropriate version of the ICU libraries. On platforms that do not have an ICU library available, evaluating any regular expression function or method will result in an <UNIMPLEMENTED> error.

A Regex.Matcher object can be created by evaluating
    ##class(%Regex.Matcher).%New(pattern) or
    ##class(%Regex.Matcher).%New(pattern,text).
The first parameter to <method>New</method> becomes the initial value of the property <property>Pattern</property>. The optional second parameter to <method>New</method> becomes the initial value of the property <property>Text</property>. Setting property <property>Pattern</property> to a regular expression pattern string causes that pattern to be compiled into the Matcher object, where it can be used for multiple matching operations without being recompiled. The property <property>Text</property> contains the subject text string that is searched by a regular expression match. Note that an empty string is considered to be an illegal regular expression, so the first parameter to <method>New</method> can be neither missing nor the empty string.
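As a minimal sketch of the compile-once behaviour described above, the hypothetical procedure below creates one matcher and applies it to two subject strings by reassigning Text. The label ReuseMatcher is illustrative only, and the code assumes the %-prefixed spellings %Regex.Matcher and %New for the class and constructor.

```objectscript
ReuseMatcher() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    ; The pattern \d+ is compiled once, when the matcher is created
    set m = ##class(%Regex.Matcher).%New("\d+","order 66")
    if m.Locate() { write m.Group,! }
    ; Assigning Text keeps the compiled pattern and resets the match state
    set m.Text = "room 101"
    if m.Locate() { write m.Group,! }
}
```

Run as written, this should write 66 and then 101, one per line.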

If x is a <CLASS>Regex.Matcher</CLASS> object then the built-in method <method>ConstructClone</method> can be used to copy x (Set xnew = x.%ConstructClone()). The state of the most recent match and any error value in the <property>Status</property> property are not cloned. The <method>ConstructClone</method> method can be faster than creating a new Matcher with the same Pattern, because it can copy the instructions for the matching engine rather than recompiling the original pattern string. On 8-bit systems <method>ConstructClone</method> can also copy the Unicode versions of the Pattern and Text properties without repeating the character-by-character conversion from the NLS 8-bit character set into Unicode.
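The following sketch illustrates cloning under the assumption that the real spellings are %Regex.Matcher, %New and %ConstructClone. The label CloneDemo and the sample strings are made up for illustration.

```objectscript
CloneDemo() PUBLIC {
    ; Assumption: %Regex.Matcher, %New and %ConstructClone are the real spellings
    set x = ##class(%Regex.Matcher).%New("[aeiou]","regular")
    ; The clone starts with the same Pattern and Text as x
    set xnew = x.%ConstructClone()
    ; Giving the clone its own Text leaves x untouched
    set xnew.Text = "expression"
    write x.ReplaceAll("*"),!
    write xnew.ReplaceAll("*"),!
}
```

The two matchers share a pattern but are otherwise independent, so each ReplaceAll call operates on its own Text.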

None of the methods or operations in the <CLASS>Regex.Matcher</CLASS> package return a <class>Status</class> value. When an error is detected, these operations always throw the system exception thrown by the kernel code that interfaces to the ICU library. If a program wants to recover from a regular expression error then it is recommended that the code doing regular expression operations be surrounded with a TRY {...} block and that the error recovery be done in the corresponding CATCH {...} block. Note that a TRY block imposes no run-time performance overhead in situations where no error occurs.
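A minimal sketch of the recommended TRY/CATCH arrangement follows. SafeLocate is a hypothetical wrapper, the -1 return convention is an assumption of this example, and the class and constructor are written as %Regex.Matcher and %New on the assumption that those are their real spellings.

```objectscript
SafeLocate(pattern,text) PUBLIC {
    ; Hypothetical wrapper: returns 1 or 0 for match or no match, -1 on an error
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    try {
        set m = ##class(%Regex.Matcher).%New(pattern,text)
        set found = m.Locate()
    } catch ex {
        ; Any system exception raised by the regex code lands here
        write "Regex error: ",ex.DisplayString(),!
        set found = -1
    }
    quit found
}
```

Calling it with an unbalanced pattern such as "(unclosed" should take the CATCH path, while a well-formed pattern returns the result of Locate.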

The methods and operations in a <CLASS>Regex.Matcher</CLASS> object will catch any <REGULAR EXPRESSION> system error and will generate a <class>Status</class> value that may better describe that error. That <class>Status</class> value will be stored in the <property>Status</property> property of the <CLASS>Regex.Matcher</CLASS> object and in the variable %objlasterror. After saving the <class>Status</class> value, the original unmodified <REGULAR EXPRESSION> system exception will be rethrown. You may examine that <class>Status</class> value by executing the following InterSystems IRIS ObjectScript command:
    do $system.Status.DisplayError(%objlasterror)
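The sketch below shows one way the saved status might be inspected after the rethrown exception is caught. ShowStatus is a hypothetical label, and %Regex.Matcher and %New are assumed to be the real spellings of the class and constructor.

```objectscript
ShowStatus() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("\d+")
    try {
        ; "[a-" is an unterminated character class, so compilation should fail
        set m.Pattern = "[a-"
    } catch ex {
        ; The rethrown system exception arrives here
        write ex.DisplayString(),!
        ; The more descriptive status was saved on the object before the rethrow
        do $system.Status.DisplayError(m.Status)
    }
}
```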

Some other system errors, like <STRING STACK>, are passed through the <CLASS>Regex.Matcher</CLASS> methods without modification.

Note that some ICU operation errors are not considered errors by the <CLASS>Regex.Matcher</CLASS> package. Examples are evaluating the <property>Start</property> and <property>End</property> properties when the previous matching operation failed. In these cases <property>Start</property> and <property>End</property> have value -2 as a character position rather than throwing an error.
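As a small illustration, assuming the %Regex.Matcher and %New spellings, the hypothetical procedure below reads Start and End after a failed match without any TRY block.

```objectscript
FailedMatch() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("\d+","no digits here")
    if 'm.Locate() {
        ; No error is thrown; per the text above both values are -2
        write m.Start," ",m.End,!
    }
}
```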

Examples:

Regular expression that finds titles M., Mr., Mrs. and Ms. in a string: "\bMr?s?\."
"\b" matches a break at the beginning (or ending) of a word
"M" matches an upper-case letter-M
"r?" matches 0 or 1 occurrences of a lower-case letter-r
"s?" matches 0 or 1 occurrences of a lower-case letter-s
"\." matches a period character

    USER>set matcher=##class(%Regex.Matcher).%New("\bMr?s?\.")
    USER>set matcher.Text="Mrs. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance"
    USER>while matcher.Locate() {write "Found ",matcher.Group," at position ",matcher.Start,!}
    Found Mrs. at position 1
    Found Mr. at position 19
    Found Ms. at position 37
    Found M. at position 54
    USER>write matcher.ReplaceAll("Dr.")
    Dr. Sally Jones, Dr. Mike McMurry, Dr. Amy Johnson, Dr. Maurice LaFrance
    USER>write matcher.ReplaceFirst("Dr.")
    Dr. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance
    


Regular expression that matches phone numbers of the form "(aaa) bbb-cccc" or of the form "aaa-bbb-cccc": (\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b

(\((\d{3})\)\s*|\b(\d{3})-) matches either prefix "(aaa) " or prefix "aaa-". The outer parentheses capture this entire prefix as Group(1) and limit the range of the two prefix subpatterns joined in alternation by the | operator.

\((\d{3})\)\s* matches prefix "(aaa) "
\( and \) and \s* match "(" and ")" and zero or more spaces, respectively
\d{3} matches exactly 3 digits
(\d{3}) the parentheses capture these 3 digits as Group(2)

\b(\d{3})- matches prefix "aaa-"
\b this "break" allows no other digit or letter immediately before the 3 digits
(\d{3}) captures these 3 digits as Group(3)

(\d{3})- matches "bbb-" and captures these 3 digits as Group(4)

(\d{4}) matches "cccc" and captures these 4 digits as Group(5)

\b this final "break" makes sure the match is not immediately followed by another digit or a letter

    ListPhones(s,a) PUBLIC {
        ; a is a reference variable.  On return
        ; a contains the number of phone numbers in string s
        ; a(i) contains just the digits of the i'th phone number
        kill a
        set a = 0
        set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
        set m.Text = s
        while m.Locate() {
            ; Get first three digits from Group(2) or Group(3)
            if m.Start(2)>0 { set n=m.Group(2) }
            else { set n=m.Group(3) }
            ; Concatenate middle 3 digits and final 4 digits
            set n = n_m.Group(4) _ m.Group(5)
            ; Insert digit string into array a
            set a($increment(a)) = n
        }
    }

    ListPhones2(s,a) PUBLIC {
        ; a is a reference variable.  On return
        ; a contains the number of phone numbers in string s
        ; a(i) is i'th phone number formatted as "(aaa)bbb-cccc"
        ; Note, no blank after "(aaa)"
        kill a
        set a = 0
        set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
        set m.Text = s
        while m.Locate() {
            ; Digits are the concatenation of capture groups 2,3,4,5
            ; One of group 2 or 3 is the empty string when group is not used
            set a($increment(a)) = m.SubstituteIn("($2$3)$4-$5")
        }
    }

    USER>write ^t2
    Call 617-555-1212 about item number 61773-333-4569
    USER>do ListPhones^ListPhones(^t2,.a)
    USER>zwrite a
    a=1
    a(1)=6175551212

    USER>write ^t3
    Phone (212) 334-5397, (321)770-2121 and 603-646-0110
    USER>do ListPhones^ListPhones(^t3,.a)
    USER>zwrite a
    a=3
    a(1)=2123345397
    a(2)=3217702121
    a(3)=6036460110

    USER>write ^t3
    Phone (212) 334-5397, (321)770-2121 and 603-646-0110
    USER>do ListPhones2^ListPhones(^t3,.a)
    USER>zwrite a                         
    a=3
    a(1)="(212)334-5397"
    a(2)="(321)770-2121"
    a(3)="(603)646-0110"
    



Member Function Documentation

◆ LastStatus()

_.Library.Status LastStatus ( )
static

The class method LastStatus returns the <class>Status</class> value containing additional details about the most recent <REGULAR EXPRESSION> system error. If a <class>Regex.Matcher</class> object encounters a <REGULAR EXPRESSION> error then this status is already available in the <property>Status</property> property of the object. Executing
    Do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
is useful when debugging a <REGULAR EXPRESSION> error following a call to $MATCH, $LOCATE or ##class(%Regex.Matcher).%New(x), where a <class>Regex.Matcher</class> oref value is not available.
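A sketch of that debugging pattern around $MATCH is shown below. DebugMatch is a hypothetical helper, and the class name is written as %Regex.Matcher on the assumption that this is its real spelling.

```objectscript
DebugMatch(text,pattern) PUBLIC {
    ; Hypothetical helper; assumes the class is spelled %Regex.Matcher
    try {
        set ok = $match(text,pattern)
    } catch ex {
        ; No Matcher oref exists here, so ask the class for the last status
        do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
        set ok = 0
    }
    quit ok
}
```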

◆ Locate()

_.Library.Boolean Locate ( _.Library.Integer  position)

The method Locate finds a match for the regular expression <property>Pattern</property> in the text string <property>Text</property>.

If the optional argument position is defined as an integer 1 or greater then the search for a match begins at that character position of <property>Text</property>.

If the argument position is not defined then the search for the match begins at the character position following the previous match.

Locate returns 1 if the match is found; 0 otherwise.
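The two calling forms can be sketched as follows, assuming the %Regex.Matcher and %New spellings. LocateDemo is a hypothetical label.

```objectscript
LocateDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("a","banana")
    ; Start searching at character position 3; the next "a" is at position 4
    if m.Locate(3) { write m.Start,! }
    ; Without an argument the search resumes after the previous match
    while m.Locate() { write m.Start,! }
}
```

This should write 4 and then 6, because the "a" at position 2 lies before the requested starting position.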

◆ LookingAt()

_.Library.Boolean LookingAt ( _.Library.Integer  position)

The method LookingAt attempts to find a match in the property <property>Text</property> that must start at a particular character position. The match need not extend to the end of <property>Text</property>.

The argument position gives the starting character position of the attempted match.

LookingAt returns 1 if the match is found; 0 otherwise.
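A short sketch of the anchored behaviour, assuming the %Regex.Matcher and %New spellings (LookingAtDemo is a hypothetical label):

```objectscript
LookingAtDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("\d+","abc123def")
    ; No digit starts at position 1, so this writes 0
    write m.LookingAt(1),!
    ; "123" starts at position 4; the trailing "def" does not matter
    write m.LookingAt(4),!
    write m.Group,!
}
```

The three lines written should be 0, 1 and 123.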

◆ Match()

_.Library.Boolean Match ( _.Library.String  text)

The method Match returns true if the entire string <property>Text</property> is matched by <property>Pattern</property>; it returns false if it does not match.

The argument text is optional. If the argument text is defined then the property <property>Text</property> is set to its value before the match is executed.
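For illustration, and assuming the %Regex.Matcher and %New spellings, the hypothetical procedure below contrasts a whole-string match with a partial one.

```objectscript
MatchDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("\d{3}-\d{4}")
    ; The whole string fits the pattern, so this writes 1
    write m.Match("555-1212"),!
    ; Extra leading text means the entire string no longer matches: writes 0
    write m.Match("call 555-1212"),!
}
```

Each call also assigns its argument to the Text property before matching.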

◆ OperationLimitSet()

_.Library.Status OperationLimitSet (   limit)

The OperationLimitSet method implements the side effects of doing a Set assignment to change the value of the <property>OperationLimit</property> property.

◆ PatternSet()

_.Library.Status PatternSet ( _.Library.String  pattern)

The PatternSet method implements Set assignments to the <property>Pattern</property> property.

◆ ReplaceAll()

_.Library.String ReplaceAll ( _.Library.String  replacement)

The method ReplaceAll returns a modified copy of the property <property>Text</property>. It replaces every substring of <property>Text</property> that matches the <property>Pattern</property> with a replacement string. Portions of <property>Text</property> that are not matched are copied without change. The value of ReplaceAll is the resulting string. The property <property>Text</property> is not modified.

The argument replacement supplies the string to replace each matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
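A brief sketch of capture-group references in the replacement string, assuming the %Regex.Matcher and %New spellings (SwapDemo and the sample text are illustrative):

```objectscript
SwapDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("(\w+)@(\w+)","ann@home bob@work")
    ; $1 is the part before "@", $2 the part after it
    write m.ReplaceAll("$2:$1"),!
    ; Text itself is unchanged by the call
    write m.Text,!
}
```

The first line written should be "home:ann work:bob" and the second the original text.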

◆ ReplaceFirst()

_.Library.String ReplaceFirst ( _.Library.String  replacement)

The method ReplaceFirst returns a modified copy of the property <property>Text</property>. It replaces the first substring of <property>Text</property> that matches the <property>Pattern</property> with a replacement string. Portions of <property>Text</property> that are not matched are copied without change. The value of ReplaceFirst is the resulting string. The property <property>Text</property> is not modified.

The argument replacement supplies the string to replace the matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
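As a small sketch of the $0 reference, assuming the %Regex.Matcher and %New spellings (MarkFirst is a hypothetical label):

```objectscript
MarkFirst() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("\d+","7 of 9")
    ; $0 stands for the whole matched region; only the first match is replaced
    write m.ReplaceFirst("<$0>"),!
}
```

This should write "<7> of 9", leaving the second number as it was.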

◆ RequiredPrefixGet()

_.Library.String RequiredPrefixGet ( )

The RequiredPrefixGet method implements the <property>RequiredPrefix</property>

property.

◆ ResetPosition()

ResetPosition ( _.Library.Integer  position)

The method ResetPosition resets any saved state from the previous match. It also causes the next call to the method <method>Locate</method>() without an argument to begin at the specified character position.

The argument position is the character position from which the next call to <method>Locate</method>() without an argument will begin match attempts.
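The sketch below rewinds a matcher after all matches have been consumed. It assumes the %Regex.Matcher and %New spellings, and ResetDemo is a hypothetical label.

```objectscript
ResetDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("a","banana")
    set n = 0
    ; Consume every match; a further Locate() would now return 0
    while m.Locate() { set n = n + 1 }
    ; Rewind so that the next argument-less Locate() starts at position 1
    do m.ResetPosition(1)
    if m.Locate() { write n," matches; first one at ",m.Start,! }
}
```

This should write "3 matches; first one at 2".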

◆ SubstituteIn()

_.Library.String SubstituteIn ( _.Library.String  text)

The method SubstituteIn returns the string that results from substituting capturing groups from the most recent regular expression match into components of the argument text. This method is undefined if the most recent regular expression match operation was not successful.

This method can be used as a low level step in regular expression replacement. It does not modify the property <property>Text</property>. For example, the method ..<method>ReplaceFirst</method>(x) is equivalent to:

    Quit:'..Locate(1) ..Text
    Quit $Extract(..Text,1,..Start-1)_..SubstituteIn(x)_$Extract(..Text,..End,*)
    

The argument text supplies the string into which the capture groups of the matched region are substituted; the result is returned. The string may contain references to capture groups which take the form of $1, $2, etc. The string may reference the entire matched region with $0.

◆ TextSet()

_.Library.Status TextSet ( _.Library.String  text)

The TextSet method implements Set assignments to the <property>Text</property> property.

Member Data Documentation

◆ End

End

The property End without a subscript contains the character position in property <property>Text</property> one beyond the final character of the string found by the last match.

The value of End(i) when subscripted with an integer i between 1 and <property>GroupCount</property> is the character position one beyond the last character of the last string successfully captured by capture group i.

The value of End(i) is -1 if capture group i did not participate in the last match. The values of End and End(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and <property>GroupCount</property>, the value of End(0) is identical to the value of End without a subscript. When the property End(...) is subscripted with values not described above then the attempt to evaluate the property End(...) is undefined.
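To make the subscripted values concrete, here is a sketch with an alternation in which only one capture group takes part in the match. It assumes the %Regex.Matcher and %New spellings, and EndDemo is a hypothetical label.

```objectscript
EndDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("(a)|(b)","xb")
    if m.Locate() {
        ; The match is "b" at position 2, so End is one beyond it
        write "Start=",m.Start," End=",m.End,!
        ; Group 1 did not participate; group 2 captured the "b"
        write "End(1)=",m.End(1)," End(2)=",m.End(2),!
    }
}
```

Following the rules above, this should report Start=2 and End=3, with End(1)=-1 and End(2)=3.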

 

◆ Group

Group

The property Group without a subscript contains the string found by the last match.

The value of Group(i) when subscripted with an integer i between 1 and <property>GroupCount</property> is the last string successfully captured by capture group i.

If the last match operation was unsuccessful or if the specified capture group was not used during the last match operation then Group and Group(i) contain the empty string. Note that <property>End</property> and <property>End</property>(i) have negative values when the last match operation did not use the specified capture group or did not succeed in matching.

Note: In addition to integer subscripts between 1 and <property>GroupCount</property>, the value of Group(0) is identical to the value of Group without a subscript. When the property Group(...) is subscripted with values not described above then the attempt to evaluate the property Group(...) is undefined.
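A minimal sketch that walks every group of a match, assuming the %Regex.Matcher and %New spellings (ShowGroups is a hypothetical label):

```objectscript
ShowGroups() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("(\d+)-(\d+)","10-20")
    if m.Locate() {
        ; Group(0) is the whole match; 1..GroupCount are the capture groups
        for i=0:1:m.GroupCount { write i,": ",m.Group(i),! }
    }
}
```

This should write "0: 10-20", "1: 10" and "2: 20" on separate lines.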

 

◆ GroupCount

GroupCount

The property GroupCount contains the number of capturing groups in the regular expression <property>Pattern</property>.

 

◆ HitEnd

HitEnd

The property HitEnd is true if the most recent matching operation touched the end of property <property>Text</property> at any point during its processing. In this case, appending additional input characters to the <property>Text</property> property could change the result of that match attempt.
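One possible use of HitEnd is deciding whether a failed match is worth retrying once more input has arrived. The sketch below is only an illustration of that idea: HitEndDemo is a hypothetical label, the %Regex.Matcher and %New spellings are assumed, and whether HitEnd is set for a given pattern depends on how the ICU engine processes it.

```objectscript
HitEndDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    set m = ##class(%Regex.Matcher).%New("abc","xxab")
    set found = m.Locate()
    ; A failed match that ran into the end of Text may succeed with more input
    if 'found,m.HitEnd { write "need more input",! }
    ; Supply the longer text and try again
    set m.Text = "xxabc"
    if m.Locate() { write "matched at ",m.Start,! }
}
```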

 

◆ OperationLimit

OperationLimit

The property OperationLimit provides a way to limit the time taken by a regular expression match. The default value for OperationLimit is 0 which indicates that there is no limit. Setting OperationLimit to a positive integer will cause a match operation to signal a TimeOut error after the specified number of clusters of steps by the match engine.

Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but cluster size is chosen such that each cluster's execution time will typically be on the order of milliseconds.
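The sketch below guards a pattern that is prone to heavy backtracking. It is an illustration under assumptions: LimitDemo is a hypothetical label, the %Regex.Matcher and %New spellings are assumed, and the exact error reported when the limit is exceeded is not asserted here.

```objectscript
LimitDemo() PUBLIC {
    ; Assumption: class and constructor are spelled %Regex.Matcher and %New
    ; Build 30 letters "a" with no "b"; (a+)+b backtracks heavily on such input
    set text = $translate($justify("",30)," ","a")
    set m = ##class(%Regex.Matcher).%New("(a+)+b",text)
    ; Allow only one cluster of engine steps before giving up
    set m.OperationLimit = 1
    try {
        set found = m.Locate()
        write "finished, found=",found,!
    } catch ex {
        ; Taken when the limit is exceeded and a system exception is thrown
        write "stopped: ",ex.DisplayString(),!
    }
}
```

Wrapping the call in TRY keeps a runaway pattern from aborting the caller, whichever path is taken.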

 

◆ Pattern

Pattern

The property Pattern is the string representation of the regular expression of the Matcher. Assigning to Pattern resets all saved state concerning the last matching operation.

On an installation using an NLS 8-bit character set other than Latin-1, you must be careful with patterns using a character class of the form [x-y] where x or y are national usage characters not in Latin-1. All regular expression matching is done in Unicode, so characters x and y are converted to Unicode. The character class [x-y] represents all characters between the Unicode translations of x and y, not the NLS 8-bit characters between x and y.

 

◆ __PreviousMatchEnd

__PreviousMatchEnd
private

PreviousMatchEnd is the End value of the previous match. It has value -1 if there is no current match and value 1 if there is a current match but no previous match.

 

◆ RequiredPrefix

RequiredPrefix

The property RequiredPrefix contains a string which, if nonempty, is a sequence of characters which must occur at the start of any string which matches the <property>Pattern</property>. A nonempty RequiredPrefix can be used to search a long string for a favorable position to start a regular expression matching operation.

In many cases the heuristics used by the ICU library to determine the RequiredPrefix do not include all possible characters of such a prefix. When a prefix cannot be determined, RequiredPrefix will contain the empty string. RequiredPrefix will also contain the empty string if the ICU library used by InterSystems IRIS does not support the RequiredPrefix feature.

 

◆ Start

Start

The property Start without a subscript contains the character position in property <property>Text</property> of the first character of the string found by the last match. If the matched string is the empty string then Start is the character position one beyond where the empty string was located (and the property Start equals the property <property>End</property>).

The value of Start(i) when subscripted with an integer i between 1 and <property>GroupCount</property> is the character position of the first character of the last string successfully captured by capture group i. If the captured string is the empty string then Start(i) is the character position one beyond where the empty string was captured (and the property Start(i) equals the property <property>End</property>(i)).

The value of Start(i) is -1 if capture group i did not participate in the last match. The values of Start and Start(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and <property>GroupCount</property>, the value of Start(0) is identical to the value of Start without a subscript. When the property Start(...) is subscripted with values not described above then the attempt to evaluate the property Start(...) is undefined.

 

◆ Status

Status

The property Status contains a <class>Status</class> value which may provide more information about the last system exception thrown by this object. It is initially $$$OK. Its value remains unchanged by any successful operation. The Status property is changed only when an error is thrown by the kernel functions implementing <class>Regex.Matcher</class> or by an ObjectScript Set assignment to the Status property done by the user.

 

◆ Text

Text

The property Text is the string to which the regular expression will be applied. Assigning to Text resets all saved state resulting from the most recent match operation. On installations using an 8-bit character code, the internal representation of Text is converted to Unicode. Therefore, on an installation using 8-bit characters the maximum length of the Text property is only half the maximum string length supported by that installation.