Regular Expressions (RegEx)
Regular Expressions, or RegEx, are a standardised technique for matching sub-strings within strings. A Regular Expression is the pattern or formula that is applied to the source string to find any matching sub-strings.
Regular Expressions are enabled in various scripting and programming languages, including Statelake scripting.
Regular Expressions can be used to confirm or validate the pattern of characters within a string, and they can be used to replace or extract portions of a string that match the specified RegEx pattern.
The key point is that RegEx can match on patterns within a character sequence - whereas conventional string searching looks for exact matches.
For example, consider the Statelake function Pos, which returns the position of the first occurrence of a specified pattern in a specified string.
To use Pos to find any instance of the digits 0 through 9 in a string of text, you would have to search for 0 as an individual digit, then run a separate search for 1, and so on, until every digit up to 9 had been searched for. Please refer to Pos for more information about this function.
In contrast, the function RegExMatch in conjunction with the RegEx symbol \d will find a matching digit regardless of its value.
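
The contrast can be sketched in Python rather than Statelake script, since the Statelake signatures of Pos and RegExMatch are not shown on this page. In this sketch `str.find` stands in for an exact-match search and `re.search` for a RegEx match; both helper names are hypothetical, and Python positions are zero-based, which may differ from Pos.

```python
import re

text = "Order ref AB-7 shipped"

def first_digit_by_exact_search(s):
    # Exact-match searching: one separate search per digit, 0 through 9.
    positions = []
    for digit in "0123456789":
        idx = s.find(digit)          # -1 when this digit is absent
        if idx != -1:
            positions.append(idx)
    return min(positions) if positions else -1

def first_digit_by_regex(s):
    # One pattern, \d, matches any digit regardless of its value.
    m = re.search(r"\d", s)
    return m.start() if m else -1

exact_position = first_digit_by_exact_search(text)
regex_position = first_digit_by_regex(text)
```

Both helpers locate the same digit, but the exact-match version needs up to ten searches where the RegEx version needs one pattern.
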
RegEx has an extensive syntax.
RegEx patterns can range from simple to complex, and there are many options for adjusting and refining the behaviour of a RegEx pattern.
Regular Expressions have a common base of functionality across the scripting and programming languages in which they are enabled, but the various implementations also have some differences. Information on advanced RegEx features can be found on the internet and elsewhere.
The following pages will introduce some of the basic syntax, introduce the base functionality, and explore any behaviours that are specific to Statelake.
Statelake implements the main functionality of RegEx through these four functions:
RegExMatch
RegExSplit
RegExExtract
RegExReplace
All four return a boolean True or False value - True indicating that a match was found and, where relevant, that extraction, replacement, or splitting has occurred, and False indicating that no match was found.
The RegExExtract, RegExReplace, and RegExSplit functions can be useful while you are building your confidence with RegEx. The RegExMatch function reports whether a match has been found, but it does not show what was matched. Submitting the same string and RegEx pattern to RegExExtract and/or RegExSplit will give you that answer.
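
As a rough illustration of the four operations, the sketch below uses the closest Python `re` calls. These are analogues only, not the Statelake functions: Python returns the results directly instead of a boolean, and how Statelake hands back the extracted, replaced, or split text is not described here.

```python
import re

source = "INV-1001, INV-1002; INV-1003"
pattern = r"\d+"

# Match: was the pattern found at all?
matched = re.search(pattern, source) is not None

# Extract: which sub-strings did the pattern match?
extracted = re.findall(pattern, source)

# Replace: substitute every match with other text.
replaced = re.sub(pattern, "#", source)

# Split: cut the string wherever a separator pattern matches.
pieces = re.split(r"[,;]\s*", source)
```

Running the extract step with the same string and pattern as the match step shows exactly what was matched, which is the confidence-building technique described above.
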
RegEx assigns meaning to specific characters so that they can be used to define Regular Expressions. The characters in the various classes are listed below.
Ordinary characters
Character Matching | Description |
---|---|
x | Most characters match themselves when used as is - i.e. a matches a, b matches b, abc matches abc, etc. Characters such as . \ [ ] ( ) { } ? * + ^ $ | have a special meaning in RegEx. To match one of these special characters literally, it must be escaped with a backslash (\) - e.g. use \. to match a period or decimal point. The backslash (\) is also used ahead of non-special characters to build a symbol that has a special meaning, such as \d to match a digit. |
[…] | Match any of the characters enclosed within the brackets, e.g. [abc] will match to a or b or c. |
[^…] | Match any character not included in the bracketed list - e.g. [^abc] will match d, 7, etc., i.e. any character except a, b, or c. |
[x-y] | Match any character from x to y inclusive, e.g. [a-z] will match any lowercase letter. |
| | The pipe (|) will match either/or - e.g. \b(black|white)\b will match black or white. |
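
The bracket and pipe forms in the table above can be tried out with a short sketch. It uses Python's `re` module as a stand-in, on the assumption that these basic constructs behave the same way in Statelake; the sample strings are made up for illustration.

```python
import re

# [abc] matches any one of the listed characters.
in_set = re.findall(r"[abc]", "cabbage")

# [^abc] matches any one character that is NOT listed.
not_in_set = re.findall(r"[^abc]", "cabbage")

# [a-z] matches any one character in the inclusive range.
lowercase = re.findall(r"[a-z]", "A1b2C3d")

# (black|white) matches either alternative; \b keeps it to whole words,
# so the "white" inside "whiteboard" is skipped.
either = re.findall(r"\b(black|white)\b", "black, whiteboard, white")
```
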
Escape Character | Description |
---|---|
\ | Use the backslash (\) ahead of characters such as . \ [ ] ( ) { } ? * + ^ $ | (or - inside […]) to match them as a character and not as a symbol. |
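
A minimal sketch of why escaping matters, again using Python's `re` module as a stand-in for Statelake. `re.escape` is a Python convenience for escaping a whole literal; whether Statelake offers an equivalent is not stated here.

```python
import re

sample = "12.50 12x50"

# Unescaped, the dot is a wild card, so "12x50" matches too.
unescaped = re.findall(r"12.50", sample)

# Escaped, \. matches only a literal period.
escaped = re.findall(r"12\.50", sample)

# Inside brackets, \- matches a literal hyphen instead of defining a range.
hyphen_literal = re.findall(r"[a\-z]", "a-m-z")

# A literal containing + and ? fails as a pattern until it is escaped.
raw_hit = re.search("1+1=2?", "is 1+1=2? yes") is not None
escaped_hit = re.search(re.escape("1+1=2?"), "is 1+1=2? yes") is not None
```
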
Case Sensitivity | Description |
---|---|
(?i) | Sets the matching of the following RegEx string (or part string) to be case-insensitive. Case-insensitive is the default in Statelake RegEx. |
(?-i) | Sets the matching of the following RegEx string (or part string) to be case-sensitive, i.e. turns off case-insensitivity. |
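
The effect of these switches can be sketched in Python, with two differences to keep in mind: Python's default is case-sensitive (the opposite of the Statelake default described above), and Python accepts (?-i) only in its scoped form, (?-i:…). The sample strings are illustrative.

```python
import re

# Python default: case-sensitive, so this does not match.
default_hit = re.search(r"statelake", "STATELAKE") is not None

# (?i) at the start makes the whole pattern case-insensitive.
insensitive_hit = re.search(r"(?i)statelake", "STATELAKE") is not None

# (?-i:...) switches case-insensitivity off again for part of the pattern.
mixed_hit = re.search(r"(?i)state(?-i:lake)", "STATElake") is not None
mixed_miss = re.search(r"(?i)state(?-i:lake)", "STATELAKE") is not None
```
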
Symbol characters
Symbol Matching | Description |
---|---|
. | Wild card - matches any single character. |
\d | Matches any digit - equivalent to [0123456789] or [0-9] |
\D | Matches any character that is not a digit. |
\w | Matches any digit, letter, or underscore (_). |
\W | Matches any character that is not a digit, letter, or an underscore (_). |
space | Space character - matches a single space. |
\t | Matches a Tab character (hex 09). |
\r | Matches a Carriage Return character (hex 0D). |
\n | Matches a New Line character (hex 0A). |
\s | Matches whitespace, i.e. space, Tab, \r, \n. |
\S | Matches any character that is not a whitespace character. |
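
The symbols above can be explored with the following sketch in Python's `re` module, used here as a stand-in. One caveat is that in Python the wild card does not match a New Line character unless the DOTALL flag is set; whether Statelake behaves the same way is not confirmed here.

```python
import re

sample = "id_42\tok\r\n"

digits = re.findall(r"\d", sample)        # each digit
word_runs = re.findall(r"\w+", sample)    # runs of digits, letters, underscores
non_word = re.findall(r"\W", sample)      # everything that is not \w
whitespace = re.findall(r"\s", sample)    # Tab, Carriage Return, New Line
non_space_runs = re.findall(r"\S+", sample)

# Python-specific: "." skips \n by default and includes it with DOTALL.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)
```
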
Repetition Symbols | Description |
---|---|
X? | 0 or 1 consecutive occurrences of X. |
X* | 0, 1, or more consecutive occurrences of X. |
X+ | 1 or more consecutive occurrences of X. |
X{m} | Exactly m consecutive occurrences of X. |
X{m,} | At least m consecutive occurrences of X. |
X{m,n} | Between m and n (inclusive) consecutive occurrences of X. |
(…) | Use parentheses to define groups, e.g. for repetition. |
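
Each repetition symbol can be checked against short strings, as in this sketch. It uses Python's `re.fullmatch` so that the whole string must fit the pattern; the helper name `fits` is hypothetical, and the assumption is that Statelake counts repetitions the same way.

```python
import re

def fits(pattern, s):
    # True only when the entire string matches the pattern.
    return re.fullmatch(pattern, s) is not None

results = {
    "optional_u":    (fits(r"colou?r", "color"), fits(r"colou?r", "colouur")),
    "zero_or_more":  (fits(r"ab*", "a"), fits(r"ab*", "abbb")),
    "one_or_more":   (fits(r"ab+", "a"), fits(r"ab+", "ab")),
    "exactly_four":  (fits(r"\d{4}", "2024"), fits(r"\d{4}", "202")),
    "at_least_two":  (fits(r"\d{2,}", "7"), fits(r"\d{2,}", "789")),
    "two_to_three":  (fits(r"\d{2,3}", "123"), fits(r"\d{2,3}", "1234")),
    # Parentheses group "ab" so the count applies to the pair, not just "b".
    "grouped_twice": (fits(r"(ab){2}", "abab"), fits(r"(ab){2}", "abb")),
}
```
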
Boundary matching
Symbol | Description |
---|---|
\A | The beginning of the string. |
\Z | The end of the string, ahead of any final line-break. |
\z | The end of the string, after any final line-break. |
^ | The beginning of a line, i.e. after a line-break. |
$ | The end of a line, ahead of the line-break. |
\b | A word boundary, i.e. the transition from \W to \w, or from \w to \W. |
\B | Any position that is not a word boundary, i.e. between two \w characters or between two \W characters. |
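
A short sketch of the boundary symbols, using Python's `re` module as a stand-in. Python differs from the table in two respects: ^ and $ act as line anchors only when the multi-line flag (?m) is set, and Python's end-of-string anchors do not map one-to-one onto \Z and \z, so they are left out here.

```python
import re

text = "cat concat\ncats"

# \b on both sides: "cat" as a whole word only.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B on the left: "cat" only where it continues a word, as in "concat".
inside_word = [m.start() for m in re.finditer(r"\Bcat\b", text)]

# \A anchors to the very start of the string.
at_start = re.search(r"\Acat", text) is not None
not_at_start = re.search(r"\Acats", text) is not None

# With (?m), ^ and $ match at the start and end of every line.
line_starts = re.findall(r"(?m)^\w+", text)
line_ends = re.findall(r"(?m)\w+$", text)
```
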