RX Groups Node
Overview
Regular expressions are text strings that describe character patterns. The RX Groups node uses a regular expression to match specific items in a string, and each matched item is added to a new output field. The generated output fields are all strings. Unlike the RX Split node, which defines a delimiter that separates the values of interest, the RX Groups node supports extracting components that follow different patterns.
The node generates new string fields containing the groups that have been extracted from the input field along with a field containing the full match.
The node uses the ICU Regular Expressions package. Full details can be found in the ICU documentation.
Settings
Match field
This selects the string field containing the text that should be matched against the Pattern.
Prefix match field to field names
This specifies how the new field names should be generated:
- when checked (which is the default), each new field name is the name of the Match field followed by either the All match name value (for the full match) or a group name from Group names (for each group)
- when unchecked, the new field names are based on the All match name value and the group names defined by Group names
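The naming rule above can be sketched in a few lines of Python. This is only an illustration: `output_field_names` is a hypothetical helper, not part of the node or its scripting API, and the field order (full match first, then groups) is assumed from the example tables on this page.

```python
def output_field_names(match_field, all_match_name, group_names,
                       prefix_match_field=True):
    """Hypothetical helper: list the output field names the node would create.

    Assumes the full-match field comes first, followed by one field per group.
    """
    # When the option is checked, every new name starts with the Match field name.
    prefix = match_field if prefix_match_field else ""
    return [prefix + all_match_name] + [prefix + name for name in group_names]


# Option checked (default): names are joined to the Match field name.
print(output_field_names("URL", "_MATCH", ["Protocol"]))

# Option unchecked: names are used exactly as given.
print(output_field_names("URL", "Matched", ["Protocol", "Remainder"],
                         prefix_match_field=False))
```

With the option checked, `_MATCH` reads naturally as a suffix; with it unchecked, a standalone name such as `Matched` is usually the better choice.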
Pattern
This defines the actual regular expression that will be matched against content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.
Regular Expression Options…
These are described in Advanced Settings below.
All match name
This defines either the suffix which will be appended to the Match field or the full name of the new field, depending on the setting of Prefix match field to field names. The resulting output field contains the full pattern that was matched, or `$null$` if the input field did not match the regular expression in Pattern.
Group names
This defines the name and number of output fields that will be generated for each group defined in Pattern.
Reference groups
This defines how the groups defined by the Pattern are mapped to output fields:
- By position: (the default setting) this will produce output fields based on the order in which the groups are defined by Pattern, i.e. the first column will contain the first group matched, the second column the second group matched, and so on. The Pattern regular expression may include group names but they will be ignored.
- By name: this will produce output fields based on the group names defined by Pattern
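The difference between the two modes can be sketched with Python's `re` module standing in for the ICU engine. The assumptions are significant: Python writes named groups as `(?P<name>...)` rather than ICU's `(?<name>...)`, `[A-Za-z]` replaces `[[:alpha:]]`, `extract_groups` is a hypothetical helper, and the mapping logic is an interpretation of the descriptions above, not the node's actual implementation.

```python
import re


def extract_groups(text, pattern, group_names, grouping="position"):
    """Hypothetical helper: map regex groups to output field names.

    Returns None for every field when the text does not match.
    """
    m = re.search(pattern, text)
    if m is None:
        return {name: None for name in group_names}
    if grouping == "name":
        # By name: each output field takes the group with the same name.
        return {name: m.group(name) for name in group_names}
    # By position: the n-th output field takes the n-th group;
    # any group names in the pattern are ignored.
    return {name: m.group(i) for i, name in enumerate(group_names, start=1)}


pattern = r"(?P<Protocol>[A-Za-z]+)://(?P<Remainder>.*)"
url = "https://www.amazon.co.uk/"

# The names are deliberately listed in the opposite order to the pattern.
print(extract_groups(url, pattern, ["Remainder", "Protocol"], "position"))
print(extract_groups(url, pattern, ["Remainder", "Protocol"], "name"))
```

Listing the names in a different order from the groups in the pattern shows why the setting matters: by position, the field called `Remainder` receives the protocol text, whereas by name each field receives the group it is named after.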
Advanced Settings
These settings control the general behaviour of the regular expression matcher. The default is for all settings to be unchecked. These can generally be left in their default state.
Case insensitive
When checked, regular expression matching will ignore character case.
Multiline
By default, `^` and `$` match the start and end of the input text. When checked, `^` and `$` will also match the start and end of each line within the input text.
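As a rough illustration of the Multiline option, the sketch below uses Python's `re.MULTILINE` flag, which is assumed here to behave like the ICU option for the `^` anchor. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# Default: ^ only matches at the very start of the input text.
default_hits = re.findall(r"^\w+", text)

# Multiline: ^ also matches at the start of each line.
multiline_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
```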
Match ‘.’ as line terminator
When checked, a `.` in a pattern will also match a line terminator in the input text; by default it will not.
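The effect of this option can be sketched with Python's `re.DOTALL` flag, assumed here to be the closest equivalent in Python's `re` module. The sample text is made up.

```python
import re

text = "line one\nline two"

# Default: '.' does not match the newline, so the pattern cannot span lines.
default_match = re.search(r"one.line", text)

# With the option on: '.' also matches the newline.
dotall_match = re.search(r"one.line", text, flags=re.DOTALL)

print(default_match)                 # None
print(repr(dotall_match.group(0)))   # 'one\nline'
```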
Comments in patterns
When checked, white space and #comments are allowed within regular expression patterns.
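A commented pattern might look like the sketch below. It uses Python's `re.VERBOSE` flag as a stand-in for the ICU option, with Python's `(?P<name>...)` group syntax and `[A-Za-z]` in place of `[[:alpha:]]`; the same layout idea is assumed to carry over to the node's Pattern setting.

```python
import re

# White space and #comments inside the pattern are ignored by the matcher.
pattern = r"""
    (?P<Protocol>[A-Za-z]+)   # protocol name, letters only
    ://                       # separator between protocol and the rest
    (?P<Remainder>.*)         # everything after the separator
"""

m = re.search(pattern, "http://www.icu-project.org/apiref/", flags=re.VERBOSE)
print(m.group("Protocol"))   # http
print(m.group("Remainder"))  # www.icu-project.org/apiref/
```

Because white space is ignored in this mode, a literal space in the pattern has to be escaped or placed in a character class.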
Use Unicode word boundaries
This controls the behaviour of `\b` in a pattern. When checked, word boundaries are found according to the definitions of word found in Unicode UAX 29.
Examples
All examples assume other settings are set to default.
Extract URL Protocol By Position
A URL includes a protocol (such as `http`, `https` or `ftp`) as the first item up to the `:`. The pattern assumes that the protocol only contains alphabetic characters, although some protocols can include numbers (e.g. Amazon `s3://`).
Match field: URL
Pattern: ([[:alpha:]]+):
All match name: _MATCH
Group names:
Protocol
Reference groups: By position
URL | URL_MATCH | URLProtocol | Notes |
---|---|---|---|
https://www.amazon.co.uk/ | https: | https | Valid URL |
http://www.icu-project.org/apiref/ | http: | http | Valid URL |
www.dmg.org | $null$ | $null$ | Missing protocol, no match |
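The rows in this table can be reproduced outside the node with a short Python sketch. It rests on several assumptions: Python's `re` module stands in for ICU, `[A-Za-z]` replaces `[[:alpha:]]`, `None` stands in for `$null$`, the pattern is assumed to be searched for anywhere in the text, and `protocol_row` is a hypothetical helper.

```python
import re

# Python equivalent of the example pattern ([[:alpha:]]+):
PATTERN = re.compile(r"([A-Za-z]+):")


def protocol_row(url):
    """Hypothetical helper: build one output row for the given URL."""
    m = PATTERN.search(url)
    if m is None:
        # No match: the full-match field and the group field are both null.
        return {"URL": url, "URL_MATCH": None, "URLProtocol": None}
    # group(0) is the full match, group(1) is the first (only) group.
    return {"URL": url, "URL_MATCH": m.group(0), "URLProtocol": m.group(1)}


for url in ["https://www.amazon.co.uk/",
            "http://www.icu-project.org/apiref/",
            "www.dmg.org"]:
    print(protocol_row(url))
```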
Extract URL Protocol By Group Name
This example is similar to the previous one. However, in this case, the group is given a name of `Protocol` in the Pattern, which is then used to reference the named output field by changing Reference groups to By name. Note that the results are the same as the previous example, and would also be the same even if Reference groups was changed to By position, because there is only a single group being matched.
Match field: URL
Pattern: (?<Protocol>[[:alpha:]]+):
All match name: _MATCH
Group names:
Protocol
Reference groups: By name
URL | URL_MATCH | URLProtocol | Notes |
---|---|---|---|
https://www.amazon.co.uk/ | https: | https | Valid URL |
http://www.icu-project.org/apiref/ | http: | http | Valid URL |
www.dmg.org | $null$ | $null$ | Missing protocol, no match |
Extract URL Protocol And Remainder By Group Name
This is similar to the previous example except that this time, all the text after the *protocol*`://` component is also matched to a group.
Match field: URL
Pattern: (?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)
All match name: _MATCH
Group names:
Protocol
Remainder
Reference groups: By name
URL | URL_MATCH | URLProtocol | URLRemainder | Notes |
---|---|---|---|---|
https://www.amazon.co.uk/ | https://www.amazon.co.uk/ | https | www.amazon.co.uk/ | Valid URL |
http://www.icu-project.org/apiref/ | http://www.icu-project.org/apiref/ | http | www.icu-project.org/apiref/ | Valid URL |
www.dmg.org | $null$ | $null$ | $null$ | No match |
Scripting
Settings
Node type name: regexp_groups
Setting | Property | Type | Comment |
---|---|---|---|
Match field | match_field | Field | – |
Prefix match field to field names | prefix_match_field | Boolean | – |
Pattern | pattern | String | – |
All match name | all_match_name | String | – |
Group names | group_names | String List | – |
Reference groups | grouping | position or name | – |
Case insensitive | opt_case_insensitive | Boolean | – |
Multiline | opt_multiline | Boolean | – |
Match ‘.’ as line terminator | opt_dotall | Boolean | – |
Comments in patterns | opt_comments | Boolean | – |
Use Unicode word boundaries | opt_uword_boundaries | Boolean | – |
Scripting Example
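# Create an RX Groups node in the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_groups", u"RX Groups", 512, 192)
# Match against the URL field; do not prefix the new field names with "URL"
node.setPropertyValue("match_field", u"URL")
node.setPropertyValue("prefix_match_field", False)
# Named groups capture the protocol and everything after "://"
node.setPropertyValue("pattern", u"(?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)")
node.setPropertyValue("all_match_name", u"Matched")
node.setPropertyValue("group_names", [u"Protocol", u"Remainder"])
# Map groups to output fields by group name rather than by position
node.setPropertyValue("grouping", u"name")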