String Cleaner Node
Overview
The String Cleaner node provides common string cleaning operations in a form that is simple to use. Operations include removing non-standard characters, trimming leading and/or trailing whitespace, and changing capitalization. These operations can be applied across multiple input fields; a new string output field is created for each input field.
Settings
The node settings are split into related sub-groups.
Fields
Clean fields
This is used to select which string fields should be cleaned.
Output suffix
New field names are generated by appending the output suffix to the name of each field selected in Clean fields.
Clean fields: HomePhone, MobilePhone
Output suffix: _Cleaned
Output fields generated: HomePhone_Cleaned, MobilePhone_Cleaned
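The naming rule can be sketched in plain Python, run outside the node. This only illustrates the concatenation described above; `output_field_names` is a hypothetical helper name, not part of the node.

```python
def output_field_names(clean_fields, output_suffix):
    """Return one output field name per selected input field."""
    # Each output name is the input field name followed by the suffix.
    return [name + output_suffix for name in clean_fields]


print(output_field_names(["HomePhone", "MobilePhone"], "_Cleaned"))
```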
Whitespace
Leading and trailing spaces
This specifies how strings should be trimmed:
- None: (the default setting) the string is not trimmed
- Left: removes spaces at the start of the string
- Right: removes spaces at the end of the string
- Both: removes spaces at the start and end of the string
Replace tab with space
When checked, tab characters will be replaced with space characters.
Replace duplicate space or tab with space
When checked, any run of two or more adjacent space or tab characters will be replaced with a single space character.
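The three whitespace settings can be approximated in plain Python as below. This is a sketch under stated assumptions, not the node's implementation: the order in which the node applies the options is not documented here, so the sketch assumes tabs are replaced first, then duplicates are collapsed, then the string is trimmed, and it assumes trimming removes only space characters. The function name `clean_whitespace` is hypothetical; the parameter names mirror the scripting properties.

```python
import re


def clean_whitespace(text, trim_mode="none", replace_tabs=False,
                     replace_duplicate_blanks=False):
    """Approximate the Whitespace settings on a single string."""
    if trim_mode not in ("none", "left", "right", "both"):
        raise ValueError("unknown trim_mode: %r" % (trim_mode,))
    if replace_tabs:
        # Every tab becomes one space.
        text = text.replace("\t", " ")
    if replace_duplicate_blanks:
        # A run of two or more spaces/tabs becomes a single space.
        text = re.sub(r"[ \t]{2,}", " ", text)
    # Assumption: only space characters are trimmed, and trimming runs last.
    if trim_mode in ("left", "both"):
        text = text.lstrip(" ")
    if trim_mode in ("right", "both"):
        text = text.rstrip(" ")
    return text


print(repr(clean_whitespace("  a\t\tb  ", "both", True, True)))
```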
Capitalization
Capitalize
This specifies how character case should be changed in the string:
- Leave unchanged: (the default setting) character cases are not modified
- ALL UPPER CASE: any lower case characters are converted to the equivalent upper case characters
- all lower case: any upper case characters are converted to the equivalent lower case characters
Character Categories
Categories
This section lists various character categories which can be checked or unchecked.
- Upper case English characters: characters representing the letters A to Z
- Lower case English characters: characters representing the letters a to z
- Digits: characters representing the numbers 0 to 9
- Punctuation: the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- Blanks: space or tab characters
- Spaces: space, tab, new line, vertical tab, form feed or carriage return characters
- Non-printing characters: other characters that are not normally visible but can sometimes be included in strings
Category handling
This specifies how character categories should be handled:
- Remove selected categories: (the default setting) characters belonging to any checked category are removed from the string
- Keep selected categories and remove others: characters belonging to any checked category are kept and all other characters are removed
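A minimal Python sketch of category handling follows, assuming that "remove" drops characters in any checked category and "keep" drops everything else (which matches the phone number example in Examples). The dictionary keys reuse the scripting property names; `apply_categories` is a hypothetical helper, and treating "non-printing characters" as the ASCII control characters is an assumption, since the exact set is not specified here.

```python
import string

CATEGORIES = {
    "find_upper_english_chars": set(string.ascii_uppercase),
    "find_lower_english_chars": set(string.ascii_lowercase),
    "find_digits": set(string.digits),
    "find_punctuation": set(string.punctuation),
    "find_blanks": set(" \t"),
    "find_spaces": set(" \t\n\v\f\r"),
    # Assumption: "non-printing" is taken to mean ASCII control characters.
    "find_non_printing_chars": {chr(c) for c in range(32)} | {chr(127)},
}


def apply_categories(text, selected, categories_mode="remove"):
    """Remove or keep the characters that fall in the selected categories."""
    if categories_mode not in ("remove", "keep"):
        raise ValueError("unknown categories_mode: %r" % (categories_mode,))
    # Union of all characters covered by the checked categories.
    chars = set()
    for name in selected:
        chars |= CATEGORIES[name]
    if categories_mode == "remove":
        return "".join(ch for ch in text if ch not in chars)
    return "".join(ch for ch in text if ch in chars)


print(apply_categories("a-b c", ["find_punctuation", "find_blanks"]))
```

With nothing selected and the default "remove" mode, the sketch returns the string unchanged.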
Examples
All examples assume other settings are set to default.
Clean phone numbers
This removes any non-digit characters from phone number strings.
Clean fields: MobilePhone
Output suffix: _Cleaned
Digits: checked
Category handling: Keep selected categories and remove others
MobilePhone | MobilePhone_Cleaned | Notes |
---|---|---|
+44 1234 56789 | 44123456789 | – |
(555) 567890 | 555567890 | – |
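For comparison, the same result can be reproduced outside the node with a single regular expression in plain Python. This is only an equivalent check of the two table rows, not a description of how the node works internally; `keep_digits` is a hypothetical name.

```python
import re


def keep_digits(value):
    """Drop every character that is not an ASCII digit 0-9."""
    return re.sub(r"[^0-9]", "", value)


for raw in ["+44 1234 56789", "(555) 567890"]:
    print(keep_digits(raw))
```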
Scripting
Settings
Node type name: regexp_cleaner
Setting | Property | Type | Comment |
---|---|---|---|
Clean fields | clean_fields | String List | – |
Output suffix | output_suffix | String | – |
Leading and trailing spaces | trim_mode | none, left, right or both | – |
Replace tab with space | replace_tabs | Boolean | – |
Replace duplicate space or tab with space | replace_duplicate_blanks | Boolean | – |
Capitalize | capitalize_mode | none, upper or lower | – |
Upper case English characters | find_upper_english_chars | Boolean | – |
Lower case English characters | find_lower_english_chars | Boolean | – |
Digits | find_digits | Boolean | – |
Punctuation | find_punctuation | Boolean | – |
Blanks | find_blanks | Boolean | – |
Spaces | find_spaces | Boolean | – |
Non-printing characters | find_non_printing_chars | Boolean | – |
Category handling | categories_mode | remove or keep | – |
Scripting Example
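# Create a String Cleaner node on the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_cleaner", u"String Cleaner", 512, 192)
# Fields: which fields to clean and the suffix for the generated output fields
node.setPropertyValue("clean_fields", [u"HomePhone", u"MobilePhone"])
node.setPropertyValue("output_suffix", u"_processed")
# Whitespace: trim both ends, replace tabs and collapse repeated spaces or tabs
node.setPropertyValue("trim_mode", u"both")
node.setPropertyValue("replace_tabs", True)
node.setPropertyValue("replace_duplicate_blanks", True)
# Capitalization: leave character case unchanged
node.setPropertyValue("capitalize_mode", u"none")
# Character categories: only Digits is selected
node.setPropertyValue("find_upper_english_chars", False)
node.setPropertyValue("find_lower_english_chars", False)
node.setPropertyValue("find_digits", True)
node.setPropertyValue("find_punctuation", False)
node.setPropertyValue("find_blanks", False)
node.setPropertyValue("find_spaces", False)
node.setPropertyValue("find_non_printing_chars", False)
# Category handling: keep the selected categories and remove all other characters
node.setPropertyValue("categories_mode", u"keep")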
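When many nodes are configured from a script, it can help to check the enumerated values from the settings table before calling `setPropertyValue`. The sketch below is one possible way to do that. `configure_string_cleaner` and `FakeNode` are hypothetical names introduced for illustration; `FakeNode` is only a stand-in so the sketch runs outside Modeler, and inside a real script the node returned by `createAt` would be passed instead.

```python
# Allowed values taken from the settings table above.
ALLOWED_VALUES = {
    "trim_mode": ("none", "left", "right", "both"),
    "capitalize_mode": ("none", "upper", "lower"),
    "categories_mode": ("remove", "keep"),
}


def configure_string_cleaner(node, settings):
    """Validate enumerated settings, then apply every setting to the node."""
    for prop, value in settings.items():
        allowed = ALLOWED_VALUES.get(prop)
        if allowed is not None and value not in allowed:
            raise ValueError("%s must be one of %s, got %r"
                             % (prop, ", ".join(allowed), value))
    # Nothing is applied unless all enumerated values passed validation.
    for prop, value in settings.items():
        node.setPropertyValue(prop, value)


class FakeNode(object):
    """Hypothetical stand-in that records property values like a node would."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


node = FakeNode()
configure_string_cleaner(node, {
    "clean_fields": ["HomePhone", "MobilePhone"],
    "output_suffix": "_processed",
    "trim_mode": "both",
    "find_digits": True,
    "categories_mode": "keep",
})
print(node.properties["categories_mode"])
```

Validating before applying means a typo such as an unsupported `trim_mode` value is reported before any property on the node has been changed.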