Regular expressions are strings used to describe particular character patterns. These expressions can be used to match and group text fragments, search for patterns and replace them, or split text into multiple pieces. Example uses include:
- Extracting components such as time, severity and descriptive text from log files
- Converting different phone number styles into a standard format
- Splitting URLs (web page addresses) into the individual components that make up each address
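To make these uses concrete, here is a minimal sketch using Python's standard `re` module. It does not use the product's nodes. The log layout, the 10-digit phone format, and the helper names (`parse_log_line`, `normalize_phone`, `split_url`) are assumptions chosen for illustration only.

```python
import re

# Assumed log layout: "YYYY-MM-DD HH:MM:SS [SEVERITY] message text"
LOG_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<severity>[A-Z]+)\] "
    r"(?P<text>.*)$"
)

# Simplified URL pattern (illustrative, not a full URL grammar)
URL_PATTERN = re.compile(
    r"^(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)://"
    r"(?P<host>[^/:?#]+)"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/[^?#]*)?"
    r"(?:\?(?P<query>[^#]*))?"
)


def parse_log_line(line):
    """Return time, severity and text as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def normalize_phone(raw):
    """Rewrite a 10-digit number as NNN-NNN-NNNN; return other input unchanged."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) != 10:
        return raw
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", digits)


def split_url(url):
    """Return scheme, host, port, path and query as a dict, or None if no match."""
    match = URL_PATTERN.match(url)
    return match.groupdict() if match else None
```

Components that are absent from a URL (for example, a missing port) come back as `None` rather than raising an error.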
Regular Expressions for IBM SPSS Modeler is a set of new nodes that adds the power of regular expressions to IBM SPSS Modeler. The new nodes are:
- RX Groups: this node matches specific items in a string and extracts them into new output fields
- RX Split: this node splits a string into separate components using a specified delimiter and adds the components to new output fields
- RX Replace: this node matches patterns within a string field and converts them to a different pattern, writing the result to a new output field
- String Cleaner: this node provides common string cleaning operations (e.g. removing duplicate whitespace or non-printing characters) across multiple input fields in a way that is simple to use
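As a rough illustration of the kind of operation the split and cleaning nodes describe, the following sketch uses Python's standard `re` module. It is not the nodes' implementation or their scripting interface. The function names, the default delimiter pattern, and the choice of which characters count as non-printing are all assumptions.

```python
import re

# Assumed definition of "non-printing": ASCII control characters other than
# tab, newline and carriage return (those are handled as whitespace below).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
WHITESPACE_RUN = re.compile(r"\s+")


def split_fields(value, delimiter_pattern=r"\s*[;,]\s*"):
    """Split a string on a delimiter pattern (default: comma or semicolon)."""
    return re.split(delimiter_pattern, value.strip())


def clean_string(value):
    """Remove control characters, collapse whitespace runs, and trim the ends."""
    without_controls = CONTROL_CHARS.sub("", value)
    return WHITESPACE_RUN.sub(" ", without_controls).strip()


def clean_record(record, field_names):
    """Return a copy of the record with the named string fields cleaned."""
    cleaned = dict(record)
    for name in field_names:
        cleaned[name] = clean_string(record[name])
    return cleaned
```

Applying one cleaning function across a list of field names mirrors the idea of running the same operation over multiple input fields at once.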
Scalable
Regular Expressions for IBM SPSS Modeler scales to millions of data rows. Unlike approaches that rely on temporary files, these new nodes process records “in-line”, i.e. one at a time in memory. This speeds up processing considerably while keeping memory requirements low and removing the need for additional temporary disk space.
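The general idea of record-at-a-time processing can be sketched with a Python generator. This only illustrates the concept and makes no claim about how the nodes work internally. The function name `extract_stream` and the dict-based record layout are assumptions.

```python
import re


def extract_stream(records, pattern, field):
    """Lazily apply a pattern with named groups to one field of each record.

    Only one record is held in memory at a time, and no temporary file is
    written. Records that do not match get None for every named group.
    """
    compiled = re.compile(pattern)
    for record in records:
        output = dict(record)
        # Start with None for every named group, then fill in any match.
        output.update(dict.fromkeys(compiled.groupindex))
        match = compiled.search(record[field])
        if match:
            output.update(match.groupdict())
        yield output
```

Because the function yields results as it goes, the input can be any iterable, including one that reads rows from a file or database cursor without loading them all first.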
Scriptable
The nodes are fully configurable using the Python scripting language provided by IBM SPSS Modeler. Online documentation for each node includes its scripting reference.
System Requirements
- IBM SPSS Modeler 17.x or 18.x
- Windows 7, 8 or 10 (64-bit)
Trademarks
IBM and SPSS are registered trademarks of International Business Machines Corp.