Pattern Matching - Extraction
- pulling substrings that match a pattern out of a string
- a string can contain several parts (groups) that each match a sub-pattern
Extraction Examples
- extract email addresses from a customer message
- extract links or content from an HTML tag
- look for valid textual input from a survey
- extract brand name information from a product description
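A minimal sketch of the first use case, pulling email addresses out of a customer message. The message text and the simplified email pattern are assumptions made for illustration; real-world address validation is more involved.

```python
import re

# Hypothetical customer message; the addresses are made up for illustration.
message = "Hi, please reply to jane.doe@example.com or to support@example.org. Thanks!"

# Simplified email pattern (an assumption, not a full validator):
# local part, "@", then two or more dot-separated domain labels.
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# findall returns every non-overlapping substring that matches the pattern
emails = re.findall(email_pattern, message)
print(emails)  # ['jane.doe@example.com', 'support@example.org']
```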
Extraction Groups
- a group is defined by enclosing a pattern in parentheses
- a pattern can contain several groups
- Groups are referred to by the order in which they were defined (starting at 1)
- Group zero refers to the entire match (the matched substring, not the whole input string)
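The numbering rules can be sketched with a small example. The date string below is a hypothetical input chosen only to show three groups and group zero side by side.

```python
import re

# Hypothetical sample: three groups for year, month, and day.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Shipped on 2024-03-15 by courier")

if match:
    print(match.group(0))  # entire match: 2024-03-15
    print(match.group(1))  # first group:  2024
    print(match.group(2))  # second group: 03
    print(match.group(3))  # third group:  15
```

Note that group zero is only the matched part (2024-03-15), not the full input string.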
Grouping Characters
(...)
# starts and ends a group of regex patterns
(?P<name>...)
# starts and ends a named group (called name) of regex patterns
Match Object Methods
match_object.group(group1, group2,...)
# returns the substring matched by the specified group
# (a tuple of substrings if several groups are given)
# a group can be either an integer index or a string name
# if no arguments are passed, defaults to group zero
match_object.groups()
# returns a tuple of all the subgroups of the match
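A short sketch contrasting group() and groups() on one match object. The key=value input and the group names key and value are hypothetical, picked only for illustration.

```python
import re

# Hypothetical input with two named groups.
match = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "color=blue")

if match:
    print(match.group())         # no argument, group zero: color=blue
    print(match.group(1, 2))     # several indexes give a tuple: ('color', 'blue')
    print(match.group("value"))  # string name: blue
    print(match.groups())        # all subgroups: ('color', 'blue')
```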
Match Groups Example 1
import re
pattern = r'(\w*)ware'
# any prefix followed by "ware"; the prefix is captured as group 1
for word in ['software', 'hardware', 'dinnerware']:
    # 3 words we want to extract from
    match = re.search(pattern, word)
    if match:
        # print the first group [match.group(1)] and the entire match [match.group(0)]
        print(match.group(1), match.group(0))
# OUTPUT
# soft software
# hard hardware
# dinner dinnerware
Match Groups Example 2
import re
pattern = r'([A-Za-z]+)\s?(\d+)'
# first group of letters, second group of digits, with an optional whitespace character in between
for word in ['S100', 'Porsche911', 'dinnerware', 'Cyperdyne 101']:
    # products we want to extract from; 'dinnerware' has no digits, so it does not match
    match = re.search(pattern, word)
    if match:
        # match.group(1) is the part with the letters, match.group(2) the part with the digits
        print(f'Brand: {match.group(1)}\tModel: {match.group(2)}')
# OUTPUT
# Brand: S Model: 100
# Brand: Porsche Model: 911
# Brand: Cyperdyne Model: 101
Named Groups
- instead of numbers, we can also refer to groups by name
- group() accepts the string name as an argument
- defining a group name consists of adding the following string after the opening parenthesis: ?P<name>
Named Match Groups Example
import re
pattern = r'(?P<brand>[A-Za-z]+)\s?(?P<model>\d+)'
for word in ['S100', 'Porsche911', 'dinnerware', 'Cyperdyne 101']:
    match = re.search(pattern, word)
    if match:
        # named groups are read with group("name")
        print(f'Brand: {match.group("brand")}\tModel: {match.group("model")}')
# OUTPUT
# Brand: S Model: 100
# Brand: Porsche Model: 911
# Brand: Cyperdyne Model: 101
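As a possible follow-up, named groups can also be collected all at once with the match object's groupdict() method, which returns a dictionary keyed by group name. This is a sketch that assumes a letters-only brand and a digits-only model; the results list is a helper introduced here for illustration.

```python
import re

# Letters-only brand and digits-only model are assumptions for this sketch.
pattern = re.compile(r"(?P<brand>[A-Za-z]+)\s?(?P<model>\d+)")

results = []
for word in ["S100", "Porsche911", "dinnerware", "Cyperdyne 101"]:
    match = pattern.search(word)
    if match:
        # groupdict() maps each group name to the substring it captured
        results.append(match.groupdict())

print(results)
# [{'brand': 'S', 'model': '100'}, {'brand': 'Porsche', 'model': '911'}, {'brand': 'Cyperdyne', 'model': '101'}]
```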