For G Suite and Office 360 Users, and for Python Developers
Language is rarely simple to work with when extracting insights. Even the task of finding and counting how many times a word appears in a body of text, a task that seems at first glance like it should be simple to do with modern word processors and coding libraries can run into complications quickly.
Take the example of trying to find if the sky is mentioned in the body of the text. The simple approach would be to just:
- Ctrl + F “sky” (if using a text editor or word processor)
- “sky” in body_of_text (if writing a script in python)
However, both approaches will not always find mentions of the sky, and will sometimes yield false positives. The limitations of these approaches come from how they are searching for the exact sequence of characters s-k-y. Listed below are some features of language that will cause this approach to fail:
- Capitalization: If a sentence begins with “Sky” or “Sky” appears as part of a proper noun (or if the author of the text had a broken caps lock key) these appearances of the word will not be found.
- Spellings of other words: The presence of words such as “pesky”, “risky”, etc will cause “sky” to be falsely detected.
- Plurals: “Skies” will be undetected.
Many word processors and text editors do have options available in their “Find and Replace” functions for addressing the first two of the above issues. In fact, typically options such as “Ignore Case” and “Match Whole Word” are turned on by default. However, the issue of missing alternative forms of a word (plurals, tenses, gender, etc) or alternative spellings (for example the many variations of the name Mohammad) will still be present.
With Python you could expand your search_phrase in body_of_text line of code by adding conditionals to cover every variation of the word and to exclude the false positive matches. But this can be a pain to do and can make your code for this supposedly simple task long and difficult to maintain.
Enter Regular Expressions
Regular expressions (or regex for short) are powerful tools for finding patterns of characters inside text. The idea is, instead of giving your code or word processor a specific sequence of characters to find, you give it a pattern (a regular expression) that describes the rules that you would like to use to broaden or restrict which segments of the text match the given pattern.
Specifying these rules is achieved through adding special characters to your regular expression. Take the above example of “sky”. A regular expression of “\bsk(y|ies)\b” will match both “sky” and “skies”, but will not match “pesky”, “risky”, or any other word containing s-k-y. Let’s break down the purpose of each of the special characters we’ve added:
The “\b” flanking each end of our expression specify word boundaries, i.e white-space, punctuation, the start or end of the text, etc. This is what prevents the regex from matching other words where “sky” forms part of the spelling (like “pesky” or “Skype”).
The parenthesis create a capture group. This is used for specifying which part of your regex you would like to apply a specific rule to. Within this capture group, we have the “|” (aka the vertical bar). This acts as an OR operator, specifying that either the expression to the left (“y”) or the expression to the right (“ies”) can form part of a match. This allows our regex to match both the singular “sky” and the plural “skies”.
For addressing the issue of Capitalisation, usually this is a non-issue as the ‘Ignore Case’ option in the Find functions in most word processors and text editors can also be used when searching using regex. When using the the Python regex library re, the IGNORECASE flag can be used when searching via regex (more on that later). However, without this option or this flag we can match “Sky”, “sky”, “Skies” and “skies” by augmenting our regex as such: “\b(S|s)k(y|ies)\b”
So as you can see, regular expressions has made our search for the sky a lot more fruitful! However this was relatively simple example of what regex can do. Listed below are a few more examples that demonstrate the power and versatility of regular expressions:
- M(u|o)hamm?(e|a)(d|t) matches most spelling variations of Mohammad
- (?<!Mountain )Bike matches Bike but not Mountain Bike
- coffee(?! date) matches coffee but not coffee date
- (https:\/\/)?(www\.)?[a-z0–9]+\.[a-z]+(\.[a-z]+)? matches many urls
- [a-zA-Z0–9-_.]+@[a-zA-Z]+\.[a-zA-Z]+(\.[a-zA-Z]+)? matches many email addresses
regexr : A Great Website for Learning Regex
The examples described earlier may have made you think “that looks useful, but how am I going to remember all the different special characters and what they do? And how can I tell if my regular expression works as intended?”. Don’t worry, regexr.com is here to help.
On this website you can build regular expressions, using the provided cheat sheets as guides, and paste in some text to test them out. Any matches will be highlighted in the text and a count of the matches displayed. Your regex is color coded and handy tool tips describe what each character does when hovering your mouse over them. This is great for learning regex by using regex.
Using Regex in Google Docs and Google Sheets
Using Regex in these G Suite apps is straightforward and user friendly. Simply press Ctrl + F, then click the three vertical dots in the search box that appears in the top left corner of the browser. Alternatively, navigate to Edit -> Find and Replace in the drop down menus. This opens the Find and Replace window. Here you will find the Match using regular expressions (Docs) or the Search using regular expressions (Sheets) checkbox. Once selected, you can paste in you regular expression into the Find field. Any matches will be highlighted in your document and a count of matches will display in the Find field. You can still use the other Find and replace options available, such as Match case, and your regex will behave accordingly.
Using “Regex” (wildcards) in MS Word and Excel
Unfortunately regular expressions are not supported in Word and Excel. However, they do allow for “wildcards”, which behave much the same way, though the symbols used are sometimes different.
In Word, searching using wildcards can be achieved using the Advanced Find and Replace feature. Details on how to navigate to this feature, how to enable wildcards, and what each wildcard character does can be found in this guide provided on the Office support website.
In Excel, it’s not immediately obvious how to search using wildcards as it is not part of any search feature. However, you can find cells matching your search phrase containing wildcards via filtering your sheets using Advanced Filters. Follow this guide from the Office support website where you can find how to use wildcards as part of your advanced filters under “Wildcard criteria”.
Using Regex in Python
The re library is provided as part of the base Python installation. This library provides many functions that allow you to analyze and manipulate strings using regular expressions.
A few key functions are:
- re.search(pattern, string, flags=0): This looks for the first location in string where pattern (your regular expression) matches, returning a match object describing the match, or None if no match found.
- re.match(pattern, string, flags=0): Behaves the same as re.search(), except it only looks for matches that start at the beginning of string
- re.split(pattern, string, flags=0) : Splits string into a list by matches to pattern, like the str.split() method but with regex.
- re.sub(pattern, repl, string, count=0, flags=0) : Replace occurrences of pattern in string with repl, if repl is a string, or with the output of repl if repl is a function that takes a single match object as an argument
The flags parameter allows you to specify flags such as re.IGNORECASE that modify the behavior of your regex patterns accordingly.
When writing your regex patterns as strings in python, it is helpful to use raw strings, as these allow you to use special characters such as “\” without having to escape them. For example, instead of:
- pattern = “\\bsk(y|ies)\\b”
- pattern = r”\bsk(y|ies)\b”
Full documentation for the re library can be found here: https://docs.python.org/3/library/re.html#module-re
As you can see, regular expressions can make so many tasks a lot easier when working with text. Even learning just a few special characters can allow for so many new uses of Find and Replace tools. Many word processors and text editors have recognized their usefulness and have allowed support for regular expressions, or have implemented their own features to provide similar functionality. Familiarity with regular expressions is a handy tool in the toolkit for any writer, software developer or anyone who analyses text often.