# Text Processing * [String methods](#string-methods) * [Regular Expressions](#regular-expressions) * [Pattern matching and extraction](#pattern-matching-and-extraction) * [Search and Replace](#search-and-replace) * [Compiling Regular Expressions](#compiling-regular-expressions) * [Further Reading on Regular Expressions](#further-reading-on-regular-expressions)
### String methods * translate string characters * `str.maketrans()` to get translation table * `translate()` to perform the string mapping based on translation table * the first argument to `maketrans()` is string characters to be replaced, the second is characters to replace with and the third is characters to be mapped to `None` * [character translation examples](https://stackoverflow.com/questions/555705/character-translation-using-python-like-the-tr-command) ```python >>> greeting = '===== Have a great day =====' >>> greeting.translate(str.maketrans('=', '-')) '----- Have a great day -----' >>> greeting = '===== Have a great day!! =====' >>> greeting.translate(str.maketrans('=', '-', '!')) '----- Have a great day -----' >>> import string >>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION' >>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase) >>> quote.translate(tr_table) 'simplicity is the ultimate sophistication' >>> sentence = "Thi1s is34 a senten6ce" >>> sentence.translate(str.maketrans('', '', string.digits)) 'This is a sentence' >>> greeting.translate(str.maketrans('', '', string.punctuation)) ' Have a great day ' ``` * removing leading/trailing/both characters * only consecutive characters from start/end string are removed * by default whitespace characters are stripped * if more than one character is specified, it is treated as a set and all combinations of it are used ```python >>> greeting = ' Have a nice day :) ' >>> greeting.strip() 'Have a nice day :)' >>> greeting.rstrip() ' Have a nice day :)' >>> greeting.lstrip() 'Have a nice day :) ' >>> greeting.strip(') :') 'Have a nice day' >>> greeting = '===== Have a great day!! =====' >>> greeting.strip('=') ' Have a great day!! ' ``` * styling * width argument specifies total output string length ```python >>> ' Hello World '.center(40, '*') '************* Hello World **************' ``` * changing case and case checking ```python >>> sentence = 'thIs iS a saMple StrIng' >>> sentence.capitalize() 'This is a sample string' >>> sentence.title() 'This Is A Sample String' >>> sentence.lower() 'this is a sample string' >>> sentence.upper() 'THIS IS A SAMPLE STRING' >>> sentence.swapcase() 'THiS Is A SAmPLE sTRiNG' >>> 'good'.islower() True >>> 'good'.isupper() False ``` * check if string is made up of numbers ```python >>> '1'.isnumeric() True >>> 'abc1'.isnumeric() False >>> '1.2'.isnumeric() False ``` * check if character sequence is present or not ```python >>> sentence = 'This is a sample string' >>> 'is' in sentence True >>> 'this' in sentence False >>> 'This' in sentence True >>> 'this' in sentence.lower() True >>> 'is a' in sentence True >>> 'test' not in sentence True ``` * get number of times character sequence is present (non-overlapping) ```python >>> sentence = 'This is a sample string' >>> sentence.count('is') 2 >>> sentence.count('w') 0 >>> word = 'phototonic' >>> word.count('oto') 1 ``` * matching character sequence at start/end of string ```python >>> sentence 'This is a sample string' >>> sentence.startswith('This') True >>> sentence.startswith('The') False >>> sentence.endswith('ing') True >>> sentence.endswith('ly') False ``` * split string based on character sequence * returns a list * to split using regular expressions, use `re.split()` instead ```python >>> sentence = 'This is a sample string' >>> sentence.split() ['This', 'is', 'a', 'sample', 'string'] >>> "oranges:5".split(':') ['oranges', '5'] >>> "oranges :: 5".split(' :: ') ['oranges', '5'] >>> "a e i o u".split(' ', maxsplit=1) ['a', 'e i o u'] >>> "a e i o u".split(' ', maxsplit=2) ['a', 'e', 'i o u'] >>> line = '{1.0 2.0 3.0}' >>> nums = [float(s) for s in line.strip('{}').split()] >>> nums [1.0, 2.0, 3.0] ``` * joining list of strings ```python >>> str_list ['This', 'is', 'a', 'sample', 'string'] >>> ' '.join(str_list) 'This is a sample string' >>> '-'.join(str_list) 'This-is-a-sample-string' >>> c = ' :: ' >>> c.join(str_list) 'This :: is :: a :: sample :: string' ``` * replace characters * third argument specifies how many times replace has to be performed * variable has to be explicitly re-assigned to change its value ```python >>> phrase = '2 be or not 2 be' >>> phrase.replace('2', 'to') 'to be or not to be' >>> phrase '2 be or not 2 be' >>> phrase.replace('2', 'to', 1) 'to be or not 2 be' >>> phrase = phrase.replace('2', 'to') >>> phrase 'to be or not to be' ``` **Further Reading** * [Python docs - string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) * [python string methods tutorial](http://www.thehelloworldprogram.com/python/python-string-methods/)
### Regular Expressions * Handy reference of regular expression (RE) elements | Meta characters | Description | | ------------- | ----------- | | `\A` | anchors matching to beginning of string | | `\Z` | anchors matching to end of string | | `^` | anchors matching to beginning of line | | `$` | anchors matching to end of line | | `.` | Match any character except newline character `\n` | | | | OR operator for matching multiple patterns | | `(RE)` | capturing group | | `(?:RE)` | non-capturing group | | `[]` | Character class - match one character among many | | `\^` | prefix `\` to literally match meta characters like `^` |
| Greedy Quantifiers | Description | | ------------- | ----------- | | `*` | Match zero or more times | | `+` | Match one or more times | | `?` | Match zero or one times | | `{m,n}` | Match `m` to `n` times (inclusive) | | `{m,}` | Match at least m times | | `{,n}` | Match up to `n` times (including `0` times) | | `{n}` | Match exactly n times | Appending a `?` to greedy quantifiers makes them non-greedy
| Character classes | Description | | ------------- | ----------- | | `[aeiou]` | Match any vowel | | `[^aeiou]` | `^` inverts selection, so this matches any consonant | | `[a-f]` | `-` defines a range, so this matches any of abcdef characters | | `\d` | Match a digit, same as `[0-9]` | | `\D` | Match non-digit, same as `[^0-9]` or `[^\d]` | | `\w` | Match alphanumeric and underscore character, same as `[a-zA-Z0-9_]` | | `\W` | Match non-alphanumeric and underscore character, same as `[^a-zA-Z0-9_]` or `[^\w]` | | `\s` | Match white-space character, same as `[\ \t\n\r\f\v]` | | `\S` | Match non white-space character, same as `[^\s]` | | `\b` | word boundary, see `\w` for characters constituting a word | | `\B` | not a word boundary |
| Flags | Description | | ------------- | ----------- | | `re.I` | Ignore case | | `re.M` | Multiline mode, `^` and `$` anchors work on lines | | `re.S` | Singleline mode, `.` will also match `\n` | | `re.V` | Verbose mode, for better readability and adding comments | See [Python docs - Compilation Flags](https://docs.python.org/3/howto/regex.html#compilation-flags) for more details and long names for flags
| Variable | Description | | ------------- | ----------- | | `\1`, `\2`, `\3` ... `\99` | backreferencing matched patterns | | `\g<1>`, `\g<2>`, `\g<3>` ... | backreferencing matched patterns, prevents ambiguity | | `\g<0>` | entire matched portion |
### Pattern matching and extraction To match/extract sequence of characters, use * `re.search()` to see if input string contains a pattern or not * `re.findall()` to get a list of all matching patterns * `re.split()` to get a list from splitting input string based on a pattern Their syntax is as follows: ```python re.search(pattern, string, flags=0) re.findall(pattern, string, flags=0) re.split(pattern, string, maxsplit=0, flags=0) ``` * As a good practice, always use **raw strings** to construct RE, unless other formats are required * this will avoid clash of backslash escaping between RE and normal quoted strings * examples for `re.search` ```python >>> sentence = 'This is a sample string' # using normal string methods >>> 'is' in sentence True >>> 'xyz' in sentence False # need to load the re module before use >>> import re # check if 'sentence' contains the pattern described by RE argument >>> bool(re.search(r'is', sentence)) True >>> bool(re.search(r'this', sentence, flags=re.I)) True >>> bool(re.search(r'xyz', sentence)) False ``` * examples for `re.findall` ```python # match whole word par with optional s at start and e at end >>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare') ['par', 'spar', 'spare', 'pare'] # numbers >= 100 with optional leading zeros >>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234') ['0501', '154', '98234'] # if multiple capturing groups are used, each element of output # will be a tuple of strings of all the capture groups >>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y') [('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')] # normal capture group will hinder ability to get whole match # non-capturing group to the rescue >>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against') ['cost', 'akin', 'east', 'against'] # useful for debugging purposes as well before applying substitution >>> re.findall(r't.*?a', 'that is quite a fabricated tale') ['tha', 't is quite a', 'ted ta'] ``` * examples for `re.split` ```python # split based on one or more digit characters >>> re.split(r'\d+', 'Sample123string42with777numbers') ['Sample', 'string', 'with', 'numbers'] # split based on digit or whitespace characters >>> re.split(r'[\d\s]+', '**1\f2\n3star\t7 77\r**') ['**', 'star', '**'] # to include the matching delimiter strings as well in the output >>> re.split(r'(\d+)', 'Sample123string42with777numbers') ['Sample', '123', 'string', '42', 'with', '777', 'numbers'] # use non-capturing group if capturing is not needed >>> re.split(r'hand(?:y|ful)', '123handed42handy777handful500') ['123handed42', '777', '500'] ``` * backreferencing ```python # whole words that have at least one consecutive repeated character >>> words = ['effort', 'flee', 'facade', 'oddball', 'rat', 'tool'] >>> [w for w in words if re.search(r'\b\w*(\w)\1\w*\b', w)] ['effort', 'flee', 'oddball', 'tool'] ``` * The `re.search` function returns a `re.Match` object from which various details can be extracted like the matched portion of string, location of matched portion, etc * **Note** that output here is shown for Python version **3.7** ```python >>> re.search(r'b.*d', 'abc ac adc abbbc') # retrieving entire matched portion >>> re.search(r'b.*d', 'abc ac adc abbbc')[0] 'bc ac ad' # capture group example >>> m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc') # to get matched portion of second capture group >>> m[2] 'c a' # to get a tuple of all the capture groups >>> m.groups() ('bc ac a', 'c a') ```
### Search and Replace **Syntax** ```python re.sub(pattern, repl, string, count=0, flags=0) ``` * examples * **Note** that as strings are immutable, `re.sub` will not change value of variable passed to it, has to be explicity assigned ```python >>> ip_lines = "catapults\nconcatenate\ncat" >>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M)) * catapults * concatenate * cat # replace 'par' only at start of word >>> re.sub(r'\bpar', r'X', 'par spar apparent spare part') 'X spar apparent spare Xt' # same as: r'part|parrot|parent' >>> re.sub(r'par(en|ro)?t', r'X', 'par part parrot parent') 'par X X X' # remove first two columns where : is delimiter >>> re.sub(r'\A([^:]+:){2}', r'', 'foo:123:bar:baz', count=1) 'bar:baz' ``` * backreferencing ```python # remove any number of consecutive duplicate words separated by space # quantifiers can be applied to backreferences too! >>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'a a a walking for for a cause') 'a walking for a cause' # add something around the matched strings >>> re.sub(r'\d+', r'(\g<0>0)', '52 apples and 31 mangoes') '(520) apples and (310) mangoes' # swap words that are separated by a comma >>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'a,b 42,24') 'b,a 24,42' ``` * using functions in replace part of `re.sub()` * **Note** that Python version **3.7** is used here ```python >>> from math import factorial >>> numbers = '1 2 3 4 5' >>> def fact_num(n): ... return str(factorial(int(n[0]))) ... >>> re.sub(r'\d+', fact_num, numbers) '1 2 6 24 120' # using lambda >>> re.sub(r'\d+', lambda m: str(factorial(int(m[0]))), numbers) '1 2 6 24 120' ``` * [call functions from re.sub](https://stackoverflow.com/questions/11944978/call-functions-from-re-sub) * [replace string pattern with output of function](https://stackoverflow.com/questions/12597370/python-replace-string-pattern-with-output-of-function) * [lambda tutorial](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/)
### Compiling Regular Expressions * Regular expressions can be compiled using `re.compile` function, which gives back a `re.Pattern` object * The top level `re` module functions are all available as methods for this object * Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit) * By default, Python maintains a small list of recently used RE, so the speed benefit doesn't apply for trivial use cases ```python >>> pet = re.compile(r'dog') >>> type(pet) >>> bool(pet.search('They bought a dog')) True >>> bool(pet.search('A cat crossed their path')) False >>> remove_parentheses = re.compile(r'\([^)]*\)') >>> remove_parentheses.sub('', 'a+b(addition) - foo() + c%d(#modulo)') 'a+b - foo + c%d' >>> remove_parentheses.sub('', 'Hi there(greeting). Nice day(a(b)') 'Hi there. Nice day' ```
### Further Reading on Regular Expressions * [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) - a book on regular expressions * [Python docs - re module](https://docs.python.org/3/library/re.html) * [Python docs - introductory tutorial to using regular expressions](https://docs.python.org/3/howto/regex.html) * [Comprehensive reference: What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) * [rexegg](https://www.rexegg.com/) - tutorials, tricks and more * [regular-expressions](https://www.regular-expressions.info/) - tutorials and tools * [CommonRegex](https://github.com/madisonmay/CommonRegex) - collection of common regular expressions * Practice tools * [regex101](https://regex101.com/) - visual aid and online testing tool for regular expressions, select flavor as Python before use * [regexone](https://regexone.com/) - interative tutorial * [cheatsheet](https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet) - one can also learn it [interactively](https://www.shortcutfoo.com/app/dojos/python-regex) * [regexcrossword](https://regexcrossword.com/) - practice by solving crosswords, read 'How to play' section before you start