Skip to content

Latest commit

 

History

History
484 lines (368 loc) · 12.5 KB

File metadata and controls

484 lines (368 loc) · 12.5 KB

Text Processing


String methods

  • translate string characters
    • str.maketrans() to get translation table
    • translate() to perform the string mapping based on translation table
  • the first argument to maketrans() is string characters to be replaced, the second is characters to replace with and the third is characters to be mapped to None
  • character translation examples
>>> greeting = '===== Have a great day ====='
>>> greeting.translate(str.maketrans('=', '-'))
'----- Have a great day -----'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.translate(str.maketrans('=', '-', '!'))
'----- Have a great day -----'

>>> import string
>>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
>>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
>>> quote.translate(tr_table)
'simplicity is the ultimate sophistication'

>>> sentence = "Thi1s is34 a senten6ce"
>>> sentence.translate(str.maketrans('', '', string.digits))
'This is a sentence'
>>> greeting.translate(str.maketrans('', '', string.punctuation))
' Have a great day '
  • removing leading/trailing/both characters
  • only consecutive characters from start/end string are removed
  • by default whitespace characters are stripped
  • if more than one character is specified, it is treated as a set and all combinations of it are used
>>> greeting = '      Have a nice day :)     '
>>> greeting.strip()
'Have a nice day :)'
>>> greeting.rstrip()
'      Have a nice day :)'
>>> greeting.lstrip()
'Have a nice day :)     '

>>> greeting.strip(') :')
'Have a nice day'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.strip('=')
' Have a great day!! '
  • styling
  • width argument specifies total output string length
>>> ' Hello World '.center(40, '*')
'************* Hello World **************'
  • changing case and case checking
>>> sentence = 'thIs iS a saMple StrIng'

>>> sentence.capitalize()
'This is a sample string'

>>> sentence.title()
'This Is A Sample String'

>>> sentence.lower()
'this is a sample string'

>>> sentence.upper()
'THIS IS A SAMPLE STRING'

>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'

>>> 'good'.islower()
True

>>> 'good'.isupper()
False
  • check if string is made up of numbers
>>> '1'.isnumeric()
True
>>> 'abc1'.isnumeric()
False
>>> '1.2'.isnumeric()
False
  • check if character sequence is present or not
>>> sentence = 'This is a sample string'
>>> 'is' in sentence
True
>>> 'this' in sentence
False
>>> 'This' in sentence
True
>>> 'this' in sentence.lower()
True
>>> 'is a' in sentence
True
>>> 'test' not in sentence
True
  • get number of times character sequence is present (non-overlapping)
>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0

>>> word = 'phototonic'
>>> word.count('oto')
1
  • matching character sequence at start/end of string
>>> sentence
'This is a sample string'

>>> sentence.startswith('This')
True
>>> sentence.startswith('The')
False

>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False
  • split string based on character sequence
  • returns a list
  • to split using regular expressions, use re.split() instead
>>> sentence = 'This is a sample string'

>>> sentence.split()
['This', 'is', 'a', 'sample', 'string']

>>> "oranges:5".split(':') 
['oranges', '5']
>>> "oranges :: 5".split(' :: ') 
['oranges', '5']

>>> "a e i o u".split(' ', maxsplit=1) 
['a', 'e i o u']
>>> "a e i o u".split(' ', maxsplit=2) 
['a', 'e', 'i o u']

>>> line = '{1.0 2.0 3.0}'
>>> nums = [float(s) for s in line.strip('{}').split()]
>>> nums
[1.0, 2.0, 3.0]
  • joining list of strings
>>> str_list
['This', 'is', 'a', 'sample', 'string']
>>> ' '.join(str_list)
'This is a sample string'
>>> '-'.join(str_list)
'This-is-a-sample-string'

>>> c = ' :: '
>>> c.join(str_list)
'This :: is :: a :: sample :: string'
  • replace characters
  • third argument specifies how many times replace has to be performed
  • variable has to be explicitly re-assigned to change its value
>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'

>>> phrase
'2 be or not 2 be'

>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'

>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'

Further Reading


Regular Expressions

  • Handy reference of regular expression elements
Meta characters Description
^ anchor, match from beginning of string
$ anchor, match end of string
. Match any character except newline character \n
| OR operator for matching multiple patterns
() for grouping patterns and also extraction
[] Character class - match one character among many
\^ use \ to match meta characters like ^

Quantifiers Description
* Match zero or more times the preceding character
+ Match one or more times the preceding character
? Match zero or one times the preceding character
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n times but not more than m times

Character classes Description
[aeiou] Match any vowel
[^aeiou] ^ inverts selection, so this matches any consonant
[a-f] Match any of abcdef character
\d Match a digit, same as [0-9]
\D Match non-digit, same as [^0-9] or [^\d]
\w Match alphanumeric and underscore character, same as [a-zA-Z_]
\W Match non-alphanumeric and underscore character, same as [^a-zA-Z_] or [^\w]
\s Match white-space character, same as [\ \t\n\r\f\v]
\S Match non white-space character, same as [^\s]
\b word boundary, word defined as sequence of alphanumeric characters
\B not a word boundary

Compilation Flags Description
re.I ignore case
re.M multiline mode, ^ and $ anchors work on internal lines
re.S singleline mode, . will also match \n
re.V verbose mode, for better readability and adding comments

Variable Description
\1, \2, \3 etc backreferencing matched patterns
\g<1>, \g<2>, \g<3> etc backreferencing matched patterns, useful to differentiate numbers and backreferencing

Pattern matching and extraction

  • matching/extracting sequence of characters
  • use re.search() to see if a string contains a pattern or not
  • use re.findall() to get a list of matching patterns
  • use re.split() to get a list from splitting a string based on a pattern
  • their syntax given below
re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
>>> import re
>>> string = "This is a sample string"

>>> bool(re.search('is', string))
True

>>> bool(re.search('this', string))
False

>>> bool(re.search('this', string, re.I))
True

>>> bool(re.search('T', string))
True

>>> bool(re.search('is a', string))
True

>>> re.findall('i', string)
['i', 'i', 'i']
  • using regular expressions
  • use the r'' format when using regular expression elements
>>> string
'This is a sample string'

>>> re.findall('is', string)
['is', 'is']

>>> re.findall('\bis', string)
[]

>>> re.findall(r'\bis', string)
['is']

>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'sample', 'string']

>>> re.split(r'\s+', string)
['This', 'is', 'a', 'sample', 'string']

>>> re.split(r'\d+', 'Sample123string54with908numbers')
['Sample', 'string', 'with', 'numbers']

>>> re.split(r'(\d+)', 'Sample123string54with908numbers')
['Sample', '123', 'string', '54', 'with', '908', 'numbers']
  • backreferencing
>>> quote = "So many books, so little time"

>>> re.search(r'([a-z]{2,}).*\1', quote, re.I)
<_sre.SRE_Match object; span=(0, 17), match='So many books, so'>

>>> re.search(r'([a-z])\1', quote, re.I)
<_sre.SRE_Match object; span=(9, 11), match='oo'>

>>> re.findall(r'([a-z])\1', quote, re.I)
['o', 't']

Search and Replace

Syntax

re.sub(pattern, repl, string, count=0, flags=0)
  • simple substitutions
  • re.sub will not change value of variable passed to it, has to be explicity assigned
>>> sentence = 'This is a sample string'
>>> re.sub('sample', 'test', sentence)
'This is a test string'

>>> sentence
'This is a sample string'
>>> sentence = re.sub('sample', 'test', sentence)
>>> sentence
'This is a test string'

>>> re.sub('/', '-', '25/06/2016')
'25-06-2016'
>>> re.sub('/', '-', '25/06/2016', count=1)
'25-06/2016'

>>> greeting = '***** Have a great day *****'
>>> re.sub('\*', '=', greeting)
'===== Have a great day ====='
  • backreferencing
>>> words = 'night and day'
>>> re.sub(r'(\w+)( \w+ )(\w+)', r'\3\2\1', words)
'day and night'

>>> line = 'Can you spot the the mistakes? I i seem to not'
>>> re.sub(r'\b(\w+) \1\b', r'\1', line, flags=re.I)
'Can you spot the mistakes? I seem to not'
  • using functions in replace part of re.sub()
>>> import math
>>> numbers = '1 2 3 4 5'

>>> def fact_num(n):
...     return str(math.factorial(int(n.group(1))))
... 
>>> re.sub(r'(\d+)', fact_num, numbers)
'1 2 6 24 120'

>>> re.sub(r'(\d+)', lambda m: str(math.factorial(int(m.group(1)))), numbers)
'1 2 6 24 120'

Compiling Regular Expressions

>>> swap_words = re.compile(r'(\w+)( \w+ )(\w+)')
>>> swap_words
re.compile('(\\w+)( \\w+ )(\\w+)')

>>> words = 'night and day'

>>> swap_words.search(words).group()
'night and day'
>>> swap_words.search(words).group(1)
'night'
>>> swap_words.search(words).group(2)
' and '
>>> swap_words.search(words).group(3)
'day'
>>> swap_words.search(words).group(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

>>> bool(swap_words.search(words))
True
>>> swap_words.findall(words)
[('night', ' and ', 'day')]

>>> swap_words.sub(r'\3\2\1', words)
'day and night'
>>> swap_words.sub(r'\3\2\1', 'yin and yang')
'yang and yin'

Further Reading on Regular Expressions