RegEx basics

Quantifiers

? = zero or one

+ = one or more

* = zero or more

{3} = exactly 3 of the preceding character or group

{3,} = 3 or more

{1,3} = between 1 and 3 of the preceding character or group

An example using {1,3}:
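
A minimal sketch in Python, assuming the standard re module; the sample text is made up for illustration. It shows that {1,3} is greedy: a run of four a's is split into a match of three followed by a match of one.

```python
import re

# Hypothetical sample text, chosen only to illustrate the quantifier.
text = "a aa aaa aaaa"

# a{1,3} matches between one and three consecutive "a" characters,
# taking as many as it can (greedy) before moving on.
matches = re.findall(r"a{1,3}", text)
print(matches)  # ['a', 'aa', 'aaa', 'aaa', 'a']
```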


Collections and Negation

[ ] = any one of the characters listed inside the brackets

[^] = a caret (^) at the start of a collection means any character BUT the ones listed

So this regular expression means any character BUT the lowercase letters a-z or the digit 4:

[^a-z4]

So let’s use this to match a simple, properly punctuated sentence (a capital letter, then text, then a closing ., ? or !) with the following:

[A-Z][^\.?!]+[\.?!]

Explanation:

  • [A-Z] = exactly one capital letter A-Z
  • [^   ]+ = negation, one or more of any characters NOT in this collection
  • \.?! = the literal . character, the ? character, or the ! character

Result:
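
A minimal sketch in Python of what the pattern picks out, assuming the standard re module; the sample text is invented for illustration. Text that does not start with a capital letter, or that never reaches a closing ., ? or !, is skipped.

```python
import re

# The pattern from the text, unchanged.
sentence = re.compile(r"[A-Z][^\.?!]+[\.?!]")

# Hypothetical sample text.
text = "Hello there! is this matched? Yes it is. no capital here"

found = sentence.findall(text)
print(found)  # ['Hello there!', 'Yes it is.']
```

Because the negated collection also matches new line characters, a sentence that wraps across lines is still matched as one piece.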


Whitespace Characters

\t = tab

\n = new line

\r = carriage return

\f = form feed

\v = vertical tab

New lines from Windows have two characters: \r\n
New lines for macOS/Linux: \n
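
A minimal sketch in Python of handling both line-ending styles at once, assuming the standard re module; the sample string is made up. Making \r optional with ? lets one pattern split Windows and macOS/Linux lines.

```python
import re

# Hypothetical text mixing a Windows line ending (\r\n) and a Unix one (\n).
text = "first\r\nsecond\nthird"

# \r? = an optional carriage return, followed by a mandatory new line.
lines = re.split(r"\r?\n", text)
print(lines)  # ['first', 'second', 'third']
```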

Anchoring

You can anchor expressions to the start or the end of a string:

^ = anchor at the start of the string

$ = anchor at the end of the string

Anchoring at the start:
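
A minimal sketch in Python, assuming the standard re module; the sample strings are invented. Only the string that begins with "error" is kept.

```python
import re

# Hypothetical log-style strings.
lines = ["error: disk full", "no error here"]

# ^error only matches when "error" sits at the very start of the string.
starts = [s for s in lines if re.search(r"^error", s)]
print(starts)  # ['error: disk full']
```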

Anchoring at the end:
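
A minimal sketch in Python, assuming the standard re module; the file names are invented. Only the name that finishes with ".log" is kept.

```python
import re

# Hypothetical file names.
names = ["report.log", "report.log.bak"]

# \.log$ only matches when ".log" is the last thing in the string.
ends = [s for s in names if re.search(r"\.log$", s)]
print(ends)  # ['report.log']
```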


Character classes

Single tokens that each stand for a whole set of related characters

. = any character except a new line

\s = any kind of whitespace

\S = the inverse of \s, or any kind of character that is NOT whitespace

\d = any digit character [0-9]

\D = any non-digit character [^0-9]

\w = word character [0-9A-Za-z_]

\W = any non-word character [^0-9A-Za-z_]

\b = word boundary (a \w on one side and a \W, or the start or end of the string, on the other)

\B = not a word boundary
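
A minimal sketch in Python of a few of these classes, assuming the standard re module; the sample text is made up. Note that \bcat\b does not match inside "concat" or "cat_1", because the underscore counts as a word character.

```python
import re

# Hypothetical sample text.
text = "cat concat cat_1 42"

digits = re.findall(r"\d+", text)          # runs of digit characters
words = re.findall(r"\w+", text)           # runs of word characters
whole_cat = re.findall(r"\bcat\b", text)   # "cat" as a standalone word

print(digits)     # ['1', '42']
print(words)      # ['cat', 'concat', 'cat_1', '42']
print(whole_cat)  # ['cat']
```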

So to combine anchors with character classes, let’s find a string that starts with a word character and ends with anything that ISN’T a digit:

^\w.*\D$

Result:
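
A minimal sketch in Python, assuming the standard re module; the sample strings are invented. Each string is checked against the pattern from the text.

```python
import re

# The pattern from the text, unchanged.
pattern = re.compile(r"^\w.*\D$")

# Hypothetical sample strings.
samples = ["abc123x", "abc123", " leading space", "9 lives!"]

results = {s: bool(pattern.search(s)) for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
# 'abc123x' True          (starts with a word character, ends with a non-digit)
# 'abc123' False          (ends with a digit)
# ' leading space' False  (starts with whitespace)
# '9 lives!' True         (a digit counts as a word character)
```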


RegEx Examples (useful for log file searching)

Search for everything up to, but not including “abc”

.*?(?=abc)

Result:
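
A minimal sketch in Python, assuming the standard re module; the log line is invented. The lazy .*? stops at the first position where the lookahead (?=abc) succeeds, and the lookahead itself consumes nothing, so "abc" is not part of the match.

```python
import re

# Hypothetical log line containing "abc" twice.
line = "GET /index.html abc 200 abc"

m = re.search(r".*?(?=abc)", line)
before = m.group(0)
print(repr(before))  # 'GET /index.html '
```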

Remove all leading whitespace at the beginning of lines

Find:

^\s*(\w.*)$

Replace with:

\1

Explanation:

  • ^ – Beginning of the line
  • \s – A whitespace character
  • * – 0 or more of them
  • ( – Begin a capture group
  • \w – A word character ([A-Za-z0-9_])
  • . – Any character
  • * – 0 or more of them
  • ) – End our capture group
  • $ – End of the line
  • \1 – The contents of the first capture group (Our word character and all characters after that)
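
The find and replace above can be sketched in Python as follows, assuming the standard re module; the sample text is made up. The re.MULTILINE flag is an assumption about how the editor applies the pattern: it makes ^ and $ match at the start and end of every line rather than only at the ends of the whole string.

```python
import re

# Hypothetical text with leading spaces and a leading tab.
text = "   alpha\n\tbeta\ngamma"

# Same find pattern and \1 replacement as in the text, applied per line.
cleaned = re.sub(r"^\s*(\w.*)$", r"\1", text, flags=re.MULTILINE)
print(repr(cleaned))  # 'alpha\nbeta\ngamma'
```

Since \s also matches new line characters, blank lines that sit directly above an indented line are consumed too; using [ \t]* in place of \s* limits the match to spaces and tabs.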