Shrikant Malviya

Introduction to Regular Expressions

What we will see in this article:
  • What is a regular expression?
  • Some simple examples of regular expressions.

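What is a Regular Expression?

A regular expression is a pattern that describes what to look for when searching through a string of text. In the text below, cat matches once, while at matches inside fat, cat and eat.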

 cat : the fat cat ran down the street.
  at : the fat cat ran down the street.
        it was searching for a mouse to eat.
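The searches above can be tried out in code. The following is a minimal sketch, assuming Python and its standard re module (the article itself does not name a language); the variable names are only illustrative.

```python
import re

# The sample text used in the article, with a newline between the two lines.
text = "the fat cat ran down the street.\nit was searching for a mouse to eat."

# A literal pattern matches exactly those characters, wherever they occur.
cat_matches = re.findall(r"cat", text)
at_matches = re.findall(r"at", text)

print(cat_matches)  # ['cat']
print(at_matches)   # ['at', 'at', 'at']  (inside fat, cat and eat)
```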

Some Examples:

e+ matches one or more e's in a row, so the ee in street is found as a single match:

  e : the fat cat ran down the street.
  e+ : the fat cat ran down the street.

ea? matches 'e' or 'ea', because whatever comes immediately before '?' is optional:

  ea? : the fat cat ran down the street.
      It was searching for a mouse to eat.

re* matches 'r', 're', 'ree' and so on, because '*' means the preceding 'e' may occur zero or more times:

  re* : the fat cat ran down the street.
      It was searching for a mouse to eat.
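The three quantifiers +, ? and * can be compared side by side. This is a small sketch, assuming Python's re module; it only illustrates the patterns discussed above on the same sample text.

```python
import re

text = "the fat cat ran down the street.\nIt was searching for a mouse to eat."

# + : one or more of the preceding character.
e_plus = re.findall(r"e+", text)
# ? : the preceding character is optional.
ea_optional = re.findall(r"ea?", text)
# * : zero or more of the preceding character.
re_star = re.findall(r"re*", text)

print(e_plus)       # ['e', 'e', 'ee', 'e', 'e', 'e']
print(ea_optional)  # ['e', 'e', 'e', 'e', 'ea', 'e', 'ea']
print(re_star)      # ['r', 'ree', 'r', 'r']
```

Note how street gives one match 'ee' for e+ but two separate 'e' matches for ea?.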

.at matches any three-character sequence that ends with at, because "." denotes any single character:

  .at : the fat cat ran down the street.
      It was searching for a mouse to eat.

Multiple periods can be used together, as in t.. (a t followed by any two characters). Note that "." does not match the newline character "\n":

  t.. : the fat cat ran down the street.
      It was searching for a mouse to eat.

To search for a literal period, put the escape character "\" before the period ".":

  \. : The fat cat ran down the street.
      It was searching for a mouse to eat.
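The behaviour of the period, including the newline rule, can be checked with a short sketch. It assumes Python's re module; the re.DOTALL flag used at the end is Python's way of letting "." match "\n" and is not part of the article.

```python
import re

text = "The fat cat ran down the street.\nIt was searching for a mouse to eat."

# "." stands for any single character except a newline.
any_at = re.findall(r".at", text)
t_plus_two = re.findall(r"t..", text)

# With re.DOTALL the period also matches "\n", so "t.\n" becomes a match.
t_plus_two_dotall = re.findall(r"t..", text, flags=re.DOTALL)

# An escaped period matches only a literal ".".
periods = re.findall(r"\.", text)

print(any_at)      # ['fat', 'cat', 'eat']
print(t_plus_two)  # ['t c', 't r', 'the', 'tre', 't w', 'to ']
print(periods)     # ['.', '.']
```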

\w matches any word character (a letter, a digit or the underscore). \W matches anything that is not a word character.

\s matches any kind of whitespace, such as a space, a tab "\t", a carriage return "\r" or a newline "\n". Its opposite, \S, matches anything that is not whitespace.

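\w{a,b} matches a run of at least a and at most b word characters. Surrounded by word boundaries, as in \b\w{a,b}\b, it matches all the words whose length is between a and b.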
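The shorthand classes and the {a,b} quantifier can be illustrated as follows. This is a sketch assuming Python's re module; the word boundary \b is added here as an assumption, to show the difference between matching whole words and matching pieces of longer words.

```python
import re

text = "the fat cat ran down the street."

words = re.findall(r"\w+", text)      # runs of word characters
non_word = re.findall(r"\W", text)    # everything that is not a word character
spaces = re.findall(r"\s", text)      # whitespace only

# Without boundaries, \w{2,3} also cuts pieces out of longer words.
pieces = re.findall(r"\w{2,3}", text)
# With \b on both sides, only whole words of length 2 to 3 match.
short_words = re.findall(r"\b\w{2,3}\b", text)

print(words)        # ['the', 'fat', 'cat', 'ran', 'down', 'the', 'street']
print(pieces)       # ['the', 'fat', 'cat', 'ran', 'dow', 'the', 'str', 'eet']
print(short_words)  # ['the', 'fat', 'cat', 'ran', 'the']
```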

We can match any three-character sequence that starts with f or c and ends with at by using "[fc]at".

We can match any three-character sequence that starts with any letter a-z or A-Z and ends with at by using "[a-zA-Z]at".
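Both character classes can be compared on the sample text. A minimal sketch, assuming Python's re module:

```python
import re

text = "the fat cat ran down the street.\nIt was searching for a mouse to eat."

# [fc] accepts exactly one character, which must be f or c.
f_or_c = re.findall(r"[fc]at", text)
# [a-zA-Z] accepts exactly one ASCII letter of either case.
any_letter = re.findall(r"[a-zA-Z]at", text)

print(f_or_c)      # ['fat', 'cat']
print(any_letter)  # ['fat', 'cat', 'eat']
```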

We can also use parentheses to group several alternatives separated by the or operator "|". For example, (T|t)he matches both The and the.

To select 2 or 3 consecutive characters, each being t, e or r, use (t|e|r){2,3}.

To select 2 or 3 consecutive characters, each being t, e or r, followed by a period ".", use (t|e|r){2,3}\. with the period escaped.

To select the group re occurring 2 to 3 times in a row, use (re){2,3}.
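Grouping, alternation and quantified groups can be sketched as below, assuming Python's re module. The helper full_matches and the string "re rere rerere" are hypothetical, introduced only for this illustration; the helper is needed because re.findall returns the group contents rather than the whole match when a pattern contains a capturing group.

```python
import re

def full_matches(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "The fat cat ran down the street."

the_any_case = full_matches(r"(T|t)he", text)
two_or_three = full_matches(r"(t|e|r){2,3}", text)
before_period = full_matches(r"(t|e|r){2,3}\.", text)

# A made-up string in which the group "re" repeats 1, 2 and 3 times.
repeated_re = full_matches(r"(re){2,3}", "re rere rerere")

print(the_any_case)   # ['The', 'the']
print(two_or_three)   # ['tre', 'et']
print(before_period)  # ['eet.']
print(repeated_re)    # ['rere', 'rerere']
```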

^ anchors a pattern to the very beginning of the entire chunk of text. Similarly, $ anchors a pattern to the very end of the chunk.
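The effect of the two anchors can be checked with a short sketch, assuming Python's re module. The re.MULTILINE flag shown at the end is a Python option, not something the article covers, that makes ^ and $ apply to each line instead of the whole text; the two-line string is made up for the illustration.

```python
import re

text = "the fat cat ran down the street."

anywhere = re.findall(r"the", text)        # both occurrences
at_start = re.findall(r"^the", text)       # only the one at the very beginning
at_end = re.findall(r"street\.$", text)    # only at the very end

# A made-up two-line text: by default ^ only matches at the start of the text.
two_lines = "the fat cat\nthe mouse"
default_mode = re.findall(r"^the", two_lines)
per_line = re.findall(r"^the", two_lines, flags=re.MULTILINE)

print(anywhere)      # ['the', 'the']
print(at_start)      # ['the']
print(at_end)        # ['street.']
print(default_mode)  # ['the']
print(per_line)      # ['the', 'the']
```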

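Positive look-behind. (?<=[tT]he). matches the character that comes right after The or the, without including The or the in the match. Note that the character class is written [tT]: inside brackets "|" is a literal character, so [t|T] would also accept "|".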

Negative look-behind. (?<![tT]he). matches every character that does not come right after The or the, which means every single character except the two spaces that follow The and the.
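Both kinds of look-behind can be sketched as follows, assuming Python's re module, the character class [tT], and the usual syntax (?<=...) for positive and (?<!...) for negative look-behind. Python requires the look-behind pattern to have a fixed width, which [tT]he satisfies.

```python
import re

text = "The fat cat ran down the street."

# Positive look-behind: the character right after "The" or "the".
after_the = re.findall(r"(?<=[tT]he).", text)

# Negative look-behind: every character that is NOT right after "The"/"the".
not_after_the = re.findall(r"(?<![tT]he).", text)

print(after_the)               # [' ', ' ']
print("".join(not_after_the))  # Thefat cat ran down thestreet.
```

Joining the negative matches shows that only the two spaces following The and the are missing.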

Positive look-ahead. .(?=at) matches the character that comes right before at.

Negative look-ahead. .(?!at) matches every character that is not followed by at, that is, everything except the f, c and e that come right before at.
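The two look-ahead patterns can be tried on a short sentence. This is a sketch assuming Python's re module; the sentence is made up so that the result fits on one line.

```python
import re

text = "the fat cat wants to eat."

# Positive look-ahead: a character that is immediately followed by "at".
before_at = re.findall(r".(?=at)", text)

# Negative look-ahead: every character that is NOT followed by "at".
not_before_at = re.findall(r".(?!at)", text)

print(before_at)               # ['f', 'c', 'e']
print("".join(not_before_at))  # the at at wants to at.
```

The e in the still matches the negative pattern, because it is followed by a space rather than by at.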

Some real applications:

Find a 10-digit phone number with \d{10}. To also accept numbers that have dashes in them, use \d{3}-?\d{3}-?\d{4}, or \d{3}[- ]?\d{3}[- ]?\d{4} to accept either dashes or spaces.

Put each part of the phone number in a separate group by using parentheses. Ex. (\d{3})[ -]?(\d{3})[ -]?(\d{4}).
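The grouped phone-number pattern can be exercised as below. This is a sketch assuming Python's re module; the sample numbers are invented for the illustration.

```python
import re

phone = re.compile(r"(\d{3})[ -]?(\d{3})[ -]?(\d{4})")

# Made-up sample numbers: no separators, dashes, spaces.
samples = ["5551234567", "555-123-4567", "555 123 4567"]

# Each pair of parentheses captures one part of the number.
parts = [phone.search(s).groups() for s in samples]

# The plain \d{10} pattern only accepts the form without separators.
plain_only = [bool(re.fullmatch(r"\d{10}", s)) for s in samples]

print(parts[1])    # ('555', '123', '4567')
print(plain_only)  # [True, False, False]
```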

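A specific group can be given a name with the syntax (?<name>...), where name is a label of your choice. Ex. (?<name>\d{3})[ -]?(\d{3})[ -]?(\d{4}). Some engines spell this differently; Python's re module uses (?P<name>...).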
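A named group can be sketched as follows, assuming Python's re module, which writes named groups as (?P<name>...). The group name area and the sample sentence are hypothetical choices for this illustration.

```python
import re

# Only the first group is named; "area" is an illustrative name.
phone = re.compile(r"(?P<area>\d{3})[ -]?(\d{3})[ -]?(\d{4})")

m = phone.search("call 555-123-4567 today")

area_by_name = m.group("area")  # look the group up by its name
area_by_index = m.group(1)      # a named group still has a number
named = m.groupdict()           # only named groups appear here

print(area_by_name)  # 555
print(m.groups())    # ('555', '123', '4567')
print(named)         # {'area': '555'}
```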

Allow optional parentheses around the area code. Ex. \(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4}).

Add the international code as well, wrapping it together with its separator in a non-capturing group (?:____). Ex. (?:(\+1)[ -]?)\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4}).
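The final pattern can be tested as below. This is a sketch assuming Python's re module and invented sample numbers. The second pattern, with a "?" after the non-capturing group, is an assumption added here to show how the international code could be made optional; it is not in the pattern above.

```python
import re

# The pattern as given: the international code +1 is required.
with_code = re.compile(r"(?:(\+1)[ -]?)\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4})")

# Assumed variant: a trailing "?" makes the whole non-capturing group optional.
optional_code = re.compile(r"(?:(\+1)[ -]?)?\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4})")

full = with_code.search("+1 (555) 123-4567").groups()
missing = with_code.search("(555) 123-4567")
local = optional_code.search("(555) 123-4567").groups()

print(full)     # ('+1', '555', '123', '4567')
print(missing)  # None
print(local)    # (None, '555', '123', '4567')
```

The non-capturing group (?:...) bundles the code with its separator without adding a group of its own, so the area code stays in group 2 in both patterns.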