Regular expression types 

This is what confuses a lot of people: many assume that PHP has a single regex engine. In fact, there are two types of regular expressions:

  • POSIX Extended
  • Perl Compatible (PCRE)

ereg(), eregi(), … are the POSIX versions, while preg_match(), preg_replace(), … are the PCRE (Perl Compatible Regular Expressions) versions. With PCRE, the expression must be enclosed in delimiters, for example forward slashes (/). Note that the PCRE versions are more powerful and faster than the POSIX ones, and that the POSIX ereg family was deprecated in PHP 5.3 and removed in PHP 7.0. But more about that later.
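
A minimal sketch of the delimiter rule, assuming PHP 7 or later; the sample strings and variable names are made up for illustration:

```php
<?php
// PCRE patterns are wrapped in delimiters; modifiers follow the closing one.
$subject = 'Hello World';

// Forward slashes as delimiters, with the i (case-insensitive) modifier.
$withSlash = preg_match('/world/i', $subject);

// Any non-alphanumeric, non-backslash, non-whitespace character can act as
// the delimiter; '#' is handy when the pattern itself contains slashes.
$withHash = preg_match('#^https?://#', 'https://www.php.net/');

var_dump($withSlash, $withHash); // int(1) int(1)
```

Picking a delimiter that does not occur in the pattern saves you from escaping it inside the expression.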

PHP's PCRE regex syntax is described in the PHP manual. It is important to know how regular expressions work and which function uses which regex engine.

The functions that use POSIX-compliant rules are:

  • ereg_replace()
  • ereg()
  • eregi()
  • eregi_replace()
  • split()
  • spliti()
  • sql_regcase()
  • mb_ereg_match()
  • mb_ereg_replace()
  • mb_ereg_search_getpos()
  • mb_ereg_search_getregs()
  • mb_ereg_search_init()
  • mb_ereg_search_pos()
  • mb_ereg_search_regs()
  • mb_ereg_search_setpos()
  • mb_ereg_search()
  • mb_ereg()
  • mb_eregi_replace()
  • mb_eregi()
  • mb_regex_encoding()
  • mb_regex_set_options()
  • mb_split()

The functions that use Perl-compatible rules (PCRE) are:

  • preg_grep()
  • preg_replace_callback()
  • preg_match_all()
  • preg_match()
  • preg_quote()
  • preg_split()
  • preg_replace()

Use the right tool for the job. PHP includes a wide range of mb_ functions that support multibyte characters, that is, characters encoded in more than 8 bits.

PHP Regex Operators

Operator    Description
^           Denotes the start of the string.
$           Denotes the end of the string.
.           Denotes almost any single character.
()          Denotes a group of expressions.
[]          Matches any one character from a set, for example [xyz] means x, y or z.
[^]         Matches any character that is not in the set, for example [^abc] means NOT a, b or c.
- (dash)    Denotes a character range inside a set, for example [a-z] means a through z.
| (pipe)    Logical OR, for example x|y means x OR y.
?           Denotes zero or one of the preceding character or group.
*           Denotes zero or more of the preceding character or group.
+           Denotes one or more of the preceding character or group.
{n}         Denotes exactly n repetitions of the preceding character or group, for example a{2}.
{n,}        Denotes at least n repetitions of the preceding character or group, for example a{2,}.
{n,m}       Denotes at least n but not more than m repetitions, for example a{2,4} means 2 to 4 of a.
\           Denotes the escape character.
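
A few of these operators combined, as a rough sketch; the sample patterns and strings are invented for illustration and are not from the original table:

```php
<?php
// Anchors plus a quantifier: the whole string must be 2 to 4 lowercase letters.
$pattern = '/^[a-z]{2,4}$/';

var_dump(preg_match($pattern, 'abc'));   // int(1)
var_dump(preg_match($pattern, 'abcde')); // int(0): five letters is too many
var_dump(preg_match($pattern, 'ab1'));   // int(0): '1' is outside [a-z]

// Alternation inside a group, and ? for an optional character.
var_dump(preg_match('/^(cat|dog)s?$/', 'dogs')); // int(1)

// A backslash escapes the dot so that it only matches a literal '.'.
var_dump(preg_match('/^file\.txt$/', 'fileXtxt')); // int(0)
```
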
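More patterns       Name                   Matches        Does not match
(big) (?=world)     positive lookahead     big world      big and world
(big) (?!world)     negative lookahead     big apple      big world
(?<=big) (world)    positive lookbehind    big world      small world
(?<!big) (world)    negative lookbehind    small world    big world
(?:pattern)         non-capturing group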
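
The lookaround rows can be checked with a short script. This is only a sketch: the $checks and $results arrays are hypothetical names, and the patterns are the ones from the table wrapped in / delimiters:

```php
<?php
// Each entry: pattern, a string it should match, a string it should not match.
$checks = [
    'positive lookahead'  => ['/(big) (?=world)/',  'big world',   'big and world'],
    'negative lookahead'  => ['/(big) (?!world)/',  'big apple',   'big world'],
    'positive lookbehind' => ['/(?<=big) (world)/', 'big world',   'small world'],
    'negative lookbehind' => ['/(?<!big) (world)/', 'small world', 'big world'],
];

$results = [];
foreach ($checks as $name => [$pattern, $hit, $miss]) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $results[$name] = [preg_match($pattern, $hit), preg_match($pattern, $miss)];
    printf("%-20s hit=%d miss=%d\n", $name, $results[$name][0], $results[$name][1]);
}
```

Every line should report hit=1 and miss=0, because a lookaround only tests the neighbouring text and never consumes it.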

For example, (?:\b\w{3}\b) matches every word of length 3, but because the group is non-capturing, those words are not stored as a separate captured group in the output. We can also replace the matched values with some string.

\b stands for a word boundary.
\w matches any word character (equivalent to [a-zA-Z0-9_]).
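
To make the difference visible, here is a small sketch comparing a capturing and a non-capturing group on an invented sample sentence (the variable names are assumptions for this example):

```php
<?php
$text = 'The cat sat on a mat';

// Capturing group: the words appear in index 0 (full match) and index 1 (group).
preg_match_all('/(\b\w{3}\b)/', $text, $capturing);

// Non-capturing group: the same words match, but only index 0 is filled.
preg_match_all('/(?:\b\w{3}\b)/', $text, $nonCapturing);

echo count($capturing), "\n";                 // 2
echo count($nonCapturing), "\n";              // 1
echo implode(',', $nonCapturing[0]), "\n";    // The,cat,sat,mat

// Replacing every three-character word with a placeholder string.
$masked = preg_replace('/\b\w{3}\b/', '***', $text);
echo $masked, "\n";                           // *** *** *** on a ***
```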

Play with your regex here: https://regex101.com/

Predefined Character Classes

Some character classes such as digits, letters, and whitespaces are used so frequently that there are shortcut names for them. The following table lists those predefined character classes:

Shortcut    What it does
.           Matches any single character except newline (\n).
\d          Matches any digit character. Same as [0-9].
\D          Matches any non-digit character. Same as [^0-9].
\s          Matches any whitespace character (space, tab, newline, carriage return or form feed; vertical tab too in recent PCRE versions).
\S          Matches any non-whitespace character.
\w          Matches any word character (defined as a to z, A to Z, 0 to 9, and the underscore). Same as [a-zA-Z_0-9].
\W          Matches any non-word character. Same as [^a-zA-Z_0-9].

POSIX-compliant character classes:

[:upper:]     Matches all uppercase letters.
[:lower:]     Matches all lowercase letters.
[:alpha:]     Matches all letters.
[:alnum:]     Matches all letters and digits.
[:digit:]     Matches all digits.
[:xdigit:]    Matches all hexadecimal digits, equivalent to [0-9a-fA-F].
[:punct:]     Matches all punctuation, for example . , " ' ? ! ; :
[:blank:]     Matches space and tab, equivalent to [ \t].
[:space:]     Matches all whitespace characters, equivalent to [ \t\n\r\f\v].
[:cntrl:]     Matches the ASCII control characters 0 to 31.
[:graph:]     Matches all printable characters except space.
[:print:]     Matches all printable characters, including space.
[.c.]         Collating symbol (POSIX only; not supported by PCRE).
[=c=]         Equivalence class (POSIX only; not supported by PCRE).
[:<:]         Matches the beginning of a word.
[:>:]         Matches the end of a word.

Perl-compatible escape sequences:

\a      Alarm, that is, the BEL character (hex 07).
\cx     "Control-x", where x is any character.
\e      Escape (hex 1B).
\f      Form feed (hex 0C).
\n      Newline (hex 0A).
\r      Carriage return (hex 0D).
\t      Tab (hex 09).
\xhh    Character with hexadecimal code hh.
\ddd    Character with octal code ddd, or a backreference.
\d      Any decimal digit.
\D      Any character that is not a decimal digit.
\s      Any whitespace character.
\S      Any non-whitespace character.
\w      Any "word" character.
\W      Any "non-word" character.
\b      Word boundary.
\B      Not a word boundary.
\A      Start of the subject (independent of multiline mode).
\Z      End of the subject, or a newline at the very end (independent of multiline mode).
\z      End of the subject (independent of multiline mode).
\G      First matching position in the subject.

Special character    Meaning
\n                   Denotes a new line.
\r                   Denotes a carriage return.
\t                   Denotes a tab.
\v                   Denotes a vertical tab.
\f                   Denotes a form feed.
\xxx                 Denotes the octal character xxx.
\xhh                 Denotes the hex character hh.
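
A short sketch of the shortcuts in use. The input string and variable names are invented for this example; note that with PCRE a POSIX class name only works inside a bracket expression, hence the double brackets:

```php
<?php
$input = "Order 66:\tshipped"; // double quotes so that \t becomes a real tab

// \d+ grabs the first run of digits.
preg_match('/\d+/', $input, $digits);                 // $digits[0] is '66'

// \s+ splits on any run of whitespace (here a space and a tab).
$parts = preg_split('/\s+/', $input);                 // ['Order', '66:', 'shipped']

// \W+ removes everything that is not a word character.
$wordsOnly = preg_replace('/\W+/', '', $input);       // 'Order66shipped'

// POSIX classes inside brackets: an uppercase letter followed by lowercase ones.
preg_match('/[[:upper:]][[:lower:]]+/', $input, $capitalised); // 'Order'

print_r([$digits[0], $parts, $wordsOnly, $capitalised[0]]);
```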

Example of use with preg_match().

Demo data: the URL string www.exampe.com/id/1234. We want a regex that gets the value of the id parameter, which in this case is 1234.
The correct lookbehind expression would be ((?<=id/)\d+), but you really shouldn’t use lookbehind unless you need it.

Alternatives to lookbehind and lookahead: you can use a capturing group to catch the specific string, for example preg_match('(id/(\d+))', $url, $matches), without any lookbehind. Here the outer parentheses serve as the pattern delimiters, so the result will be in $matches[1].

Another alternative is (id/\K\d+). \K resets the start of the match and is often used as a more powerful lookbehind, since a lookbehind can only contain very basic expressions. Note that \K is of no use as a substitute for lookahead.
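
The three approaches side by side, as a sketch. It uses ~ as the delimiter instead of parentheses purely to keep the groups easy to read, and the variable names are assumptions for this example:

```php
<?php
$url = 'www.exampe.com/id/1234';

// 1. Lookbehind: "id/" must precede the digits, and the whole match is the number.
preg_match('~(?<=id/)\d+~', $url, $viaLookbehind);

// 2. Capturing group: "id/" is part of the match, the number is in group 1.
preg_match('~id/(\d+)~', $url, $viaGroup);

// 3. \K: "id/" is matched but dropped from the reported result.
preg_match('~id/\K\d+~', $url, $viaK);

echo $viaLookbehind[0], "\n"; // 1234
echo $viaGroup[1], "\n";      // 1234
echo $viaK[0], "\n";          // 1234
```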

PCRE regular expression modifiers

Regex modifiers change the way a regular expression behaves. Quite a few modifiers are available, but the ones we end up using most often are i, m and u. Keep reading to find out what each of them does.

The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers. Spaces and newlines are ignored in modifiers, other characters cause error.

  • i (PCRE_CASELESS)
    If this modifier is set, letters in the pattern match both upper and lower case letters.
  • m (PCRE_MULTILINE)
    By default, PCRE treats the subject string as consisting of a single “line” of characters (even if it actually contains several newlines). The “start of line” metacharacter (^) matches only at the start of the string, while the “end of line” metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the “start of line” and “end of line” constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl’s /m modifier. If there are no “\n” characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.
  • s (PCRE_DOTALL)
    If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl’s /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
  • x (PCRE_EXTENDED)
    If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl’s /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
  • A (PCRE_ANCHORED)
    If this modifier is set, the pattern is forced to be “anchored”, that is, it is constrained to match only at the start of the string which is being searched (the “subject string”). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
  • D (PCRE_DOLLAR_ENDONLY)
    If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
  • S
    When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
  • U (PCRE_UNGREEDY)
    This modifier inverts the “greediness” of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
    Note: It is usually not possible to match more than pcre.backtrack_limit characters in ungreedy mode.
  • X (PCRE_EXTRA)
    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.
  • J (PCRE_INFO_JCHANGED)
    The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns. As of PHP 7.2.0 J is supported as modifier as well.
  • u (PCRE_UTF8)
    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
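
The most commonly used modifiers in action, as a rough sketch with invented sample data and variable names (the "\u{e9}" escape assumes PHP 7 or later):

```php
<?php
$text = "first line\nSecond LINE";

// i: letters match regardless of case.
$caseless = preg_match('/second/i', $text);                 // 1

// m: ^ also matches right after each newline.
preg_match_all('/^\w+/m', $text, $lineStarts);              // first, Second

// s: without it the dot stops at a newline, with it the dot crosses lines.
$withoutS = preg_match('/first.*LINE/', $text);             // 0
$withS    = preg_match('/first.*LINE/s', $text);            // 1

// u: treat pattern and subject as UTF-8, so "é" counts as one character.
$word  = "h\u{e9}llo";
$bytes = preg_match_all('/./', $word);                      // 6 (é is two bytes)
$chars = preg_match_all('/./u', $word);                     // 5

// x: whitespace and # comments inside the pattern are ignored.
$datePattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
/x';
preg_match($datePattern, 'Released 2024-05', $date);        // 2024 and 05

var_dump($caseless, $lineStarts[0], $withoutS, $withS, $bytes, $chars, $date);
```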


Function           Definition
preg_match()       Searches a string for a pattern. It returns 1 if the pattern matches, 0 if it does not, and false on error.
preg_match_all()   Searches a string for all occurrences of a pattern and returns the number of matches found.
ereg_replace()     Searches for a pattern and replaces the matched text with the replacement string, if found.
eregi_replace()    Behaves like ereg_replace(), except that the search is not case sensitive.
preg_replace()     Behaves like ereg_replace(), except that it uses Perl-compatible regular expressions for the pattern and supports references to captured groups in the replacement string.
preg_split()       Behaves like the PHP split() function: it splits a string using a regular expression as the separator.
preg_grep()        Searches an array and returns the elements that match the regular expression pattern.
preg_quote()       Takes a string and puts a backslash in front of every character that is part of the regular expression syntax.
ereg()             Searches a string for a pattern and returns true if found, otherwise false.
eregi()            Behaves like ereg(), except that the search is not case sensitive.
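
A quick tour of the PCRE functions from the table, as a sketch with invented sample data and variable names:

```php
<?php
// preg_match_all(): returns the number of matches; the matches land in $all[0].
$count = preg_match_all('/\d+/', 'a1 b22 c333', $all);

// preg_split(): split on a comma with optional surrounding whitespace.
$fields = preg_split('/\s*,\s*/', 'red , green,blue');

// preg_grep(): keep only the array elements that match; keys are preserved.
$numeric = preg_grep('/^\d+$/', ['10', 'x', '42']);

// preg_replace(): $1 and $2 refer to the captured groups.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

// preg_quote(): escape regex syntax so user input is matched literally.
// The second argument names the delimiter, which is escaped as well.
$literal = preg_quote('1+1=2', '/');
$found   = preg_match('/' . $literal . '/', 'so 1+1=2 holds');

var_dump($count, $all[0], $fields, $numeric, $swapped, $found);
```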

How to use preg_replace_callback() function – example:

<?php

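// Replace only the $nth match of $pattern in $subject.
function preg_replace_nth($pattern, $replacement, $subject, $nth = 1) {
    $remaining = $nth; // matches still to be skipped, counted down in the callback
    return preg_replace_callback(
        $pattern,
        function ($found) use ($pattern, $replacement, &$remaining) {
            $remaining--;
            // Only the nth match is rewritten; earlier ones are returned unchanged.
            if ($remaining === 0) {
                return preg_replace($pattern, $replacement, $found[0]);
            }
            return $found[0];
        },
        $subject,
        $nth // limit: stop after the nth match, so later matches are never visited
    );
}

// Prints: |aa|b|cc|dd is the 4th|e|ff|gg|kkk|
echo preg_replace_nth('/(\w+)\|/', '${1} is the 4th|', "|aa|b|cc|dd|e|ff|gg|kkk|", 4);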

// src: https://www.php.net/manual/en/function.preg-replace.php#112400

PHP Docs: https://www.php.net/manual/en/book.pcre.php
Perl-alike but not the same: https://www.php.net/manual/en/reference.pcre.pattern.differences.php
Look behind: https://stackoverflow.com/questions/8837676/how-to-use-regex-look-behind
PCRE Regex CheatSheet:
POSIX Cheat Sheet: https://finchan.files.wordpress.com/2010/02/reg.png
Difference between PCRE and POSIX: https://topic.alibabacloud.com/a/parsing-posix-vs-perl-standard-regular-expression-differences-_php-tutorial_4_86_30966984.html
