POSIX Extended Regular Expressions in PHP

Regular Expressions

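I studied regular expressions in a Theory of Computation course but never actually used them. This time, the JavaScript and PHP tutorials I was reading both covered regular expressions, so I finally picked up a little of how to use the tool.
Regular expressions make string processing quick and convenient. As the book says, they are simple to use, but writing one is not easy.
PHP supports two regular expression notations: POSIX Extended and PCRE. The book only covers the simpler POSIX Extended notation.
In the theory of computation, every regular expression denotes a language; in a program, the regular expression is called a pattern. The following is an example of a pattern: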

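The regular expression above has four kinds of components:

  1. Literals
    Matched exactly, such as pattern in the expression above.
  2. Metacharacters
    Symbols with a special meaning, such as ^ \ ( ) + ? $ in the expression above.
    The tables below list the available metacharacters (including quantifiers), taken from the Wikipedia article "Regular expression".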

POSIX Basic Regular Expressions

<table>
<tr>
  <th>Metacharacter</th>
  <th>Description</th>
</tr>
<tr>
  <td><code>.</code></td>
  <td>Matches any single character (many applications exclude <a title="Newline" href="//en.wikipedia.org/wiki/Newline">newlines</a>, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, <code>a.c</code> matches &#8220;<em>abc</em>&#8221;, etc., but <code>[a.c]</code> matches only &#8220;<em>a</em>&#8221;, &#8220;<em>.</em>&#8221;, or &#8220;<em>c</em>&#8221;.</td>
</tr>
<tr>
  <td><code>[ ]</code></td>
  <td>A bracket expression. Matches a single character that is contained within the brackets. For example, <code>[abc]</code> matches &#8220;<em>a</em>&#8221;, &#8220;<em>b</em>&#8221;, or &#8220;<em>c</em>&#8221;. <code>[a-z]</code> specifies a range which matches any lowercase letter from &#8220;<em>a</em>&#8221; to &#8220;<em>z</em>&#8221;. These forms can be mixed: <code>[abcx-z]</code> matches &#8220;<em>a</em>&#8221;, &#8220;<em>b</em>&#8221;, &#8220;<em>c</em>&#8221;, &#8220;<em>x</em>&#8221;, &#8220;<em>y</em>&#8221;, or &#8220;<em>z</em>&#8221;, as does <code>[a-cx-z]</code>. The <code>-</code> character is treated as a literal character if it is the last or the first (after the <code>^</code>) character within the brackets: <code>[abc-]</code>, <code>[-abc]</code>. Note that backslash escapes are not allowed. The <code>]</code> character can be included in a bracket expression if it is the first (after the <code>^</code>) character: <code>[]abc]</code>.</td>
</tr>
<tr>
  <td><code>[^ ]</code></td>
  <td>Matches a single character that is not contained within the brackets. For example, <code>[^abc]</code> matches any character other than &#8220;<em>a</em>&#8221;, &#8220;<em>b</em>&#8221;, or &#8220;<em>c</em>&#8221;. <code>[^a-z]</code> matches any single character that is not a lowercase letter from &#8220;<em>a</em>&#8221; to &#8220;<em>z</em>&#8221;. Likewise, literal characters and ranges can be mixed.</td>
</tr>
<tr>
  <td><code>^</code></td>
  <td>Matches the starting position within the string. In line-based tools, it matches the starting position of any line.</td>
</tr>
<tr>
  <td><code>$</code></td>
  <td>Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.</td>
</tr>
<tr>
  <td><code>\( \)</code></td>
  <td>Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, <code>\<em>n</em></code>). A marked subexpression is also called a block or capturing group.</td>
</tr>
<tr>
  <td><code>\<em>n</em></code></td>
  <td>Matches what the <em>n</em>th marked subexpression matched, where <em>n</em> is a digit from 1 to 9. This construct is theoretically <strong>irregular</strong> and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.</td>
</tr>
<tr>
  <td><code>*</code></td>
  <td>Matches the preceding element zero or more times. For example, <code>ab*c</code> matches &#8220;<em>ac</em>&#8221;, &#8220;<em>abc</em>&#8221;, &#8220;<em>abbbc</em>&#8221;, etc. <code>[xyz]*</code> matches &#8220;&#8221;, &#8220;<em>x</em>&#8221;, &#8220;<em>y</em>&#8221;, &#8220;<em>z</em>&#8221;, &#8220;<em>zx</em>&#8221;, &#8220;<em>zyx</em>&#8221;, &#8220;<em>xyzzy</em>&#8221;, and so on. <code>\(ab\)*</code> matches &#8220;&#8221;, &#8220;<em>ab</em>&#8221;, &#8220;<em>abab</em>&#8221;, &#8220;<em>ababab</em>&#8221;, and so on.</td>
</tr>
<tr>
  <td><code>\{<em>m</em>,<em>n</em>\}</code></td>
  <td>Matches the preceding element at least <em>m</em> and not more than <em>n</em> times. For example, <code>a\{3,5\}</code> matches only &#8220;<em>aaa</em>&#8221;, &#8220;<em>aaaa</em>&#8221;, and &#8220;<em>aaaaa</em>&#8221;. This is not found in a few older instances of regular expressions.</td>
</tr>
</table>
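The descriptions above can be tried out directly in PHP. The following is a minimal sketch rather than anything from the book: it assumes a current PHP version, where `preg_match()` (PCRE) is the available matching function, and the variable names `$basicCases` and `$basicResults` are made up for illustration. Note that PCRE, like POSIX ERE, writes groups and intervals without backslashes (`( )` and `{m,n}`), whereas the basic syntax in the table uses `\( \)` and `\{m,n\}`.

```php
<?php
// Each entry: [pattern, subject, expected preg_match() result (1 = match, 0 = no match)].
$basicCases = [
    ['/a.c/',      'abc',   1], // the dot matches any single character
    ['/[a.c]/',    'b',     0], // inside brackets the dot is a literal dot
    ['/[^abc]/',   'abc',   0], // negated bracket: the subject has only a, b, c
    ['/^ab*c$/',   'abbbc', 1], // anchors plus "zero or more" b
    ['/^ab*c$/',   'xabc',  0], // ^ forces the match to begin at the start
    ['/^a{3,5}$/', 'aaaa',  1], // between three and five repetitions
    ['/^(ab)\1$/', 'abab',  1], // \1 recalls what group 1 matched
];

$basicResults = [];
foreach ($basicCases as [$pattern, $subject, $expected]) {
    $found = preg_match($pattern, $subject);
    $basicResults[] = $found;
    printf("%-12s %-6s => %d (expected %d)\n", $pattern, $subject, $found, $expected);
}
```

Single-quoted PHP strings are used so that `\1` reaches the regex engine unchanged.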

POSIX Extended Regular Expressions

<table>
<tr>
  <th>Metacharacter</th>
  <th>Description</th>
</tr>
<tr>
  <td><code>?</code></td>
  <td>Matches the preceding element zero or one time. For example, <code>ba?</code> matches &#8220;<em>b</em>&#8221; or &#8220;<em>ba</em>&#8221;.</td>
</tr>
<tr>
  <td><code>+</code></td>
  <td>Matches the preceding element one or more times. For example, <code>ba+</code> matches &#8220;<em>ba</em>&#8221;, &#8220;<em>baa</em>&#8221;, &#8220;<em>baaa</em>&#8221;, and so on.</td>
</tr>
<tr>
  <td><code>|</code></td>
  <td>The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, <code>abc|def</code> matches &#8220;<em>abc</em>&#8221; or &#8220;<em>def</em>&#8221;.</td>
</tr>
</table>
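The three extended operators can be checked the same way. This is only an illustrative sketch: it assumes `preg_match()` is used for matching, and `$ereCases` and `$ereResults` are hypothetical names. The anchors `^` and `$` are added so that each pattern has to cover the whole subject, which makes the non-matching cases visible.

```php
<?php
// Each entry: [pattern, subject, expected preg_match() result].
$ereCases = [
    ['/^ba?$/',       'b',    1], // ? allows the "a" to be absent
    ['/^ba?$/',       'baa',  0], // ... but not to appear twice
    ['/^ba+$/',       'baaa', 1], // + needs one or more "a"
    ['/^ba+$/',       'b',    0], // zero occurrences is not enough for +
    ['/^(abc|def)$/', 'def',  1], // | picks either alternative
    ['/^(abc|def)$/', 'abd',  0], // a mix of both alternatives does not match
];

$ereResults = [];
foreach ($ereCases as [$pattern, $subject, $expected]) {
    $found = preg_match($pattern, $subject);
    $ereResults[] = $found;
    printf("%-15s %-5s => %d (expected %d)\n", $pattern, $subject, $found, $expected);
}
```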

 
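  3. Quantifiers
    Specify how many times a part (or pattern) may occur, such as { } + * in the expression above.
  4. Character classes
    Placing characters inside square brackets [] creates a character class, which describes a kind of character, such as [0-9] and [a-z] in the expression above.
    The table below lists the available character classes, taken from the Wikipedia article "Regular expression".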

<table>
<tr>
  <th>POSIX</th>
  <th>Non-standard</th>
  <th>Perl</th>
  <th>ASCII</th>
  <th>Description</th>
</tr>
<tr>
  <td><code>[:alnum:]</code></td>
  <td></td>
  <td></td>
  <td><code>[A-Za-z0-9]</code></td>
  <td>Alphanumeric characters</td>
</tr>
<tr>
  <td></td>
  <td><code>[:word:]</code></td>
  <td><code>\w</code></td>
  <td><code>[A-Za-z0-9_]</code></td>
  <td>Alphanumeric characters plus &#8220;_&#8221;</td>
</tr>
<tr>
  <td></td>
  <td></td>
  <td><code>\W</code></td>
  <td><code>[^A-Za-z0-9_]</code></td>
  <td>Non-word characters</td>
</tr>
<tr>
  <td><code>[:alpha:]</code></td>
  <td></td>
  <td></td>
  <td><code>[A-Za-z]</code></td>
  <td>Alphabetic characters</td>
</tr>
<tr>
  <td><code>[:blank:]</code></td>
  <td></td>
  <td></td>
  <td><code>[ \t]</code></td>
  <td>Space and tab</td>
</tr>
<tr>
  <td></td>
  <td></td>
  <td><code>\b</code></td>
  <td><code>(?&lt;=\W)(?=\w)|(?&lt;=\w)(?=\W)</code></td>
  <td>Word boundaries</td>
</tr>
<tr>
  <td><code>[:cntrl:]</code></td>
  <td></td>
  <td></td>
  <td><code>[\x00-\x1F\x7F]</code></td>
  <td><a title="Control character" href="//en.wikipedia.org/wiki/Control_character">Control characters</a></td>
</tr>
<tr>
  <td><code>[:digit:]</code></td>
  <td></td>
  <td><code>\d</code></td>
  <td><code>[0-9]</code></td>
  <td>Digits</td>
</tr>
<tr>
  <td></td>
  <td></td>
  <td><code>\D</code></td>
  <td><code>[^0-9]</code></td>
  <td>Non-digits</td>
</tr>
<tr>
  <td><code>[:graph:]</code></td>
  <td></td>
  <td></td>
  <td><code>[\x21-\x7E]</code></td>
  <td>Visible characters</td>
</tr>
<tr>
  <td><code>[:lower:]</code></td>
  <td></td>
  <td></td>
  <td><code>[a-z]</code></td>
  <td>Lowercase letters</td>
</tr>
<tr>
  <td><code>[:print:]</code></td>
  <td></td>
  <td></td>
  <td><code>[\x20-\x7E]</code></td>
  <td>Visible characters and the space character</td>
</tr>
<tr>
  <td><code>[:punct:]</code></td>
  <td></td>
  <td></td>
  <td><code>[\]\[!"#$%&amp;'()*+,./:;&lt;=&gt;?@\^_`{|}~-]</code></td>
  <td>Punctuation characters</td>
</tr>
<tr>
  <td><code>[:space:]</code></td>
  <td></td>
  <td><code>\s</code></td>
  <td><code>[ \t\r\n\v\f]</code></td>
  <td><a title="Whitespace character" href="//en.wikipedia.org/wiki/Whitespace_character">Whitespace characters</a></td>
</tr>
<tr>
  <td></td>
  <td></td>
  <td><code>\S</code></td>
  <td><code>[^ \t\r\n\v\f]</code></td>
  <td>Non-whitespace characters</td>
</tr>
<tr>
  <td><code>[:upper:]</code></td>
  <td></td>
  <td></td>
  <td><code>[A-Z]</code></td>
  <td>Uppercase letters</td>
</tr>
<tr>
  <td><code>[:xdigit:]</code></td>
  <td></td>
  <td></td>
  <td><code>[A-Fa-f0-9]</code></td>
  <td>Hexadecimal digits</td>
</tr>
</table>
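A named class such as `[:digit:]` is only meaningful inside a bracket expression, which is why it appears with doubled brackets (`[[:digit:]]`) in real patterns. The sketch below illustrates this with `preg_match()`, which accepts the POSIX class names inside brackets; the subjects and the names `$classCases` and `$classResults` are invented for the example.

```php
<?php
// Each entry: [pattern, subject, expected preg_match() result].
$classCases = [
    ['/^[[:alpha:]]+$/',            'Regex',        1], // letters only
    ['/^[[:alpha:]]+$/',            'Regex1',       0], // the digit is not alphabetic
    ['/^[[:digit:]]+$/',            '2024',         1], // same as [0-9]+
    ['/^[[:xdigit:]]+$/',           'DEADbeef',     1], // hexadecimal digits
    ['/^[[:xdigit:]]+$/',           'xyz',          0], // x, y, z are not hex digits
    ['/^[[:alnum:]_]+$/',           'snake_case_1', 1], // a class can be mixed with literals
    ['/[[:space:]]/',               'no-spaces',    0], // no whitespace anywhere
    ['/^[[:upper:]][[:lower:]]+$/', 'Php',          1], // one capital, then lowercase
];

$classResults = [];
foreach ($classCases as [$pattern, $subject, $expected]) {
    $found = preg_match($pattern, $subject);
    $classResults[] = $found;
    printf("%-30s %-13s => %d (expected %d)\n", $pattern, $subject, $found, $expected);
}
```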

Together, these four components make up the final regular expression.
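To make the four components concrete, here is a small sketch that assembles a pattern from one piece of each kind. The pattern is hypothetical and is not the example from the book: it describes an identifier made of the literal `user-`, some lowercase letters, and two to four digits. The names `$composedPattern` and `$composedResults` are likewise made up for illustration.

```php
<?php
$literal = 'user-';       // literal: matched exactly
$letters = '[a-z]+';      // character class with the quantifier +
$digits  = '[0-9]{2,4}';  // character class with the quantifier {2,4}

// The metacharacters ^ and $ anchor the pattern to the whole subject.
$composedPattern = '/^' . $literal . $letters . $digits . '$/';

$composedResults = [];
foreach (['user-bob42', 'user-bob4', 'user-42', 'admin-bob42', 'user-alice12345'] as $subject) {
    $composedResults[$subject] = preg_match($composedPattern, $subject);
    printf("%-16s => %d\n", $subject, $composedResults[$subject]);
}
```

Only `user-bob42` satisfies every part: the other subjects have too few digits, no letters, the wrong literal prefix, or too many digits.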
----------------------------------------

Note

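PHP's POSIX Extended regular expression functions (ereg() and its relatives) were deprecated in PHP 5.3 and removed in PHP 7.0, so they should no longer be used. Use Perl-compatible regular expressions (the preg_* functions) instead.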
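As a rough guide to what the switch looks like, the sketch below pairs each legacy call (shown only in a comment, since those functions no longer exist in current PHP) with a `preg_*` replacement. The main change is that a PCRE pattern must be wrapped in delimiters, here `/`, and case-insensitivity becomes the `i` modifier. The sample inputs and variable names are assumptions made for illustration, and patterns that rely on POSIX-only behaviour may need more careful conversion than this.

```php
<?php
// Legacy: ereg('^([0-9]{4})-([0-9]{2})$', $input, $regs)
$input = '2013-07';
$isMatch = preg_match('/^([0-9]{4})-([0-9]{2})$/', $input, $parts);

// Legacy: eregi('^php$', 'PHP')  (case-insensitive variant)
$caseInsensitive = preg_match('/^php$/i', 'PHP');

// Legacy: ereg_replace('[[:space:]]+', ' ', $text)
$collapsed = preg_replace('/[[:space:]]+/', ' ', "a \t\n b");

// Legacy: split('[,;]', 'a,b;c')
$fields = preg_split('/[,;]/', 'a,b;c');

var_dump($isMatch, $parts, $caseInsensitive, $collapsed, $fields);
```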
See also
"Countermeasures for the 'Function ereg() is deprecated' Error"