POSIX Extended Regular Expressions in PHP
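Regular expressions
I studied regular expressions in Theory of Computation but had never actually used them. This time, both the JavaScript and the PHP tutorials I was reading covered regular expressions, so I finally picked up a little about how to use the tool.
Regular expressions make string processing fast and convenient. As the book says, they are simple to use but not easy to write.
PHP uses two regular expression notations: POSIX Extended and PCRE. The book covers only the simpler POSIX Extended notation.
In the theory of computation, every regular expression denotes a language; in a program, the regular expression is called a pattern. The following is an example of a pattern:
The regular expression above has four kinds of components: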
1. Literal
Matched exactly, such as pattern in the expression above.
2. Metacharacter
A symbol with a special meaning, such as ^ \ () + ? & in the expression above.
The tables below list the available metacharacters (including quantifiers), taken from the Wikipedia article Regular expression.
POSIX Basic Regular Expressions
<table>
<tr><th>Metacharacter</th><th>Description</th></tr>
<tr><td><code>.</code></td><td>Matches any single character (many applications exclude <a title="Newline" href="//en.wikipedia.org/wiki/Newline">newlines</a>, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, <code>a.c</code> matches "<em>abc</em>", etc., but <code>[a.c]</code> matches only "<em>a</em>", "<em>.</em>", or "<em>c</em>".</td></tr>
<tr><td><code>[ ]</code></td><td>A bracket expression. Matches a single character that is contained within the brackets. For example, <code>[abc]</code> matches "<em>a</em>", "<em>b</em>", or "<em>c</em>". <code>[a-z]</code> specifies a range which matches any lowercase letter from "<em>a</em>" to "<em>z</em>". These forms can be mixed: <code>[abcx-z]</code> matches "<em>a</em>", "<em>b</em>", "<em>c</em>", "<em>x</em>", "<em>y</em>", or "<em>z</em>", as does <code>[a-cx-z]</code>. The <code>-</code> character is treated as a literal character if it is the last or the first (after the <code>^</code>) character within the brackets: <code>[abc-]</code>, <code>[-abc]</code>. Note that backslash escapes are not allowed. The <code>]</code> character can be included in a bracket expression if it is the first (after the <code>^</code>) character: <code>[]abc]</code>.</td></tr>
<tr><td><code>[^ ]</code></td><td>Matches a single character that is not contained within the brackets. For example, <code>[^abc]</code> matches any character other than "<em>a</em>", "<em>b</em>", or "<em>c</em>". <code>[^a-z]</code> matches any single character that is not a lowercase letter from "<em>a</em>" to "<em>z</em>". Likewise, literal characters and ranges can be mixed.</td></tr>
<tr><td><code>^</code></td><td>Matches the starting position within the string. In line-based tools, it matches the starting position of any line.</td></tr>
<tr><td><code>$</code></td><td>Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.</td></tr>
<tr><td><code>\( \)</code></td><td>Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, <code>\<em>n</em></code>). A marked subexpression is also called a block or capturing group.</td></tr>
<tr><td><code>\<em>n</em></code></td><td>Matches what the <em>n</em>th marked subexpression matched, where <em>n</em> is a digit from 1 to 9. This construct is theoretically <strong>irregular</strong> and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.</td></tr>
<tr><td><code>*</code></td><td>Matches the preceding element zero or more times. For example, <code>ab*c</code> matches "<em>ac</em>", "<em>abc</em>", "<em>abbbc</em>", etc. <code>[xyz]*</code> matches "", "<em>x</em>", "<em>y</em>", "<em>z</em>", "<em>zx</em>", "<em>zyx</em>", "<em>xyzzy</em>", and so on. <code>\(ab\)*</code> matches "", "<em>ab</em>", "<em>abab</em>", "<em>ababab</em>", and so on.</td></tr>
<tr><td><code>\{<em>m</em>,<em>n</em>\}</code></td><td>Matches the preceding element at least <em>m</em> and not more than <em>n</em> times. For example, <code>a\{3,5\}</code> matches only "<em>aaa</em>", "<em>aaaa</em>", and "<em>aaaaa</em>". This is not found in a few older instances of regular expressions.</td></tr>
</table>
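The entries above can be tried out with a short script. This is only a sketch: it assumes PHP 7 or later and uses `preg_match()` from the PCRE extension, so groups and intervals are written `( )` and `{m,n}` rather than in the backslashed POSIX BRE form. The helper name `matchesPattern` is made up for this illustration.

```php
<?php
// Hypothetical helper: true when $subject matches the PCRE $pattern.
function matchesPattern(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// PCRE writes groups and intervals as ( ) and {m,n}, without the
// backslashes that POSIX BRE requires.
var_dump(matchesPattern('/^a.c$/', 'abc'));      // bool(true): . is any single character
var_dump(matchesPattern('/^[abc]$/', 'b'));      // bool(true): bracket expression
var_dump(matchesPattern('/^[^abc]$/', 'b'));     // bool(false): negated bracket expression
var_dump(matchesPattern('/^ab*c$/', 'abbbc'));   // bool(true): zero or more "b"
var_dump(matchesPattern('/^a{3,5}$/', 'aa'));    // bool(false): needs 3 to 5 "a"
var_dump(matchesPattern('/^(ab)\1$/', 'abab'));  // bool(true): \1 recalls group 1
```

The `^` and `$` anchors are there so that each pattern is tested against the whole subject rather than any substring of it.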
POSIX Extended Regular Expressions
<table>
<tr><th>Metacharacter</th><th>Description</th></tr>
<tr><td><code>?</code></td><td>Matches the preceding element zero or one time. For example, <code>ba?</code> matches "<em>b</em>" or "<em>ba</em>".</td></tr>
<tr><td><code>+</code></td><td>Matches the preceding element one or more times. For example, <code>ba+</code> matches "<em>ba</em>", "<em>baa</em>", "<em>baaa</em>", and so on.</td></tr>
<tr><td><code>|</code></td><td>The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, <code>abc|def</code> matches "<em>abc</em>" or "<em>def</em>".</td></tr>
</table>
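The three extended operators behave the same way in PCRE, so they can be checked with `preg_match()`. The sketch below assumes PHP 7 or later; `fullMatch` is a hypothetical helper that wraps the pattern in `^(?:...)$` so that the whole subject must match, and it assumes the pattern contains no `/`.

```php
<?php
// Hypothetical helper: true when the whole subject matches the pattern.
// The (?:...) group keeps the anchors outside an alternation such as abc|def.
function fullMatch(string $pattern, string $subject): bool
{
    return preg_match('/^(?:' . $pattern . ')$/', $subject) === 1;
}

var_dump(fullMatch('ba?', 'b'));           // bool(true): "a" is optional
var_dump(fullMatch('ba?', 'baa'));         // bool(false): at most one "a"
var_dump(fullMatch('ba+', 'b'));           // bool(false): + needs at least one "a"
var_dump(fullMatch('ba+', 'baaa'));        // bool(true)
var_dump(fullMatch('abc|def', 'def'));     // bool(true): either alternative
var_dump(fullMatch('abc|def', 'abcdef'));  // bool(false): not both at once
```

Without the non-capturing group, `^abc|def$` would mean "starts with abc, or ends with def", which is a common surprise with alternation.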
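3. Quantifier
Specifies how many times a part (or pattern) occurs, such as { } + * in the expression above.
4. Character class
Placing characters inside square brackets [] creates a character class, which describes the kinds of character that are accepted, such as [0-9] and [a-z] in the expression above.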
The table below lists the available character classes, taken from the Wikipedia article Regular expression.
<table>
<tr><th>POSIX</th><th>Non-standard</th><th>Perl</th><th>ASCII</th><th>Description</th></tr>
<tr><td><code>[:alnum:]</code></td><td></td><td></td><td><code>[A-Za-z0-9]</code></td><td>Alphanumeric characters</td></tr>
<tr><td></td><td><code>[:word:]</code></td><td><code>\w</code></td><td><code>[A-Za-z0-9_]</code></td><td>Alphanumeric characters plus "_"</td></tr>
<tr><td></td><td></td><td><code>\W</code></td><td><code>[^A-Za-z0-9_]</code></td><td>Non-word characters</td></tr>
<tr><td><code>[:alpha:]</code></td><td></td><td></td><td><code>[A-Za-z]</code></td><td>Alphabetic characters</td></tr>
<tr><td><code>[:blank:]</code></td><td></td><td></td><td><code>[ <a title="\t" href="//en.wikipedia.org/wiki/%5Ct">\t</a>]</code></td><td>Space and tab</td></tr>
<tr><td></td><td></td><td><code>\b</code></td><td><code>(?<=\W)(?=\w)|(?<=\w)(?=\W)</code></td><td>Word boundaries</td></tr>
<tr><td><code>[:cntrl:]</code></td><td></td><td></td><td><code>[\x00-\x1F\x7F]</code></td><td><a title="Control character" href="//en.wikipedia.org/wiki/Control_character">Control characters</a></td></tr>
<tr><td><code>[:digit:]</code></td><td></td><td><code>\d</code></td><td><code>[0-9]</code></td><td>Digits</td></tr>
<tr><td></td><td></td><td><code>\D</code></td><td><code>[^0-9]</code></td><td>Non-digits</td></tr>
<tr><td><code>[:graph:]</code></td><td></td><td></td><td><code>[\x21-\x7E]</code></td><td>Visible characters</td></tr>
<tr><td><code>[:lower:]</code></td><td></td><td></td><td><code>[a-z]</code></td><td>Lowercase letters</td></tr>
<tr><td><code>[:print:]</code></td><td></td><td></td><td><code>[\x20-\x7E]</code></td><td>Visible characters and the space character</td></tr>
<tr><td><code>[:punct:]</code></td><td></td><td></td><td><code>[\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]</code></td><td>Punctuation characters</td></tr>
<tr><td><code>[:space:]</code></td><td></td><td><code>\s</code></td><td><code>[ <a title="\t" href="//en.wikipedia.org/wiki/%5Ct">\t</a><a title="\r" href="//en.wikipedia.org/wiki/%5Cr">\r</a><a title="\n" href="//en.wikipedia.org/wiki/%5Cn">\n</a><a title="\v" href="//en.wikipedia.org/wiki/%5Cv">\v</a><a title="\f" href="//en.wikipedia.org/wiki/%5Cf">\f</a>]</code></td><td><a title="Whitespace character" href="//en.wikipedia.org/wiki/Whitespace_character">Whitespace characters</a></td></tr>
<tr><td></td><td></td><td><code>\S</code></td><td><code>[^ \t\r\n\v\f]</code></td><td>Non-whitespace characters</td></tr>
<tr><td><code>[:upper:]</code></td><td></td><td></td><td><code>[A-Z]</code></td><td>Uppercase letters</td></tr>
<tr><td><code>[:xdigit:]</code></td><td></td><td></td><td><code>[A-Fa-f0-9]</code></td><td>Hexadecimal digits</td></tr>
</table>
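To see how a named class relates to its ASCII range, the two spellings can be compared side by side. The sketch below assumes PHP 7 or later with the PCRE functions, which accept POSIX class names when they are written inside a bracket expression, for example `[[:xdigit:]]`. The function names are invented for this example.

```php
<?php
// True when $s is "#" followed by exactly six hexadecimal digits,
// written with the POSIX class name.
function isHexColor(string $s): bool
{
    return preg_match('/^#[[:xdigit:]]{6}$/', $s) === 1;
}

// The same check written with the explicit ASCII range from the table.
function isHexColorAscii(string $s): bool
{
    return preg_match('/^#[A-Fa-f0-9]{6}$/', $s) === 1;
}

var_dump(isHexColor('#1a2B3c'));       // bool(true)
var_dump(isHexColor('#12345g'));       // bool(false): "g" is not a hex digit
var_dump(isHexColorAscii('#1a2B3c'));  // bool(true)

// [[:digit:]] and \d pick out the same digit runs in an ASCII string.
preg_match_all('/[[:digit:]]+/', 'tel 555 0199', $posix);
preg_match_all('/\d+/', 'tel 555 0199', $perl);
var_dump($posix[0] === $perl[0]);      // bool(true)
```

Note the doubled brackets: `[:xdigit:]` on its own is only the class name, and it has to sit inside an enclosing `[ ]` to take effect.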
Together, these four components make up the complete regular expression.
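As a rough illustration of how the four components combine, here is a pattern made up purely for this purpose (it is not the pattern from the book), checked with `preg_match()` under the assumption of PHP 7 or later.

```php
<?php
// Hypothetical pattern, built only to show the four kinds of component:
//   ^ and $   metacharacters anchoring the start and the end
//   id-       literal text, matched exactly
//   [0-9]     character class for a digit
//   {3}       quantifier: exactly three of the preceding element
//   [a-z]     character class for a lowercase letter
//   +         quantifier: one or more of the preceding element
$pattern = '/^id-[0-9]{3}[a-z]+$/';

foreach (['id-123abc', 'id-12abc', 'ID-123abc', 'id-123'] as $subject) {
    $result = preg_match($pattern, $subject) === 1 ? 'match' : 'no match';
    echo $subject, ': ', $result, "\n";
}
```

Only the first subject matches: the second has too few digits, the third fails because the literal part is case-sensitive, and the fourth lacks the trailing letters.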
----------------------------------------
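Note
Later versions of PHP may no longer support POSIX Extended regular expressions: the ereg family of functions was deprecated in PHP 5.3.0 and removed in PHP 7.0.0. Continued use is discouraged; use Perl-compatible regular expressions (PCRE) instead.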
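For simple patterns, moving from the POSIX functions to PCRE mostly means adding delimiters around the pattern. The following is a minimal sketch rather than a complete migration tool: `ereToPcre` is a hypothetical helper, and it assumes the pattern uses only syntax that both flavors share and contains no already-escaped `/`.

```php
<?php
// Hypothetical helper: wrap a POSIX ERE pattern (as once passed to ereg())
// in PCRE delimiters, escaping any "/" so it cannot end the pattern early.
function ereToPcre(string $ere, bool $caseInsensitive = false): string
{
    $escaped = str_replace('/', '\/', $ere);
    return '/' . $escaped . '/' . ($caseInsensitive ? 'i' : '');
}

// Before (removed in PHP 7):  ereg('^[a-z]+[0-9]*$', $name)
$name = 'user42';
var_dump(preg_match(ereToPcre('^[a-z]+[0-9]*$'), $name) === 1);  // bool(true)

// Before (removed in PHP 7):  eregi('^[a-z]+$', 'ABC')
// The "i" modifier takes over the case-insensitive behavior of eregi().
var_dump(preg_match(ereToPcre('^[a-z]+$', true), 'ABC') === 1);  // bool(true)
```

Comparing the result with `=== 1` is deliberate: `preg_match()` returns 1 for a match, 0 for no match, and `false` on error.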
See also
"Function ereg() is deprecated" Error: Countermeasures