/se3-unattended/var/se3/unattended/install/linuxaux/opt/perl/lib/5.10.0/pod/ -> perlrecharclass.pod (source)

   1  =head1 NAME
   2  
   3  perlrecharclass - Perl Regular Expression Character Classes
   4  
   5  =head1 DESCRIPTION
   6  
   7  The top level documentation about Perl regular expressions
   8  is found in L<perlre>.
   9  
  10  This manual page discusses the syntax and use of character
  11  classes in Perl Regular Expressions.
  12  
  13  A character class is a way of denoting a set of characters,
  14  in such a way that one character of the set is matched.
  15  It's important to remember that matching a character class
  16  consumes exactly one character in the source string. (The source
  17  string is the string the regular expression is matched against.)
  18  
  19  There are three types of character classes in Perl regular
  20  expressions: the dot, backslashed sequences, and the bracketed form.
  21  
  22  =head2 The dot
  23  
  24  The dot (or period), C<.>, is probably the most used, and certainly
  25  the most well-known character class. By default, a dot matches any
  26  character, except for the newline. The default can be changed to
  27  add matching the newline with the I<single line> modifier: either
  28  for the entire regular expression using the C</s> modifier, or
  29  locally using C<(?s)>.
  30  
  31  Here are some examples:
  32  
  33   "a"  =~  /./       # Match
  34   "."  =~  /./       # Match
  35   ""   =~  /./       # No match (dot has to match a character)
  36   "\n" =~  /./       # No match (dot does not match a newline)
  37   "\n" =~  /./s      # Match (global 'single line' modifier)
  38   "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
  39   "ab" =~  /^.$/     # No match (dot matches one character)
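
As a rough illustration of what the I<single line> modifier changes, the following sketch collects the runs of characters that the dot matches in a two-line string. The sample text and variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample text containing one newline.
my $text = "ab\ncd";

# Without /s the dot stops at the newline, so there is one run per line.
my @per_line = $text =~ /(.+)/g;

# With /s the dot also matches the newline, so one run covers everything.
my @whole = $text =~ /(.+)/gs;

print scalar(@per_line), " run(s) without /s, ",
      scalar(@whole), " run(s) with /s\n";
```

The same effect can be limited to part of a pattern with C<(?s:...)>, as in the examples above.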
  40  
  41  
  42  =head2 Backslashed sequences
  43  
  44  Perl regular expressions contain many backslashed sequences that
  45  constitute a character class. That is, they will match a single
  46  character, if that character belongs to a specific set of characters
  47  (defined by the sequence). A backslashed sequence is a sequence of
  48  characters starting with a backslash. Not all backslashed sequences
  49  are character classes; for a full list, see L<perlrebackslash>.
  50  
  51  Here's a list of the backslashed sequences, which are discussed in
  52  more detail below.
  53  
  54   \d             Match a digit character.
  55   \D             Match a non-digit character.
  56   \w             Match a "word" character.
  57   \W             Match a non-"word" character.
  58   \s             Match a white space character.
  59   \S             Match a non-white space character.
  60   \h             Match a horizontal white space character.
  61   \H             Match a character that isn't horizontal white space.
  62   \v             Match a vertical white space character.
  63   \V             Match a character that isn't vertical white space.
  64   \pP, \p{Prop}  Match a character matching a Unicode property.
  65   \PP, \P{Prop}  Match a character that doesn't match a Unicode property.
  66  
  67  =head3 Digits
  68  
  69  C<\d> matches a single character that is considered to be a I<digit>.
  70  What is considered a digit depends on the internal encoding of
  71  the source string. If the source string is in UTF-8 format, C<\d>
  72  not only matches the digits '0' - '9', but also Arabic, Devanagari and
  73  digits from other languages. Otherwise, if there is a locale in effect,
  74  it will match whatever characters the locale considers digits. Without
  75  a locale, C<\d> matches the digits '0' to '9'.
  76  See L</Locale, Unicode and UTF-8>.
  77  
  78  Any character that isn't matched by C<\d> will be matched by C<\D>.
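
A minimal sketch of the difference between C<\d> and an explicit ASCII range, assuming a string that holds a non-ASCII digit. The character chosen (ARABIC-INDIC DIGIT THREE) and the variable names are only illustrative.

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE. Its code point exceeds 255, so the
# string holding it is stored in UTF-8 format internally.
my $arabic_three = "\x{0663}";

my $matches_d     = $arabic_three =~ /\d/    ? 1 : 0;  # Unicode digit
my $matches_ascii = $arabic_three =~ /[0-9]/ ? 1 : 0;  # ASCII digits only
my $matches_D     = $arabic_three =~ /\D/    ? 1 : 0;  # complement of \d

print "\\d: $matches_d, [0-9]: $matches_ascii, \\D: $matches_D\n";
```

If only the ten ASCII digits are wanted, spelling out C<[0-9]> avoids the dependence on the string's encoding.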
  79  
  80  =head3 Word characters
  81  
  82  C<\w> matches a single I<word> character: an alphanumeric character
  83  (that is, an alphabetic character, or a digit), or the underscore (C<_>).
  84  What is considered a word character depends on the internal encoding
  85  of the string. If it's in UTF-8 format, C<\w> matches those characters
  86  that are considered word characters in the Unicode database. That is, it
  87  not only matches ASCII letters, but also Thai letters, Greek letters, etc.
  88  If the source string isn't in UTF-8 format, C<\w> matches those characters
  89  that are considered word characters by the current locale. Without
  90  a locale in effect, C<\w> matches the ASCII letters, digits and the
  91  underscore.
  92  
  93  Any character that isn't matched by C<\w> will be matched by C<\W>.
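
The following sketch checks a handful of characters against C<\w>. The list of characters is an arbitrary choice for illustration; the Thai letter is the one used elsewhere in this document.

```perl
use strict;
use warnings;

# Record, for each sample character, whether \w matches it.
my %is_word;
for my $char ("a", "7", "_", "-", "\x{0e0b}") {
    $is_word{$char} = $char =~ /\w/ ? 1 : 0;
}

# Print code points rather than the characters themselves, so no
# output encoding layer is needed.
for my $char (sort keys %is_word) {
    printf "U+%04X word character? %s\n",
        ord($char), $is_word{$char} ? "yes" : "no";
}
```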
  94  
  95  =head3 White space
  96  
  97  C<\s> matches any single character that is considered white space. In the
  98  ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
  99  (C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
 100  space (the vertical tab, C<\cK>, is not matched by C<\s>).  The exact set
 101  of characters matched by C<\s> depends on whether the source string is
 102  in UTF-8 format. If it is, C<\s> matches what is considered white space
 103  in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
 104  matches whatever is considered white space by the current locale. Without
 105  a locale, C<\s> matches the five characters mentioned in the beginning
 106  of this paragraph.  Perhaps the most notable difference is that C<\s>
 107  matches a non-breaking space only if the non-breaking space is in a
 108  UTF-8 encoded string.
 109  
 110  Any character that isn't matched by C<\s> will be matched by C<\S>.
 111  
 112  C<\h> will match any character that is considered horizontal white space;
 113  this includes the space and the tab characters. C<\H> will match any character
 114  that is not considered horizontal white space.
 115  
 116  C<\v> will match any character that is considered vertical white space;
 117  this includes the carriage return and line feed characters (newline).
 118  C<\V> will match any character that is not considered vertical white space.
 119  
 120  C<\R> matches anything that can be considered a newline under Unicode
 121  rules. It's not a character class, as it can match a multi-character
 122  sequence. Therefore, it cannot be used inside a bracketed character
 123  class. Details are discussed in L<perlrebackslash>.
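
To make the difference between C<\R> and a true character class concrete, here is a small sketch. The sample strings are invented; it assumes perl 5.10.0 or later, where C<\R> and C<\v> exist.

```perl
use strict;
use warnings;

my $crlf = "\r\n";

# \R consumes the two-character CR LF sequence as a single newline.
my $r_matches_pair = $crlf =~ /^\R\z/ ? 1 : 0;

# \v is a character class, so it can consume only one character.
my $v_matches_pair = $crlf =~ /^\v\z/ ? 1 : 0;

# Splitting on \R handles mixed line endings in one pass.
my @lines = split /\R/, "one\r\ntwo\nthree\rfour";

print "\\R: $r_matches_pair, \\v: $v_matches_pair, lines: @lines\n";
```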
 124  
 125  C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.
 126  
 127  Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
 128  the same characters, regardless of whether the source string is in UTF-8
 129  format or not. The set of characters they match is also not influenced
 130  by locale.
 131  
 132  One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
 133  The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
 134  considered vertical white space. Furthermore, if the source string is
 135  not in UTF-8 format, the next line (C<"\x85">) and the no-break space
 136  (C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
 137  If the source string is in UTF-8 format, both the next line and the
 138  no-break space are matched by C<\s>.
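
A hedged sketch of the points above: it checks the vertical tab and a single-byte no-break space against C<\h> and C<\v>, whose behaviour does not depend on the string's encoding. The C<\s> result for the vertical tab is only printed, not relied upon, because this page describes perl 5.10.0 and other perl versions may behave differently there.

```perl
use strict;
use warnings;

my $vt   = "\x0b";   # vertical tab (LINE TABULATION)
my $nbsp = "\xA0";   # no-break space, not in UTF-8 format here

my $vt_is_v   = $vt   =~ /\v/ ? 1 : 0;
my $vt_is_h   = $vt   =~ /\h/ ? 1 : 0;
my $nbsp_is_h = $nbsp =~ /\h/ ? 1 : 0;
my $nbsp_is_v = $nbsp =~ /\v/ ? 1 : 0;

# Reported rather than assumed: this is the version-sensitive part.
my $vt_is_s = $vt =~ /\s/ ? 1 : 0;
printf "perl %vd: vertical tab matched by \\s? %s\n",
    $^V, $vt_is_s ? "yes" : "no";
```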
 139  
 140  The following table is a complete listing of characters matched by
 141  C<\s>, C<\h> and C<\v>.
 142  
 143  The first column gives the code point of the character (in hex format),
 144  the second column gives the (Unicode) name. The third column indicates
 145  by which class(es) the character is matched.
 146  
 147   0x00009        CHARACTER TABULATION   h s
 148   0x0000a              LINE FEED (LF)    vs
 149   0x0000b             LINE TABULATION    v
 150   0x0000c              FORM FEED (FF)    vs
 151   0x0000d        CARRIAGE RETURN (CR)    vs
 152   0x00020                       SPACE   h s
 153   0x00085             NEXT LINE (NEL)    vs  [1]
 154   0x000a0              NO-BREAK SPACE   h s  [1]
 155   0x01680            OGHAM SPACE MARK   h s
 156   0x0180e   MONGOLIAN VOWEL SEPARATOR   h s
 157   0x02000                     EN QUAD   h s
 158   0x02001                     EM QUAD   h s
 159   0x02002                    EN SPACE   h s
 160   0x02003                    EM SPACE   h s
 161   0x02004          THREE-PER-EM SPACE   h s
 162   0x02005           FOUR-PER-EM SPACE   h s
 163   0x02006            SIX-PER-EM SPACE   h s
 164   0x02007                FIGURE SPACE   h s
 165   0x02008           PUNCTUATION SPACE   h s
 166   0x02009                  THIN SPACE   h s
 167   0x0200a                  HAIR SPACE   h s
 168   0x02028              LINE SEPARATOR    vs
 169   0x02029         PARAGRAPH SEPARATOR    vs
 170   0x0202f       NARROW NO-BREAK SPACE   h s
 171   0x0205f   MEDIUM MATHEMATICAL SPACE   h s
 172   0x03000           IDEOGRAPHIC SPACE   h s
 173  
 174  =over 4
 175  
 176  =item [1]
 177  
 178  NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
 179  UTF-8 format.
 180  
 181  =back
 182  
 183  It is worth noting that C<\d>, C<\w>, etc, match single characters, not
 184  complete numbers or words. To match a number (that consists of digits),
 185  use C<\d+>; to match a word, use C<\w+>.
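
As an illustration, the sketch below pulls numbers and words out of an invented sample line, and contrasts C<\d+> with a bare C<\d>.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = "order 66 has 3 items";

my @numbers       = $line =~ /\d+/g;  # whole runs of digits
my @single_digits = $line =~ /\d/g;   # one digit per match
my @words         = $line =~ /\w+/g;  # runs of word characters

print "numbers: @numbers\n";
print "digits:  @single_digits\n";
print "words:   @words\n";
```

Note that C<\w+> also picks up the numbers, since digits are word characters.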
 186  
 187  
 188  =head3 Unicode Properties
 189  
 190  C<\pP> and C<\p{Prop}> are character classes to match characters that
 191  fit given Unicode classes. One-letter classes can be used in the C<\pP>
 192  form, with the class name following the C<\p>; otherwise, the property
 193  name is enclosed in braces, and follows the C<\p>. For instance, a
 194  match for a number can be written as C</\pN/> or as C</\p{Number}/>.
 195  Lowercase letters are matched by the property I<LowercaseLetter>, which
 196  has the short form I<Ll>. They have to be written as C</\p{Ll}/> or
 197  C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
 198  It matches a two character string: a letter (Unicode property C<\pL>),
 199  followed by a lowercase C<l>.
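
The pitfall with C</\pLl/> can be checked directly. This is a small sketch with made-up test strings, not an exhaustive test of the property syntax.

```perl
use strict;
use warnings;

# \pLl is "any letter" followed by a literal lowercase l.
my $pLl_on_two_chars = "Al" =~ /^\pLl\z/ ? 1 : 0;
my $pLl_on_one_char  = "a"  =~ /^\pLl\z/ ? 1 : 0;

# \p{Ll} is the lowercase-letter property applied to one character.
my $Ll_on_lower = "a" =~ /^\p{Ll}\z/ ? 1 : 0;
my $Ll_on_upper = "A" =~ /^\p{Ll}\z/ ? 1 : 0;

print "\\pLl:   Al=$pLl_on_two_chars a=$pLl_on_one_char\n";
print "\\p{Ll}: a=$Ll_on_lower A=$Ll_on_upper\n";
```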
 200  
 201  For a list of possible properties, see
 202  L<perlunicode/Unicode Character Properties>. It is also possible to
 203  define your own properties. This is discussed in
 204  L<perlunicode/User-Defined Character Properties>.
 205  
 206  
 207  =head4 Examples
 208  
 209   "a"  =~  /\w/      # Match, "a" is a 'word' character.
 210   "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 211   "a"  =~  /\d/      # No match, "a" isn't a digit.
 212   "7"  =~  /\d/      # Match, "7" is a digit.
 213   " "  =~  /\s/      # Match, a space is white space.
 214   "a"  =~  /\D/      # Match, "a" is a non-digit.
 215   "7"  =~  /\D/      # No match, "7" is not a non-digit.
 216   " "  =~  /\S/      # No match, a space is not non-white space.
 217  
 218   " "  =~  /\h/      # Match, space is horizontal white space.
 219   " "  =~  /\v/      # No match, space is not vertical white space.
 220   "\r" =~  /\v/      # Match, a return is vertical white space.
 221  
 222   "a"  =~  /\pL/     # Match, "a" is a letter.
 223   "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.
 224  
 225   "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
 226                             # 'THAI CHARACTER SO SO', and that's in
 227                             # the Thai Unicode class.
 228   "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.
 229  
 230  
 231  =head2 Bracketed Character Classes
 232  
 233  The third form of character class you can use in Perl regular expressions
 234  is the bracketed form. In its simplest form, it lists the characters
 235  that may be matched inside square brackets, like this: C<[aeiou]>.
 236  This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as the other
 237  character classes, exactly one character will be matched. To match
 238  a longer string consisting of characters mentioned in the character
 239  class, follow the character class with a quantifier. For instance,
 240  C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.
 241  
 242  Repeating a character in a character class has no
 243  effect; it's considered to be in the set only once.
 244  
 245  Examples:
 246  
 247   "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 248   "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 249   "ae" =~  /^[aeiou]$/      # No match, a character class only matches
 250                             # a single character.
 251   "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
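
A short sketch of a bracketed class in use, with invented sample words: counting vowels, capturing a run of vowels with a quantifier, and confirming that a repeated character in the class changes nothing.

```perl
use strict;
use warnings;

# Count matches by assigning the match list to an empty list.
my $vowel_count = () = "education" =~ /[aeiou]/g;

# With a quantifier, the class matches a run of vowels.
my ($first_run) = "queue" =~ /([aeiou]+)/;

# The repeated "e" is in the set only once, so this still matches
# exactly one character.
my $dup_ok = "e" =~ /^[aeeiou]\z/ ? 1 : 0;

print "vowels: $vowel_count, first run: $first_run, dup: $dup_ok\n";
```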
 252  
 253  =head3 Special Characters Inside a Bracketed Character Class
 254  
 255  Most characters that are meta characters in regular expressions (that
 256  is, characters that carry a special meaning like C<*> or C<(>) lose
 257  their special meaning and can be used inside a character class without
 258  the need to escape them. For instance, C<[()]> matches either an opening
 259  parenthesis, or a closing parenthesis, and the parens inside the character
 260  class don't group or capture.
 261  
 262  Characters that may carry a special meaning inside a character class are:
 263  C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
 264  escaped with a backslash, although this is sometimes not needed, in which
 265  case the backslash may be omitted.
 266  
 267  The sequence C<\b> is special inside a bracketed character class. While
 268  outside the character class C<\b> is an assertion indicating a point
 269  that does not have either two word characters or two non-word characters
 270  on either side, inside a bracketed character class, C<\b> matches a
 271  backspace character.
 272  
 273  A C<[> is not special inside a character class, unless it's the start
 274  of a POSIX character class (see below). It normally does not need escaping.
 275  
 276  A C<]> is either the end of a POSIX character class (see below), or it
 277  signals the end of the bracketed character class. Normally it needs
 278  escaping if you want to include a C<]> in the set of characters.
 279  However, if the C<]> is the I<first> (or the second if the first
 280  character is a caret) character of a bracketed character class, it
 281  does not denote the end of the class (as you cannot have an empty class)
 282  and is considered part of the set of characters that can be matched without
 283  escaping.
 284  
 285  Examples:
 286  
 287   "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 288   "\cH" =~ /[\b]/      #  Match, \b inside a character class
 289                        #  is equivalent to a backspace.
 290   "]"   =~ /[][]/      #  Match, as the character class contains
 291                        #  both [ and ].
 292   "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
 293                        #  containing just [, and the character class is
 294                        #  followed by a ].
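
The rule for a leading C<]> also applies after a caret. The sketch below, using arbitrary test characters, exercises both positions and the special meaning of C<\b>.

```perl
use strict;
use warnings;

my $with_bracket = qr/^[]x]\z/;    # "]" listed first: the set is ] and x
my $negated      = qr/^[^]x]\z/;   # "]" right after the caret

my $plain_hits_bracket = "]" =~ $with_bracket ? 1 : 0;
my $plain_hits_y       = "y" =~ $with_bracket ? 1 : 0;
my $neg_hits_bracket   = "]" =~ $negated      ? 1 : 0;
my $neg_hits_y         = "y" =~ $negated      ? 1 : 0;

# Inside a class, \b is a backspace (character code 8).
my $backspace = "\x08" =~ /[\b]/ ? 1 : 0;

print "[]x]: $plain_hits_bracket $plain_hits_y, ",
      "[^]x]: $neg_hits_bracket $neg_hits_y, [\\b]: $backspace\n";
```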
 295  
 296  =head3 Character Ranges
 297  
 298  It is not uncommon to want to match a range of characters. Luckily, instead
 299  of listing all the characters in the range, one may use the hyphen (C<->).
 300  If inside a bracketed character class you have two characters separated
 301  by a hyphen, it's treated as if all the characters between the two are in
 302  the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
 303  matches any lowercase letter from the first half of the ASCII alphabet.
 304  
 305  Note that the two characters on either side of the hyphen are not
 306  necessarily both letters or both digits. Any character is possible,
 307  although not advisable.  C<['-?]> contains a range of characters, but
 308  most people will not know which characters that will be. Furthermore,
 309  such ranges may lead to portability problems if the code has to run on
 310  a platform that uses a different character set, such as EBCDIC.
 311  
 312  If a hyphen in a character class cannot be part of a range, for instance
 313  because it is the first or the last character of the character class,
 314  or if it immediately follows a range, the hyphen isn't special, and will be
 315  considered a character that may be matched. You have to escape the hyphen
 316  with a backslash if you want to have a hyphen in your set of characters to
 317  be matched, and its position in the class is such that it can be considered
 318  part of a range.
 319  
 320  Examples:
 321  
 322   [a-z]       #  Matches a character that is a lower case ASCII letter.
 323   [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
 324               #  letter 'z'.
 325   [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 326   [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
 327               #  hyphen ('-'), or the letter 'm'.
 328   ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
 329               #  (But not on an EBCDIC platform).
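
The hyphen rules can be verified with a few quick matches. This is a sketch with arbitrary test characters; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $after_range = qr/^[a-f-m]\z/;  # hyphen follows a range: literal
my $escaped     = qr/^[a\-z]\z/;   # escaped hyphen: literal, no range
my $leading     = qr/^[-z]\z/;     # hyphen first: literal

my $range_hyphen = "-" =~ $after_range ? 1 : 0;
my $range_g      = "g" =~ $after_range ? 1 : 0;  # not in a-f, not m
my $range_m      = "m" =~ $after_range ? 1 : 0;
my $esc_hyphen   = "-" =~ $escaped     ? 1 : 0;
my $esc_m        = "m" =~ $escaped     ? 1 : 0;  # only a, -, z are listed
my $lead_hyphen  = "-" =~ $leading     ? 1 : 0;

print "[a-f-m]: $range_hyphen $range_g $range_m, ",
      "[a\\-z]: $esc_hyphen $esc_m, [-z]: $lead_hyphen\n";
```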
 330  
 331  
 332  =head3 Negation
 333  
 334  It is also possible to instead list the characters you do not want to
 335  match. You can do so by using a caret (C<^>) as the first character in the
 336  character class. For instance, C<[^a-z]> matches a character that is not a
 337  lowercase ASCII letter.
 338  
 339  This syntax makes the caret a special character inside a bracketed character
 340  class, but only if it is the first character of the class. So if you want
 341  to have the caret as one of the characters you want to match, you either
 342  have to escape the caret, or not list it first.
 343  
 344  Examples:
 345  
 346   "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 347   "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 348   "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 349   "^"  =~  /[x^]/       #  Match, caret is not special here.
 350  
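
A common use of a negated class is to strip or collect everything outside a set. The following sketch uses invented input strings.

```perl
use strict;
use warnings;

# Delete every character that is not a lowercase ASCII letter.
my $raw = "a1b2-c3";
(my $letters_only = $raw) =~ s/[^a-z]//g;

# Collect the characters that are not lowercase vowels.
my @non_vowels = "perl" =~ /[^aeiou]/g;

# An escaped caret is an ordinary member of the set, even when first.
my $has_caret = "2^3" =~ /[\^x]/ ? 1 : 0;

print "letters: $letters_only, non-vowels: @non_vowels, caret: $has_caret\n";
```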
 351  =head3 Backslash Sequences
 352  
 353  You can put a backslash sequence character class inside a bracketed character
 354  class, and it will act just as if you put all the characters matched by
 355  the backslash sequence inside the character class. For instance,
 356  C<[a-f\d]> will match any digit, or any of the lowercase letters between
 357  'a' and 'f' inclusive.
 358  
 359  Examples:
 360  
 361   /[\p{Thai}\d]/     # Matches a character that is either a Thai
 362                      # character, or a digit.
 363   /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
 364                      # character, nor a parenthesis.
 365  
 366  Backslash sequence character classes cannot form one of the endpoints
 367  of a range.
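
As a sketch of mixing a range with a backslash sequence, the hypothetical helper C<looks_like_lower_hex> below accepts strings made only of digits and the letters 'a' to 'f'. The helper name is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string is non-empty and consists
# only of characters from the class [a-f\d].
sub looks_like_lower_hex {
    my ($candidate) = @_;
    return $candidate =~ /^[a-f\d]+\z/ ? 1 : 0;
}

for my $candidate ("beef42", "BEEF", "xyz") {
    printf "%-6s => %d\n", $candidate, looks_like_lower_hex($candidate);
}
```

Because C<\d> may match more than the ASCII digits in a UTF-8 string, C<[a-f0-9]> is the stricter choice when only ASCII input is acceptable.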
 368  
 369  =head3 POSIX Character Classes
 370  
 371  POSIX character classes have the form C<[:class:]>, where I<class> is the
 372  name, and C<[:> and C<:]> are the delimiters. POSIX character classes appear
 373  I<inside> bracketed character classes, and are a convenient and descriptive
 374  way of listing a group of characters. Be careful about the syntax:
 375  
 376   # Correct:
 377   $string =~ /[[:alpha:]]/
 378  
 379   # Incorrect (will warn):
 380   $string =~ /[:alpha:]/
 381  
 382  The latter pattern would be a character class consisting of a colon,
 383  and the letters C<a>, C<l>, C<p> and C<h>.
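
The effect of the missing outer brackets can be observed with a few test characters. In this sketch the C<regexp> warning category is switched off so that the incorrect form compiles quietly; the test characters are arbitrary.

```perl
use strict;
use warnings;
no warnings 'regexp';   # the incorrect form below would otherwise warn

my $correct   = qr/[[:alpha:]]/;
my $incorrect = qr/[:alpha:]/;    # really the set ':', 'a', 'l', 'p', 'h'

my $correct_z       = "z" =~ $correct   ? 1 : 0;
my $correct_colon   = ":" =~ $correct   ? 1 : 0;
my $incorrect_z     = "z" =~ $incorrect ? 1 : 0;
my $incorrect_colon = ":" =~ $incorrect ? 1 : 0;

print "correct:   z=$correct_z colon=$correct_colon\n";
print "incorrect: z=$incorrect_z colon=$incorrect_colon\n";
```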
 384  
 385  Perl recognizes the following POSIX character classes:
 386  
 387   alpha  Any alphabetical character.
 388   alnum  Any alphanumerical character.
 389   ascii  Any ASCII character.
 390   blank  A GNU extension, equal to a space or a horizontal tab (C<\t>).
 391   cntrl  Any control character.
 392   digit  Any digit, equivalent to C<\d>.
 393   graph  Any printable character, excluding a space.
 394   lower  Any lowercase character.
 395   print  Any printable character, including a space.
 396   punct  Any punctuation character.
 397   space  Any white space character. C<\s> plus the vertical tab (C<\cK>).
 398   upper  Any uppercase character.
 399   word   Any "word" character, equivalent to C<\w>.
 400   xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.
 401  
 402  The exact set of characters matched depends on whether the source string
 403  is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.
 404  
 405  Most POSIX character classes have C<\p> counterparts. The difference
 406  is that the C<\p> classes will always match according to the Unicode
 407  properties, regardless of whether the string is in UTF-8 format or not.
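
A hedged sketch of that difference, using a one-byte string that is not in UTF-8 format. The C<\p> result is checked; the POSIX class result is only printed, since it depends on the perl version and on any locale in effect.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, written as a single byte.
my $e_acute = "\xE9";

my $in_utf8_format = utf8::is_utf8($e_acute)      ? 1 : 0;
my $p_alpha        = $e_acute =~ /\p{IsAlpha}/    ? 1 : 0;
my $posix_alpha    = $e_acute =~ /[[:alpha:]]/    ? 1 : 0;

print "UTF-8 format: $in_utf8_format\n";
print "\\p{IsAlpha}:  $p_alpha\n";
print "[[:alpha:]]:  $posix_alpha (may vary)\n";
```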
 408  
 409  The following table shows the relation between POSIX character classes
 410  and the Unicode properties:
 411  
 412   [[:...:]]   \p{...}      backslash
 413  
 414   alpha       IsAlpha
 415   alnum       IsAlnum
 416   ascii       IsASCII
 417   blank
 418   cntrl       IsCntrl
 419   digit       IsDigit      \d
 420   graph       IsGraph
 421   lower       IsLower
 422   print       IsPrint
 423   punct       IsPunct
 424   space       IsSpace
 425               IsSpacePerl  \s
 426   upper       IsUpper
 427   word        IsWord
 428   xdigit      IsXDigit
 429  
 430  Some character classes may have a non-obvious name:
 431  
 432  =over 4
 433  
 434  =item cntrl
 435  
 436  Any control character. Usually, control characters don't produce output
 437  as such, but instead control the terminal somehow: for example newline
 438  and backspace are control characters. All characters with C<ord()> less
 439  than 32 are usually classified as control characters (in ASCII, the ISO
 440  Latin character sets, and Unicode), as is the character with an C<ord()> value
 441  of 127 (C<DEL>).
 442  
 443  =item graph
 444  
 445  Any character that is I<graphical>, that is, visible. This class consists
 446  of all the alphanumerical characters and all punctuation characters.
 447  
 448  =item print
 449  
 450  All printable characters, which is the set of all the graphical characters
 451  plus the space.
 452  
 453  =item punct
 454  
 455  Any punctuation (special) character.
 456  
 457  =back
 458  
 459  =head4 Negation
 460  
 461  A Perl extension to the POSIX character class is the ability to
 462  negate it. This is done by prefixing the class name with a caret (C<^>).
 463  Some examples:
 464  
 465   POSIX         Unicode       Backslash
 466   [[:^digit:]]  \P{IsDigit}   \D
 467   [[:^space:]]  \P{IsSpace}   \S
 468   [[:^word:]]   \P{IsWord}    \W
 469  
 470  =head4 [= =] and [. .]
 471  
 472  Perl will recognize the POSIX character classes C<[=class=]> and
 473  C<[.class.]>, but does not (yet?) support these constructs. Use of
 474  such a construct will lead to an error.
 475  
 476  
 477  =head4 Examples
 478  
 479   /[[:digit:]]/            # Matches a character that is a digit.
 480   /[01[:lower:]]/          # Matches a character that is either a
 481                            # lowercase letter, or '0' or '1'.
 482   /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
 483                            # but the letters 'a' to 'f' in either case.
 484                            # This is because the character class contains
 485                            # all digits, and anything that isn't a
 486                            # hex digit, resulting in a class containing
 487                            # all characters, but the letters 'a' to 'f'
 488                            # and 'A' to 'F'.
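
The last example can be checked character by character. The sketch below uses a few arbitrary ASCII test characters and also shows a negated POSIX class on its own.

```perl
use strict;
use warnings;

# All digits, plus everything that is not a hexadecimal digit.
my $class = qr/[[:digit:][:^xdigit:]]/;

my %hit;
for my $char ("5", "g", "-", "a", "F") {
    $hit{$char} = $char =~ $class ? 1 : 0;
}

# A negated POSIX class by itself.
my $x_not_digit    = "x" =~ /[[:^digit:]]/ ? 1 : 0;
my $five_not_digit = "5" =~ /[[:^digit:]]/ ? 1 : 0;

print join(", ", map { "$_=$hit{$_}" } sort keys %hit), "\n";
print "[[:^digit:]]: x=$x_not_digit 5=$five_not_digit\n";
```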
 489  
 490  
 491  =head2 Locale, Unicode and UTF-8
 492  
 493  Some of the character classes have a somewhat different behaviour depending
 494  on the internal encoding of the source string, and the locale that is
 495  in effect.
 496  
 497  C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
 498  including C<\W>, C<\D>, C<\S>) suffer from this behaviour.
 499  
 500  The rule is that if the source string is in UTF-8 format, the character
 501  classes match according to the Unicode properties. If the source string
 502  isn't, then the character classes match according to whatever locale is
 503  in effect. If there is no locale, they match the ASCII defaults
 504  (52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).
 505  
 506  This usually means that if you are matching against characters whose C<ord()>
 507  values are between 128 and 255 inclusive, your character class may match
 508  or not depending on the current locale, and whether the source string is
 509  in UTF-8 format. The string will be in UTF-8 format if it contains
 510  characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
 511  format without it having such characters.
 512  
 513  For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
 514  or the POSIX character classes, and use the Unicode properties instead.
 515  
 516  =head4 Examples
 517  
 518   $str =  "\xDF";      # $str is not in UTF-8 format.
 519   $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 520   $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 521   $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 522   chop $str;
 523   $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
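
To see when a string switches to UTF-8 format, the internal flag can be inspected with C<utf8::is_utf8>. The sketch below follows the same steps as the example above but records the flag instead of the match results, which may differ between perl versions.

```perl
use strict;
use warnings;

my $str = "\xDF";
my $before_append = utf8::is_utf8($str) ? 1 : 0;  # not in UTF-8 format

$str .= "\x{0e0b}";                               # forces UTF-8 format
my $after_append = utf8::is_utf8($str) ? 1 : 0;

chop $str;                                        # removes the Thai letter
my $after_chop = utf8::is_utf8($str) ? 1 : 0;     # the flag stays set

printf "flag: %d -> %d -> %d, length %d, ord 0x%02X\n",
    $before_append, $after_append, $after_chop, length($str), ord($str);
```

This kind of check is mainly useful for debugging; code should not normally depend on the internal representation.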
 524  
 525  =cut


Generated: Tue Mar 17 22:47:18 2015 Cross-referenced by PHPXref 0.7.1