Regular Expressions
12. ASCII and Unicode
Most of the time your regular expressions can define all of the necessary characters using the ordinary characters and the common escape characters that we have already looked at. There may be occasions, though, where you want to specify a character for which no special escape character exists. Any ASCII or Unicode character can still be included in a regular expression by giving its code number, using either the escape sequence \xnn for ASCII (where nn is the two-digit hexadecimal number of the character within the ASCII character set currently in use) or \unnnn (where nnnn is the four-digit hexadecimal number of the Unicode character).

For those of you who don't know what ASCII and Unicode are, they are two common schemes that computers use for encoding characters so that they can be stored as numbers (which is all that computers really understand). For example, the letter 'A' in ASCII is represented by the decimal number 65, and the hexadecimal equivalent is 41. This means that A could be entered into a regular expression as \x41, although since A is both shorter and clearer you would have no reason to do so. You would only use the \xnn format when the character you want to test for is not available in a shorter format.

For those of you who don't know what hexadecimal numbers are, they are numbers that use 16 distinct digits instead of the ten that we are all used to. A through F are usually used to represent the additional six digits, and so counting in hexadecimal goes 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10 (where 10 in hexadecimal is really 16 in the numbers you are normally used to). This allows all of the possible values between 0 and 255 that can be stored in a single byte in a computer to be represented as numbers between 00 and FF.
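As a rough illustration of the \x41 example and the hexadecimal counting described above, here is a minimal sketch. It assumes Python and its standard re module, which accepts the \xnn escape; the article itself does not name a language, and the variable names and sample strings are made up for the example.

```python
import re

# 'A' has the decimal code 65, which is 41 in hexadecimal.
code = ord("A")
hex_code = format(code, "02X")

# \x41 in a pattern matches the same character as a literal A.
matches_a = re.fullmatch(r"\x41", "A") is not None

# A case where the escape is actually useful: the ESC control character
# (hexadecimal 1B) has no shorthand escape of its own in Python's re module.
has_esc = re.search(r"\x1B", "plain\x1b[0mtext") is not None

# Hexadecimal 10 is decimal 16, and FF is the largest single-byte value, 255.
sixteen = int("10", 16)
max_byte = int("FF", 16)

print(code, hex_code, matches_a, has_esc, sixteen, max_byte)
```

The raw string prefix (r"...") matters in this sketch: it passes the backslash through to the regular expression engine so that the engine, not the string literal, interprets \x41.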
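The \unnnn escape takes four hexadecimal digits and so covers the 65536 code numbers between 0000 and FFFF, the range that fits in two bytes. Unicode itself defines characters beyond FFFF as well, which a single four-digit escape cannot address.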
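The four-digit Unicode escape can be tried out in the same way. This is a minimal sketch, again assuming Python's re module (which accepts \unnnn in text patterns); the euro sign and the sample string are arbitrary choices for illustration.

```python
import re

# The euro sign has the Unicode code number 20AC (hexadecimal).
euro_hex = format(ord("\u20ac"), "04X")

# \u20AC in the pattern matches that one character wherever it appears.
euro = re.compile(r"\u20AC")
found = euro.search("Price: 20\u20ac")

position = found.start() if found else -1
print(euro_hex, position)
```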

