JYuan Learning Log: Learning Regular Expression

Regex Performance Optimization Tips and Tricks

Use the any (dot) operator sparingly, if you can do it any other way, do it, dot will always bite you...
Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.

http://www.javaworld.com/article/2077757/core-java/optimizing-regular-expressions-in-java.html
The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression.

Simple ways to optimize regular expressions
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches(). Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
Capturing groups incur a small-time penalty each time you use them. If you don't really need to capture the text inside a group, always use non-capturing groups. For example, use "(?:X)" instead of "(X)".

Optimizing greedy and reluctant quantifiers
A greedy quantifier such as "*" or "+" will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters. A reluctant (or lazy) quantifier, on the other hand, will first try to match as few characters in the input string as possible.

.*a .*?a
Possessive quantifiers and independent grouping
<span[^>]*id\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)(.*)(<span[^>]*(id\s*=\s?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*.*)

<span[^>]*id\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)(.*)(<span[^>]*(id\s*=\s?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*.*)

(?m)
String pattern2 = "(?m)User Comments: (\\W)*(\\S)*";

Lookaround constructs
Positive lookahead: "(?=X)"
Negative lookahead: "(?!X)"
Positive lookbehind: "(?<=X)"
Negative lookbehind: "(?<!X)"
they don't actually change the current position in the input string.

http://stackoverflow.com/questions/5854063/how-to-optimize-regular-expression-performance
You can optimize a regex by using atomic grouping or using possessive quantifiers where possible.

Also, if your have stuff like .* or .+ in your regex, which can be real memory/runtime hogs, replace them with (possessive) character classes (again, if possible).
This is some (text|other text)
Another way would be to use a Trie-structure to find the prefixes. This is more robust, but a little more complicated.

http://www.rexegg.com/regex-tricks.html
In alternations, put the most likely cases first
Use \.(?:com|net|biz)\b rather than \.(?:biz|net|com)\b

Expose literal characters
AA* should be faster than A+
th(?:is|at) should be faster than this|that.
Use and Expose Anchors
^.*abc fails faster than .*abc
^(?:abc|def) is preferred to (?:^abc|^def).

http://www.regular-expressions.info/atomic.html
Atomic Grouping (?>group)
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing.

Taking Advantage of Ordered Alternation
Lookaround
They simply test whether their subexpression can and can’t match starting at the current location (lookahead), or ending at the current location (lookbehind).
perl -ne 'print if /(?i)ancyent (?=marinere)/' rime.txt
ack '(?i)ancyent (?=ma)' rime.txt

Negative Lookahead
This means that as you try to match a pattern, you won’t find a given lookahead pattern.
(?i)ancyent (?!marinere)

Positive Lookbehinds
(?i)(?<=ancyent) marinere

Negative Lookbehinds
(?1)(?<!ancyent) marinere

Tools for Working with Regular Expressions
RegexBuddy
http://www.regexbuddy.com/RegexBuddyCookbook.exe
the trial is fully functional for seven days of actual use.

http://www.regexmagic.com/RegexMagicCookbook.exe the trial is fully functional for seven days of actual use.
RegexPal
it supports only the JavaScript regex flavor.
RegexMagic: Regular Expression Generator
Expresso
http://www.ultrapico.com/Expresso.htm
http://www.regexplanet.com/
http://www.regexplanet.com/advanced/java/index.html
as a Java string
http://www.myregexp.com/

Capture and Name Parts of the Match
Create another regular expression that matches “magical” dates in yyyy-mm-dd format. A date is magical if the year minus the century, the month, and the day of the month are all the same numbers. For example, 2008-08-08 is a magical date. Capture the magical number (08 in the example), and label it “magic.”

Named capture
\b(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)\b
Named backreferences
\b\d\d(?<magic>\d\d)-\k<magic>-\k<magic>\b
these recipes use numbered capturing groups and numbered backreferences. Each group automatically gets a number, which you use for the backreference. Named groups make your regular expression more readable and easier to maintain. Inserting a capturing group into an existing regex can change the numbers assigned to all the capturing groups. Names that you assign remain the same.

Java 7 also copied the .NET syntax, but only the variant using angle brackets.
don’t mix named and numbered groups in the same regex.
Groups with the same name - Perl 5.10, Ruby 1.9, and .NET Only
When a regular expression uses alternation to find different variations of certain text, using capturing groups with the same name makes it easy to extract parts from the match, regardless of which alternative actually matched the text.

Match a pair of and XHTML tags and the text between them. The text between the tags can include other XHTML tags.
.*?
All the quantifiers discussed are greedy, meaning they try to repeat as many times as possible, giving back only when required to allow the remainder of the regular expression to match.

The quantifiers ‹*› and ‹*?› allow all the same regular expression matches. The only difference is the order in which the possible matches are tried. The greedy quantifier will find the longest possible match. The lazy quantifier will find the shortest possible match.

\b\d+\b: be more specific, use \b
\b\d++\b
Eliminate Needless Backtracking
\b\d++\b ==> possessive quantifier
\b(?>\d+)\b ==> atomic group
A possessive quantifier is similar to a greedy quantifier: it tries to repeat as many times as possible. The difference is that a possessive quantifier will never give back, not even when giving back is the only way that the remainder of the regular expression could match. Possessive quantifiers do not keep backtracking positions.
You can make any quantifier possessive by placing a plus sign after it. For example, ‹*+›, ‹++›, ‹?+›, and ‹{7,42}+› are all possessive quantifiers.

Wrapping a greedy quantifier inside an atomic group has the exact same effect as using a possessive quantifier. When the regex engine exits the atomic group, all backtracking positions remembered by quantifiers and alternation inside the group are thrown away. The syntax is ‹(?>?)›, where ‹?› is any regular expression. An atomic group is essentially a noncapturing group, with the extra job of refusing to backtrack. The question mark is not a quantifier; the opening bracket simply consists of the three characters ‹(?>›.
\b\d++\b
123abc 456
the possessive quantifier as failing to remember backtracking positions, and the atomic group as throwing them away.
In many flavors, ‹x++› is merely syntactic sugar for ‹(?>x+)›, and both are implemented in exactly the same way. Whether the engine never remembers backtracking positions or throws them away later is irrelevant for the final outcome of the match attempt.
‹\w++\d++›, which is the same as ‹(?>\w+)(?>\d+)›, will not match abc123. ‹\w++› matches abc123 entirely
‹(?>\w+\d+)› has two greedy quantifiers inside the same atomic group. Within the atomic group, backtracking occurs normally. Backtracking positions are thrown away only when the engine exits the whole group.
Possessive quantifiers and atomic grouping don’t just optimize regular expressions. They can alter the matches found by a regular expression by eliminating those that would be reached through backtracking.

Prevent Runaway Repetition - Using atomic group - catastrophic backtracking
Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.
Correct Version:
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Wrong version:
<html>.*?<head>.*?<title>.*?</title>?.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
you can solve this problem by doing a literal text search for each of the tags one by one, searching for the next tag through the remainder of the subject text after the one last found.

The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched.

Catastrophic backtracking is an instance of a phenomenon known as a combinatorial explosion, in which several orthogonal conditions intersect and all combinations have to be tried. You could also say that the regex is a Cartesian product of the various repetition operators.

The solution is to use atomic grouping to prevent needless backtracking. There is no need for the sixth ‹.*?› to expand after ‹</body>› has matched. If ‹</html>› fails, expanding the sixth lazy dot will not magically produce a closing html tag.
To make a quantified regular expression token stop when the following delimiter matches, place both the quantified part of the regex and the delimiter together in an atomic group: ‹(?>.*?</body>)›. Now the regex engine throws away all the matching positions for ‹.*?</body>› when ‹</body>› is found. If ‹</html>› later fails, the regex engine has forgotten about ‹.*?</body>›, and no further expansion will occur.

Essentially, you have to watch out when two or more parts of the regular expression can match the same text. In these cases, you may need atomic grouping to make sure the regex engine doesn’t try all possible ways of dividing the subject text between those two parts of the regex. Always test your regex on (long) test subjects that contain text that can be partially but not entirely matched by the regex.

Wrong: (x+x+)+y
xx+y Best: (?>x+)y

Test for a Match Without Adding It to the Overall Match
Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My cat is furry, the only valid match should be cat.
(?<=)\w+(?=)

Lookbehind is a different story. Regular expression software has always been designed to search the text from left to right only. Searching backward is often implemented as a bit of a hack: the regex engine determines how many characters you put inside the lookbehind, jumps back that many characters, and then compares the text in the lookbehind with the text in the subject from left to right.

Java takes lookbehind one step further. Java allows any finite-length regular expression inside lookbehind. This means you can use anything except the infinite quantifiers ‹*›, ‹+›, and ‹{42,}› inside lookbehind. Internally, Java’s regex engine calculates the minimum and maximum length of the text that could possibly be matched by the part of the regex in the lookbehind. It then jumps back the minimum number of characters, and applies the regex in the lookbehind from left to right. If this fails, the engine jumps back one more character and tries again, until either the lookbehind matches or the maximum number of characters has been tried.

If the lookahead is at the end
of the regex, you will indeed end up with capturing groups that match text not matched.
The only situation in which the atomic nature of lookaround can alter the overall regex
match is when you use a backreference outside the lookaround to a capturing group
created inside the lookaround.

(?=(\d+))\w+\1
Solution Without Lookbehind
()(\w+)(?=)

Using backreference

\$%\\*\$1\\1
\b(?:one|two|three)\b
Without the word boundaries, however, it might be important to put longer
words first; otherwise, you’d never find “awesome” when searching for ‹awe|awe
some›

placing the
words that are most likely to be found in the subject text near the beginning
of the list.

‹\b(?:one|t(?:wo|hree))\b›. Don’t go crazy with such hand-tuning, though.
Most regex engines try to perform this optimization for you automatically
function matchWords(subject, words) {
var regexMetachars = /[(){[*+?.\\^$|]/g;
for (var i = 0; i < words.length; i++) {
words[i] = words[i].replace(regexMetachars, "\\$&");
}
var regex = new RegExp("\\b(?:" + words.join("|") + ")\\b", "gi");
return subject.match(regex) || [];
}

Find Similar Words
\bcolou?r\b
\b[bcr]at\b
You could do the same thing using ‹\b(?:b|c|r)at\b›, ‹\b(?:bat|cat|
rat)\b›, or ‹\bbat\b|\bcat\b|\brat\b›.
Not only do character classes provide a more compact and readable
syntax (thanks to being able to drop all the vertical bars and use ranges such as A–Z),
most regex engines also provide far superior optimization for character classes. Alternation
using vertical bars requires the engine to use the computationally expensive
backtracking algorithm, whereas character classes use a simpler search approach.
\b\w*phobia\b
Steve, Steven, or Stephen
\bSte(?:ven?|phen)\b
This improves efficiency (and brevity) versus the equivalent
‹\bSte(?:ve|ven|phen)\b›. The same principle explains why the literal string ‹Ste› appears
at the front of the regular expression, rather than being repeated three times as
with ‹\b(?:Steve|Steven|Stephen)\b› or ‹\bSteve\b|\bSteven\b|\bStephen\b›.
\breg(?:ular●expressions?|ex(?:ps?|e[sn])?)\b
Use word boundaries to match complete words

Character classes are only capable of matching
one character at a time from the characters specified within them—no exceptions.
Following are two of the most common ways that character classes are misused:
Putting words in character classes
Trying to use the alternation operator within character classes

Find All Except a Specific Word
You want to use a regular expression to match any complete word except cat.
Catwoman, vindicate, and other words that merely contain the letters “cat” should be
matched—just not cat.
A negative lookahead can help you rule out specific words, and is key to this next regex:
\b(?!cat\b)\w+

Find words that don’t contain another word
\b(?:(?!cat)\w)+\b
data: categorically match any word except cat

Find Any Word Not Followed by a Specific Word
You want to match any word that is not immediately followed by the word cat, ignoring
any whitespace, punctuation, or other nonword characters that appear in between
\b\w+\b(?!\W+cat\b)

word boundaries (‹\b›) and the word character
token (‹\w›) work together to match a complete word.
the ‹\W+› matches one or more nonword characters,
such as whitespace and punctuation

\b(?!cat\b)\w+\b(?!\W+cat\b)
\b\w+\b(?=\W+cat\b)

Find Any Word Not Preceded by a Specific Word
You want to match any word that is not immediately preceded by the word cat, ignoring
any whitespace, punctuation, or other nonword characters that come between

Words not preceded by “cat”
(?<!\bcat\W+)\b\w+
Limited number of separating nonword characters:
(?<!\bcat\W{1,9})\b\w+

Find Words Near Each Other
If you’re searching for just two different words, you can combine two regular
expressions—one that matches word1 before word2, and another that flips the order of
the words.
\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

The shorthand character classes that are used to match word characters and nonword
characters (‹\w› and ‹\W›, respectively) follow the quirky regular expression definition
of which characters words are composed of (letters, numbers, and underscore).

Match three or more words near each other
Exponentially increasing permutations.

Find Repeated Words: negative lookahead
\b([A-Z]+)\s+\1\b
Match Complete Lines That Contain a Word
^.*\berror\b.*$
Match Complete Lines That Do Not Contain a Word: using negative lookahead
^(?:(?!\berror\b).)*$
a negative lookahead
and a dot are repeated together using a noncapturing group. This is necessary to ensure
that the regex ‹\berror\b› fails at every position in the line

Trim Leading and Trailing Whitespace
Leading whitespace:
\A\s+
^\s+
Trailing whitespace:
\s+\Z
\s+$

$string =~ s/^\s+//;
$string =~ s/\s+$//;

Escape Regular Expression Metacharacters
Java Pattern.quote(str)Regex Performance Optimization Tips and Tricks

Use the any (dot) operator sparingly, if you can do it any other way, do it, dot will always bite you...
Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.

http://www.javaworld.com/article/2077757/core-java/optimizing-regular-expressions-in-java.html
The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression.

Simple ways to optimize regular expressions
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches(). Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
Capturing groups incur a small-time penalty each time you use them. If you don't really need to capture the text inside a group, always use non-capturing groups. For example, use "(?:X)" instead of "(X)".

Optimizing greedy and reluctant quantifiers
A greedy quantifier such as "*" or "+" will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters. A reluctant (or lazy) quantifier, on the other hand, will first try to match as few characters in the input string as possible.

.*a .*?a
Possessive quantifiers and independent grouping
<span[^>]*id\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)(.*)(<span[^>]*(id\s*=\s?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*.*)

<span[^>]*id\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)(.*)(<span[^>]*(id\s*=\s?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*.*)

(?m)
String pattern2 = "(?m)User Comments: (\\W)*(\\S)*";

Lookaround constructs
Positive lookahead: "(?=X)"
Negative lookahead: "(?!X)"
Positive lookbehind: "(?<=X)"
Negative lookbehind: "(?<!X)"
they don't actually change the current position in the input string.

http://stackoverflow.com/questions/5854063/how-to-optimize-regular-expression-performance
You can optimize a regex by using atomic grouping or using possessive quantifiers where possible.

Also, if your have stuff like .* or .+ in your regex, which can be real memory/runtime hogs, replace them with (possessive) character classes (again, if possible).
This is some (text|other text)
Another way would be to use a Trie-structure to find the prefixes. This is more robust, but a little more complicated.

http://www.rexegg.com/regex-tricks.html
In alternations, put the most likely cases first
Use \.(?:com|net|biz)\b rather than \.(?:biz|net|com)\b

Expose literal characters
AA* should be faster than A+
th(?:is|at) should be faster than this|that.
Use and Expose Anchors
^.*abc fails faster than .*abc
^(?:abc|def) is preferred to (?:^abc|^def).

http://www.regular-expressions.info/atomic.html
Atomic Grouping (?>group)
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing.

Taking Advantage of Ordered Alternation
Lookaround
They simply test whether their subexpression can and can’t match starting at the current location (lookahead), or ending at the current location (lookbehind).
perl -ne 'print if /(?i)ancyent (?=marinere)/' rime.txt
ack '(?i)ancyent (?=ma)' rime.txt

Negative Lookahead
This means that as you try to match a pattern, you won’t find a given lookahead pattern.
(?i)ancyent (?!marinere)

Positive Lookbehinds
(?i)(?<=ancyent) marinere

Negative Lookbehinds
(?1)(?<!ancyent) marinere

Tools for Working with Regular Expressions
RegexBuddy
http://www.regexbuddy.com/RegexBuddyCookbook.exe
the trial is fully functional for seven days of actual use.

http://www.regexmagic.com/RegexMagicCookbook.exe the trial is fully functional for seven days of actual use.
RegexPal
it supports only the JavaScript regex flavor.
RegexMagic: Regular Expression Generator
Expresso
http://www.ultrapico.com/Expresso.htm
http://www.regexplanet.com/
http://www.regexplanet.com/advanced/java/index.html
as a Java string
http://www.myregexp.com/

Capture and Name Parts of the Match
Create another regular expression that matches “magical” dates in yyyy-mm-dd format. A date is magical if the year minus the century, the month, and the day of the month are all the same numbers. For example, 2008-08-08 is a magical date. Capture the magical number (08 in the example), and label it “magic.”

Named capture
\b(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)\b
Named backreferences
\b\d\d(?<magic>\d\d)-\k<magic>-\k<magic>\b
these recipes use numbered capturing groups and numbered backreferences. Each group automatically gets a number, which you use for the backreference. Named groups make your regular expression more readable and easier to maintain. Inserting a capturing group into an existing regex can change the numbers assigned to all the capturing groups. Names that you assign remain the same.

Java 7 also copied the .NET syntax, but only the variant using angle brackets.
don’t mix named and numbered groups in the same regex.
Groups with the same name - Perl 5.10, Ruby 1.9, and .NET Only
When a regular expression uses alternation to find different variations of certain text, using capturing groups with the same name makes it easy to extract parts from the match, regardless of which alternative actually matched the text.

Match a pair of and XHTML tags and the text between them. The text between the tags can include other XHTML tags.
.*?
All the quantifiers discussed are greedy, meaning they try to repeat as many times as possible, giving back only when required to allow the remainder of the regular expression to match.

The quantifiers ‹*› and ‹*?› allow all the same regular expression matches. The only difference is the order in which the possible matches are tried. The greedy quantifier will find the longest possible match. The lazy quantifier will find the shortest possible match.

\b\d+\b: be more specific, use \b
\b\d++\b
Eliminate Needless Backtracking
\b\d++\b ==> possessive quantifier
\b(?>\d+)\b ==> atomic group
A possessive quantifier is similar to a greedy quantifier: it tries to repeat as many times as possible. The difference is that a possessive quantifier will never give back, not even when giving back is the only way that the remainder of the regular expression could match. Possessive quantifiers do not keep backtracking positions.
You can make any quantifier possessive by placing a plus sign after it. For example, ‹*+›, ‹++›, ‹?+›, and ‹{7,42}+› are all possessive quantifiers.

Wrapping a greedy quantifier inside an atomic group has the exact same effect as using a possessive quantifier. When the regex engine exits the atomic group, all backtracking positions remembered by quantifiers and alternation inside the group are thrown away. The syntax is ‹(?>?)›, where ‹?› is any regular expression. An atomic group is essentially a noncapturing group, with the extra job of refusing to backtrack. The question mark is not a quantifier; the opening bracket simply consists of the three characters ‹(?>›.
\b\d++\b
123abc 456
the possessive quantifier as failing to remember backtracking positions, and the atomic group as throwing them away.
In many flavors, ‹x++› is merely syntactic sugar for ‹(?>x+)›, and both are implemented in exactly the same way. Whether the engine never remembers backtracking positions or throws them away later is irrelevant for the final outcome of the match attempt.
‹\w++\d++›, which is the same as ‹(?>\w+)(?>\d+)›, will not match abc123. ‹\w++› matches abc123 entirely
‹(?>\w+\d+)› has two greedy quantifiers inside the same atomic group. Within the atomic group, backtracking occurs normally. Backtracking positions are thrown away only when the engine exits the whole group.
Possessive quantifiers and atomic grouping don’t just optimize regular expressions. They can alter the matches found by a regular expression by eliminating those that would be reached through backtracking.

Prevent Runaway Repetition - Using atomic group - catastrophic backtracking
Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.
Correct Version:
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Wrong version:
<html>.*?<head>.*?<title>.*?</title>?.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
you can solve this problem by doing a literal text search for each of the tags one by one, searching for the next tag through the remainder of the subject text after the one last found.

The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched.

Catastrophic backtracking is an instance of a phenomenon known as a combinatorial explosion, in which several orthogonal conditions intersect and all combinations have to be tried. You could also say that the regex is a Cartesian product of the various repetition operators.

The solution is to use atomic grouping to prevent needless backtracking. There is no need for the sixth ‹.*?› to expand after ‹</body>› has matched. If ‹</html>› fails, expanding the sixth lazy dot will not magically produce a closing html tag.
To make a quantified regular expression token stop when the following delimiter matches, place both the quantified part of the regex and the delimiter together in an atomic group: ‹(?>.*?</body>)›. Now the regex engine throws away all the matching positions for ‹.*?</body>› when ‹</body>› is found. If ‹</html>› later fails, the regex engine has forgotten about ‹.*?</body>›, and no further expansion will occur.

Essentially, you have to watch out when two or more parts of the regular expression can match the same text. In these cases, you may need atomic grouping to make sure the regex engine doesn’t try all possible ways of dividing the subject text between those two parts of the regex. Always test your regex on (long) test subjects that contain text that can be partially but not entirely matched by the regex.

Wrong: (x+x+)+y
xx+y Best: (?>x+)y

Test for a Match Without Adding It to the Overall Match
Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My cat is furry, the only valid match should be cat.
(?<=)\w+(?=)

Lookbehind is a different story. Regular expression software has always been designed to search the text from left to right only. Searching backward is often implemented as a bit of a hack: the regex engine determines how many characters you put inside the lookbehind, jumps back that many characters, and then compares the text in the lookbehind with the text in the subject from left to right.

Java takes lookbehind one step further. Java allows any finite-length regular expression inside lookbehind. This means you can use anything except the infinite quantifiers ‹*›, ‹+›, and ‹{42,}› inside lookbehind. Internally, Java’s regex engine calculates the minimum and maximum length of the text that could possibly be matched by the part of the regex in the lookbehind. It then jumps back the minimum number of characters, and applies the regex in the lookbehind from left to right. If this fails, the engine jumps back one more character and tries again, until either the lookbehind matches or the maximum number of characters has been tried.

If the lookahead is at the end
of the regex, you will indeed end up with capturing groups that match text not matched.
The only situation in which the atomic nature of lookaround can alter the overall regex
match is when you use a backreference outside the lookaround to a capturing group
created inside the lookaround.

(?=(\d+))\w+\1
Solution Without Lookbehind
()(\w+)(?=)

Using backreference

\$%\\*\$1\\1
\b(?:one|two|three)\b
Without the word boundaries, however, it might be important to put longer
words first; otherwise, you’d never find “awesome” when searching for ‹awe|awe
some›

placing the
words that are most likely to be found in the subject text near the beginning
of the list.

‹\b(?:one|t(?:wo|hree))\b›. Don’t go crazy with such hand-tuning, though.
Most regex engines try to perform this optimization for you automatically
function matchWords(subject, words) {
var regexMetachars = /[(){[*+?.\\^$|]/g;
for (var i = 0; i < words.length; i++) {
words[i] = words[i].replace(regexMetachars, "\\$&");
}
var regex = new RegExp("\\b(?:" + words.join("|") + ")\\b", "gi");
return subject.match(regex) || [];
}

Find Similar Words
\bcolou?r\b
\b[bcr]at\b
You could do the same thing using ‹\b(?:b|c|r)at\b›, ‹\b(?:bat|cat|
rat)\b›, or ‹\bbat\b|\bcat\b|\brat\b›.
Not only do character classes provide a more compact and readable
syntax (thanks to being able to drop all the vertical bars and use ranges such as A–Z),
most regex engines also provide far superior optimization for character classes. Alternation
using vertical bars requires the engine to use the computationally expensive
backtracking algorithm, whereas character classes use a simpler search approach.
\b\w*phobia\b
Steve, Steven, or Stephen
\bSte(?:ven?|phen)\b
This improves efficiency (and brevity) versus the equivalent
‹\bSte(?:ve|ven|phen)\b›. The same principle explains why the literal string ‹Ste› appears
at the front of the regular expression, rather than being repeated three times as
with ‹\b(?:Steve|Steven|Stephen)\b› or ‹\bSteve\b|\bSteven\b|\bStephen\b›.
\breg(?:ular●expressions?|ex(?:ps?|e[sn])?)\b
Use word boundaries to match complete words

Character classes are only capable of matching
one character at a time from the characters specified within them—no exceptions.
Following are two of the most common ways that character classes are misused:
Putting words in character classes
Trying to use the alternation operator within character classes

Find All Except a Specific Word
You want to use a regular expression to match any complete word except cat.
Catwoman, vindicate, and other words that merely contain the letters “cat” should be
matched—just not cat.
A negative lookahead can help you rule out specific words, and is key to this next regex:
\b(?!cat\b)\w+

Find words that don’t contain another word
\b(?:(?!cat)\w)+\b
data: categorically match any word except cat

Find Any Word Not Followed by a Specific Word
You want to match any word that is not immediately followed by the word cat, ignoring
any whitespace, punctuation, or other nonword characters that appear in between
\b\w+\b(?!\W+cat\b)

word boundaries (‹\b›) and the word character
token (‹\w›) work together to match a complete word.
the ‹\W+› matches one or more nonword characters,
such as whitespace and punctuation

\b(?!cat\b)\w+\b(?!\W+cat\b)
\b\w+\b(?=\W+cat\b)

Find Any Word Not Preceded by a Specific Word
You want to match any word that is not immediately preceded by the word cat, ignoring
any whitespace, punctuation, or other nonword characters that come between

Words not preceded by “cat”
(?<!\bcat\W+)\b\w+
Limited number of separating nonword characters:
(?<!\bcat\W{1,9})\b\w+

Find Words Near Each Other
If you’re searching for just two different words, you can combine two regular
expressions—one that matches word1 before word2, and another that flips the order of
the words.
\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

The shorthand character classes that are used to match word characters and nonword
characters (‹\w› and ‹\W›, respectively) follow the quirky regular expression definition
of which characters words are composed of (letters, numbers, and underscore).

Match three or more words near each other
Exponentially increasing permutations.

Find Repeated Words: negative lookahead
\b([A-Z]+)\s+\1\b
Match Complete Lines That Contain a Word
^.*\berror\b.*$
Match Complete Lines That Do Not Contain a Word: using negative lookahead
^(?:(?!\berror\b).)*$
a negative lookahead
and a dot are repeated together using a noncapturing group. This is necessary to ensure
that the regex ‹\berror\b› fails at every position in the line

Trim Leading and Trailing Whitespace
Leading whitespace:
\A\s+
^\s+
Trailing whitespace:
\s+\Z
\s+$

$string =~ s/^\s+//;
$string =~ s/\s+$//;

Escape Regular Expression Metacharacters
Java Pattern.quote(str)

JYuan Learning Log

Sunday, February 2, 2014

Learning Regular Expression

No comments:

Post a Comment