Words, Lines, and Special Characters
This chapter shows a variety of regular expression constructs and techniques in action, which may help you solve specific problems you're facing.
Find a specific word
Problem
Find all occurrences of the word cat, case insensitively, without matching longer words that contain it, such as hellcat, application, or Catwoman.
Solution
/\bcat\b/i
Discussion
A problem can occur when working with international text in JavaScript, because this regular expression flavor considers only ASCII letters, digits, and the underscore when deciding where a word boundary is. In other words, word boundaries are found only at the positions between a match of /[^A-Za-z0-9_]|^/ and /[A-Za-z0-9_]/, or between /[A-Za-z0-9_]/ and /[^A-Za-z0-9_]|$/. This means that \b is not helpful when searching within text that contains accented letters or words that use non-Latin scripts. For example, /\büber\b/ will find a match within darüber, but not within dar über.
The problem occurs because ü is considered a non-word character. A word boundary is therefore found between the two characters rü, while no word boundary is found between a space character and ü, because together they form a contiguous sequence of non-word characters.
For this reason, we may have to use a more complex workaround:
/** 8-bit-wide letter characters */
var pL = 'A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF';
var pattern = '([^{L}]|^)cat([^{L}]|$)'.replace(/{L}/g, pL);
var regex = new RegExp(pattern, 'gi');

/** replace cat with dog, and put back any additional matched characters */
subject = subject.replace(regex, '$1dog$2');
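The boundary behavior described above is easy to check for yourself. The following is only a minimal sketch: the sample strings and variable names are made up for illustration, and the letter class covers just the 8-bit range used in the workaround.

```javascript
// \b only knows ASCII word characters, so ü counts as a non-word character.
var uberRegex = /\büber\b/;
var insideWord = uberRegex.test('darüber');  // a boundary exists between r and ü
var afterSpace = uberRegex.test('dar über'); // space and ü are both non-word characters

// Hypothetical sample text: "écat" should not be treated as the word cat.
var letters = 'A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF';
var catRegex = new RegExp('([^' + letters + ']|^)cat([^' + letters + ']|$)', 'gi');
var sample = 'The cat sat; écat stays.';
var replaced = sample.replace(catRegex, '$1dog$2');

console.log(insideWord, afterSpace); // true false
console.log(replaced);               // The dog sat; écat stays.
```

Because the pattern consumes the character on each side of cat, two occurrences separated by a single character cannot both be replaced in one pass.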
Find any of multiple words
Problem
You want to find any one out of a list of words, without having to search through the subject string multiple times.
Solution
The simplest way is to just alternate between the words you want to match:
/\b(?:one|two|three)\b/
A reusable way is to define a function:
function matchWords(subject, words) {
  var regexMetachars = /[(){[*+?.\\^$|]/g;
  var wordLen = words.length;

  /**
   * Any regex metacharacters within the accepted words are escaped
   * with a backslash before searching.
   */
  for (var i = 0; i < wordLen; i++) {
    words[i] = words[i].replace(regexMetachars, '\\$&');
  }

  var regex = new RegExp('\\b(?:' + words.join('|') + ')\\b', 'gi');
  return subject.match(regex) || [];
}

matchWords(subject, ['one', 'two', 'three']);
Discussion
As the regex engine attempts to match each word in the list from left to right, you may find a slight performance gain by placing words that are most likely to be found in the subject text near the beginning of the list.
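As a quick illustration of the alternation, here is a small sketch that runs it against a made-up sentence (the text and variable names are assumptions for the example only):

```javascript
// The g flag collects every match; the i flag ignores case.
var numberWords = /\b(?:one|two|three)\b/gi;
var rhyme = 'One, two, buckle my shoe; three, four, shut the door. Someone phoned.';

// "Someone" is skipped because there is no word boundary before its "one".
var foundWords = rhyme.match(numberWords) || [];

console.log(foundWords); // [ 'One', 'two', 'three' ]
```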
Find similar words
Problem
There are several common cases where you want to find similar words.
Solution
Color or colour
/\bcolou?r\b/i
Bat, cat, or rat
/\b[bcr]at\b/i
Words ending with "phobia"
/\b\w*phobia\b/i
Steve, Steven, or Stephen
/\bSte(?:ven?|phen)\b/i
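To see what each of these patterns accepts and rejects, you may try them against a few candidate words. This is a minimal sketch; the helper matchingWords and the word lists are hypothetical and exist only for this example.

```javascript
// Hypothetical helper: returns the candidate words that the regex matches.
function matchingWords(regex, candidates) {
  return candidates.filter(function (word) {
    return regex.test(word);
  });
}

var colours = matchingWords(/\bcolou?r\b/i, ['color', 'Colour', 'colr', 'colors']);
var rhymes = matchingWords(/\b[bcr]at\b/i, ['bat', 'Cat', 'rat', 'hat', 'brat']);
var phobias = matchingWords(/\b\w*phobia\b/i, ['phobia', 'arachnophobia', 'phobias']);
var steves = matchingWords(/\bSte(?:ven?|phen)\b/i, ['Steve', 'Steven', 'Stephen', 'Stefan', 'Stevens']);

console.log(colours); // [ 'color', 'Colour' ]
console.log(rhymes);  // [ 'bat', 'Cat', 'rat' ]
console.log(phobias); // [ 'phobia', 'arachnophobia' ]
console.log(steves);  // [ 'Steve', 'Steven', 'Stephen' ]
```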
Find all except a specific word
Problem
You may want to use a regular expression to match any word except cat.
Solution
You can use a negative lookahead in JavaScript:
/\b(?!cat\b)\w+\b/i
Discussion
If you instead want to match any word that does not contain cat anywhere within it, you need a slightly different pattern: /\b(?:(?!cat)\w)+\b/i.
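The difference between the two patterns shows up when the subject contains words that merely include cat. A small sketch, using a made-up subject string and the g flag to collect all matches:

```javascript
var animals = 'cat catalog concat dog';

// Rejects only the exact word "cat".
var notTheWordCat = animals.match(/\b(?!cat\b)\w+\b/gi) || [];

// Rejects every word that contains "cat" anywhere.
var withoutCatInside = animals.match(/\b(?:(?!cat)\w)+\b/gi) || [];

console.log(notTheWordCat);    // [ 'catalog', 'concat', 'dog' ]
console.log(withoutCatInside); // [ 'dog' ]
```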
Find any word not followed by a specific word
Problem
You may want to match any word that is not immediately followed by the word cat.
Solution
/\b\w+\b(?!\W+cat\b)/i
Discussion
If you want to only match words that are followed by cat without including cat, you can use another regular expression: /\b\w+\b(?=\W+cat\b)/i.
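The negative and positive lookahead variants split the words of a subject into two complementary groups. The following sketch uses an invented sentence to show this:

```javascript
var sentence = 'My cat is a cat person';

// Words NOT immediately followed by the word "cat".
var notBeforeCat = sentence.match(/\b\w+\b(?!\W+cat\b)/gi) || [];

// Words immediately followed by the word "cat" (the word "cat" itself is not consumed).
var beforeCat = sentence.match(/\b\w+\b(?=\W+cat\b)/gi) || [];

console.log(notBeforeCat); // [ 'cat', 'is', 'cat', 'person' ]
console.log(beforeCat);    // [ 'My', 'a' ]
```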
Find any word not preceded by a specific word
Problem
You may want to match any word that is not immediately preceded by the word cat.
Solution
JavaScript engines that predate ES2018 do not support lookbehind, so in those environments we have to simulate it:
var subject = 'My cat is fluffy';
var mainRegex = /\b\w+/g;
var lookbehind = /\bcat\W+$/i;
var lookbehindType = false; /** false for negative, true for positive */
var matches = [];
var match;
var leftContext;

while (match = mainRegex.exec(subject)) {
  leftContext = subject.substring(0, match.index);

  if (lookbehindType === lookbehind.test(leftContext)) {
    matches.push(match[0]);
  } else {
    mainRegex.lastIndex = match.index + 1;
  }
}

console.log(matches); /** => ["My", "cat", "fluffy"] */
Discussion
To simulate lookbehind in JavaScript, we need two patterns. The first is the pattern that would go inside the lookbehind, /\bcat\W+/, with $ appended so that it must match at the very end of the text that precedes each candidate match. The second is the pattern that comes after the lookbehind, /\b\w+/. If you're using the /m option for multiline matching, you may prefer to use $(?!\s) rather than $, so that it can match only at the very end of the subject text.
The lookbehindType variable controls whether we're emulating positive or negative lookbehind: true for positive and false for negative. Once the main pattern matches, the part of the subject text before the match is copied into a new string variable named leftContext.
By comparing the result of the lookbehind test to lookbehindType, we can determine whether the match meets the complete criteria for a successful match. If it does, the match is stored in the matches array. If not, lastIndex is moved to one character past the start of the rejected match, so that the next search resumes from there.
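For comparison, engines that implement ES2018 or later accept lookbehind syntax directly. The following is a sketch under that assumption, not a replacement for the simulation above when older browsers must be supported:

```javascript
// Assumes an ES2018+ engine; older engines throw a SyntaxError on (?<! ... ).
var fluffySubject = 'My cat is fluffy';

// Negative lookbehind: words not immediately preceded by the word "cat".
var notAfterCat = fluffySubject.match(/(?<!\bcat\W+)\b\w+/gi) || [];

// Positive lookbehind: words immediately preceded by the word "cat".
var afterCat = fluffySubject.match(/(?<=\bcat\W+)\b\w+/gi) || [];

console.log(notAfterCat); // [ 'My', 'cat', 'fluffy' ]
console.log(afterCat);    // [ 'is' ]
```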
Find words near each other
Problem
You want to find two words, "word1" and "word2", that appear near each other in either order, with at most five other words between them.
Solution
/\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b/i
Discussion
If you simply want to test whether a list of words can be found anywhere in a subject string, you can use the regular expression: /^(?=[\s\S]*?\bword1\b)(?=[\s\S]*?\bword2\b)[\s\S]*/i.
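The two patterns answer different questions, which the following sketch tries to show with a few made-up subject strings (the variable names are assumptions for this example):

```javascript
var nearRegex = /\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b/i;
var bothRegex = /^(?=[\s\S]*?\bword1\b)(?=[\s\S]*?\bword2\b)[\s\S]*/i;

var closeTogether = nearRegex.test('word1 a b c word2');        // three words in between
var reversedOrder = nearRegex.test('word2, then word1');        // order does not matter
var tooFarApart = nearRegex.test('word1 a b c d e f word2');    // six words in between
var bothPresent = bothRegex.test('word1 a b c d e f word2');    // distance is ignored
var oneMissing = bothRegex.test('only word1 here');             // word2 is absent

console.log(closeTogether, reversedOrder, tooFarApart); // true true false
console.log(bothPresent, oneMissing);                   // true false
```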
Find repeated words
Problem
You're editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with "The the". Besides, you may also want to allow different amounts of whitespace between words, even if it causes the words to extend across more than one line.
Solution
/\b([A-Z]+)\s+\1\b/i
Discussion
\s+ matches one or more whitespace characters, such as spaces, tabs, or line breaks. If you want to restrict the characters that can separate repeated words to horizontal whitespace without line breaks, replace \s with [ \t\xA0], in which \xA0 matches a no-break space (U+00A0, the character produced by the HTML entity &nbsp;).
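The effect of the two whitespace choices can be seen in a short sketch. The sample text is invented, and the g flag is added so that every doubled word is reported:

```javascript
var draft = 'Paris in the\nthe spring, and The the end';

// \s+ lets the repeated words be separated by a line break.
var anyWhitespace = draft.match(/\b([A-Z]+)\s+\1\b/gi) || [];

// [ \t\xA0]+ only accepts horizontal whitespace between the words.
var horizontalOnly = draft.match(/\b([A-Z]+)[ \t\xA0]+\1\b/gi) || [];

// Keeping only the first word of each pair removes the repetition.
var cleaned = draft.replace(/\b([A-Z]+)\s+\1\b/gi, '$1');

console.log(anyWhitespace);  // [ 'the\nthe', 'The the' ]
console.log(horizontalOnly); // [ 'The the' ]
console.log(cleaned);        // Paris in the spring, and The end
```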
Remove duplicate lines
Problem
You have a log file, database query output, or some other type of file or string with duplicate lines, and you want to remove them.
Solution
There are three regex-based approaches that can be helpful:
Option 1: Sort the lines and remove adjacent duplicates
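function replaceDuplicate(content) {
  /**
   * sort the content
   * ...
   */

  /**
   * remove adjacent duplicates with the following regular expression;
   * the g and m flags make it apply to every line, and the trailing $
   * prevents a match against a line that merely starts with the same text
   */
  return content.replace(/^(.*)(?:(?:\r?\n|\r)\1)+$/gm, '$1');
}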
Option 2: Keep the last occurrence of each duplicate line in an unsorted file
function replaceDuplicate(content) {
  /** remove every line that appears again later in the content */
  return content.replace(/^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)/gm, '');
}
Option 3: Keep the first occurrence of each duplicate line in an unsorted file
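function replaceDuplicate(content) {
  var regex = /^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+/gm;
  var previous;

  /** a single pass can miss interleaved duplicates, so repeat until nothing changes */
  do {
    previous = content;
    content = content.replace(regex, '$1$2');
  } while (content !== previous);

  return content;
}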
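To make options 1 and 3 concrete, here is a minimal end-to-end sketch. The function names, the sample input, the '\n' line separator, and the default sort order are all assumptions made for this example; the regexes carry the g and m flags so that they apply to every line.

```javascript
// Option 1 sketch: sort the lines first, then collapse adjacent duplicates.
function dedupeSorted(content) {
  var sorted = content.split('\n').sort().join('\n');
  return sorted.replace(/^(.*)(?:(?:\r?\n|\r)\1)+$/gm, '$1');
}

// Option 3 sketch: keep the first occurrence and preserve the original order.
// One pass can miss interleaved duplicates, so repeat until nothing changes.
function dedupeKeepFirst(content) {
  var regex = /^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+/gm;
  var previous;
  do {
    previous = content;
    content = content.replace(regex, '$1$2');
  } while (content !== previous);
  return content;
}

var logLines = 'b\na\nb\na\nc';
var sortedUnique = dedupeSorted(logLines);
var orderedUnique = dedupeKeepFirst(logLines);

console.log(JSON.stringify(sortedUnique));  // "a\nb\nc"
console.log(JSON.stringify(orderedUnique)); // "b\na\nc"
```

For large inputs, a plain loop over the lines with a lookup object is likely to be faster than the backtracking these regexes require.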
Match complete lines that contain a word
Problem
You want to match every line that contains the word error anywhere within it.
Solution
/^.*\berror\b.*$/im
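When the subject holds several lines, the m flag is needed so that ^ and $ match at line breaks, and the g flag collects every matching line. A small sketch with an invented log:

```javascript
var errorLog = 'ok line\nError: disk full\nterror alert\nno error here';

// "terror" is skipped because there is no word boundary before its "error".
var errorLines = errorLog.match(/^.*\berror\b.*$/gim) || [];

console.log(errorLines); // [ 'Error: disk full', 'no error here' ]
```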
Match complete lines that do not contain a word
Problem
You want to match complete lines that do not contain the word error.
Solution
/^(?:(?!\berror\b).)*$/im
Discussion
As you can see, the negative lookahead ((?!...)) and a dot (.) are repeated together using a non-capturing group. This ensures that the regex \berror\b fails at every position in the line. Testing a negative lookahead against every position in a line or string is rather inefficient, so when programming it is usually more efficient to search through the text line by line.
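Both approaches can be sketched side by side. The sample text and variable names below are assumptions for this example; the g and m flags make the regex work across all lines of the subject.

```javascript
var statusLog = 'ok line\nError: disk full\nall good';

// Regex-only approach: the lookahead is tested at every position of every line.
var cleanLinesByRegex = statusLog.match(/^(?:(?!\berror\b).)*$/gim) || [];

// Line-by-line approach: split first, then run one simple test per line.
var cleanLinesByFilter = statusLog.split(/\r?\n|\r/).filter(function (line) {
  return !/\berror\b/i.test(line);
});

console.log(cleanLinesByRegex);  // [ 'ok line', 'all good' ]
console.log(cleanLinesByFilter); // [ 'ok line', 'all good' ]
```

Note that the regex-only version also matches empty lines, since an empty line contains no occurrence of the word.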
Trim leading and trailing whitespace
Problem
You want to remove leading and trailing whitespace from a string.
Solution
Trim leading whitespace:
function trimLeadingWhitespace(subject) { return subject.replace(/^\s+/, ''); }
Trim trailing whitespace:
function trimTrailingWhitespace(subject) { return subject.replace(/\s+$/, ''); }
Discussion
You may consider using String.prototype.trim() to do the job for you. For older browsers, you may also consider adding a polyfill for this method:
if (!String.prototype.trim) {
  String.prototype.trim = function () {
    return this.replace(/^\s+/, '').replace(/\s+$/, '');
  };
}
There are also many other ways to trim a string, but these alternatives are usually slower than using the two simple regular expressions above:
function trim(subject) { return subject.replace(/^\s+|\s+$/g, ''); }
function trim(subject) { return subject.replace(/^\s*([\s\S]*?)\s*$/, '$1'); }
function trim(subject) { return subject.replace(/^\s*([\s\S]*\S)?\s*$/, '$1'); }
function trim(subject) { return subject.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, '$1'); }
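As a sanity check, the two-step regex approach can be compared with the built-in method. This sketch uses a made-up padded string:

```javascript
var padded = ' \t hello world \n';

var leadingTrimmed = padded.replace(/^\s+/, '');
var fullyTrimmed = padded.replace(/^\s+/, '').replace(/\s+$/, '');
var builtInTrimmed = padded.trim();

console.log(JSON.stringify(leadingTrimmed)); // "hello world \n"
console.log(JSON.stringify(fullyTrimmed));   // "hello world"
console.log(fullyTrimmed === builtInTrimmed); // true
```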
Replace repeated whitespace with a single space
Problem
You want to replace each run of whitespace, such as tabs, line breaks, or other whitespace characters, with a single space.
Solution
Clean any whitespace characters:
function convert(subject) { return subject.replace(/\s+/g, ' '); }
Clean horizontal whitespace characters only:
function convert(subject) { return subject.replace(/[ \t\xA0]+/g, ' '); }
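The difference between the two variants is whether line breaks survive. A brief sketch with an invented string that mixes spaces, a tab, no-break spaces, and line breaks:

```javascript
var messy = 'one \t two\xA0\xA0three\n\nfour';

// Every run of whitespace, including line breaks, becomes one space.
var singleLine = messy.replace(/\s+/g, ' ');

// Only runs of spaces, tabs, and no-break spaces are collapsed; line breaks stay.
var linesKept = messy.replace(/[ \t\xA0]+/g, ' ');

console.log(JSON.stringify(singleLine)); // "one two three four"
console.log(JSON.stringify(linesKept));  // "one two three\n\nfour"
```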