There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE
operator, the more recent SIMILAR TO
operator (added in SQL:1999), and POSIX-style regular expressions. Aside from the basic “does this string match this pattern?” operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
The pattern matching operators of all three kinds do not support nondeterministic collations. If required, apply a different collation to the expression to work around this limitation.
9.7.1. LIKE
string
LIKEpattern
[ESCAPEescape-character
]string
NOT LIKEpattern
[ESCAPEescape-character
]
The LIKE
expression returns true if the string
matches the supplied pattern
. (As expected, the NOT LIKE
expression returns false if LIKE
returns true, and vice versa. An equivalent expression is NOT (
.)string
LIKE pattern
)
If pattern
does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE
acts like the equals operator. An underscore (_
) in pattern
stands for (matches) any single character; a percent sign (%
) matches any sequence of zero or more characters.
Some examples:
'abc' LIKE 'abc' true 'abc' LIKE 'a%' true 'abc' LIKE '_b_' true 'abc' LIKE 'c' false
LIKE
pattern matching always covers the entire string. Therefore, if it’s desired to match a sequence anywhere within a string, the pattern must start and end with a percent sign.
To match a literal underscore or percent sign without matching other characters, the respective character in pattern
must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE
clause. To match the escape character itself, write two escape characters.
It’s also possible to select no escape character by writing ESCAPE ''
. This effectively disables the escape mechanism, which makes it impossible to turn off the special meaning of underscore and percent signs in the pattern.
According to the SQL standard, omitting ESCAPE
means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE
value is disallowed. PostgreSQL‘s behavior in this regard is therefore slightly nonstandard.
The key word ILIKE
can be used instead of LIKE
to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.
The operator ~~
is equivalent to LIKE
, and ~~*
corresponds to ILIKE
. There are also !~~
and !~~*
operators that represent NOT LIKE
and NOT ILIKE
, respectively. All of these operators are PostgreSQL-specific. You may see these operator names in EXPLAIN
output and similar places, since the parser actually translates LIKE
et al. to these operators.
The phrases LIKE
, ILIKE
, NOT LIKE
, and NOT ILIKE
are generally treated as operators in PostgreSQL syntax; for example they can be used in expression
operator
ANY (subquery
) constructs, although an ESCAPE
clause cannot be included there. In some obscure cases it may be necessary to use the underlying operator names instead.
Also see the starts-with operator ^@
and the corresponding starts_with()
function, which are useful in cases where simply matching the beginning of a string is needed.
9.7.2. SIMILAR TO
Regular Expressions
string
SIMILAR TOpattern
[ESCAPEescape-character
]string
NOT SIMILAR TOpattern
[ESCAPEescape-character
]
The SIMILAR TO
operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE
, except that it interprets the pattern using the SQL standard’s definition of a regular expression. SQL regular expressions are a curious cross between LIKE
notation and common (POSIX) regular expression notation.
Like LIKE
, the SIMILAR TO
operator succeeds only if its pattern matches the entire string; this is unlike common regular expression behavior where the pattern can match any part of the string. Also like LIKE
, SIMILAR TO
uses _
and %
as wildcard characters denoting any single character and any string, respectively (these are comparable to .
and .*
in POSIX regular expressions).
In addition to these facilities borrowed from LIKE
, SIMILAR TO
supports these pattern-matching metacharacters borrowed from POSIX regular expressions:
-
|
denotes alternation (either of two alternatives). -
*
denotes repetition of the previous item zero or more times. -
+
denotes repetition of the previous item one or more times. -
?
denotes repetition of the previous item zero or one time. -
{
m
}
denotes repetition of the previous item exactlym
times. -
{
m
,}
denotes repetition of the previous itemm
or more times. -
{
m
,
n
}
denotes repetition of the previous item at leastm
and not more thann
times. -
Parentheses
()
can be used to group items into a single logical item. -
A bracket expression
[...]
specifies a character class, just as in POSIX regular expressions.
Notice that the period (.
) is not a metacharacter for SIMILAR TO
.
As with LIKE
, a backslash disables the special meaning of any of these metacharacters. A different escape character can be specified with ESCAPE
, or the escape capability can be disabled by writing ESCAPE ''
.
According to the SQL standard, omitting ESCAPE
means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE
value is disallowed. PostgreSQL‘s behavior in this regard is therefore slightly nonstandard.
Another nonstandard extension is that following the escape character with a letter or digit provides access to the escape sequences defined for POSIX regular expressions; see Table 9.20, Table 9.21, and Table 9.22 below.
Some examples:
'abc' SIMILAR TO 'abc' true 'abc' SIMILAR TO 'a' false 'abc' SIMILAR TO '%(b|d)%' true 'abc' SIMILAR TO '(b|c)%' false '-abc-' SIMILAR TO '%mabcM%' true 'xabcy' SIMILAR TO '%mabcM%' false
The substring
function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax:
substring(string
similarpattern
escapeescape-character
)
or using the now obsolete SQL:1999 syntax:
substring(string
frompattern
forescape-character
)
or as a plain three-argument function:
substring(string
,pattern
,escape-character
)
As with SIMILAR TO
, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data sub-string is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("
). The text matching the portion of the pattern between these separators is returned when the match is successful.
The escape-double-quote separators actually divide substring
‘s pattern into three independent regular expressions; for example, a vertical bar (|
) in any of the three sections affects only that section. Also, the first and third of these regular expressions are defined to match the smallest possible amount of text, not the largest, when there is any ambiguity about how much of the data string matches which pattern. (In POSIX parlance, the first and third regular expressions are forced to be non-greedy.)
As an extension to the SQL standard, PostgreSQL allows there to be just one escape-double-quote separator, in which case the third regular expression is taken as empty; or no separators, in which case the first and third regular expressions are taken as empty.
Some examples, with #"
delimiting the return string:
substring('foobar' similar '%#"o_b#"%' escape '#') oob substring('foobar' similar '#"o_b#"%' escape '#') NULL
9.7.3. POSIX Regular Expressions
Table 9.16 lists the available operators for pattern matching using POSIX regular expressions.
Table 9.16. Regular Expression Match Operators
Operator Description Example(s) |
---|
String matches regular expression, case sensitively
|
String matches regular expression, case insensitively
|
String does not match regular expression, case sensitively
|
String does not match regular expression, case insensitively
|
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE
and SIMILAR TO
operators. Many Unix tools such as egrep
, sed
, or awk
use a pattern matching language that is similar to the one described here.
A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a regular expression if it is a member of the regular set described by the regular expression. As with LIKE
, pattern characters match string characters exactly unless they are special characters in the regular expression language — but regular expressions use different special characters than LIKE
does. Unlike LIKE
patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string.
Some examples:
'abcd' ~ 'bc' true 'abcd' ~ 'a.c' true — dot matches any character 'abcd' ~ 'a.*d' true —*
repeats the preceding pattern item 'abcd' ~ '(b|x)' true —|
means OR, parentheses group 'abcd' ~ '^a' true —^
anchors to start of string 'abcd' ~ '^(b|c)' false — would match except for anchoring
The POSIX pattern language is described in much greater detail below.
The substring
function with two parameters, substring(
, provides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there is no match, otherwise the first portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned. You can put parentheses around the whole expression if you want to use parentheses within it without triggering this exception. If you need parentheses in the pattern before the subexpression you want to extract, see the non-capturing parentheses described below.string
from pattern
)
Some examples:
substring('foobar' from 'o.b') oob substring('foobar' from 'o(.)b') o
The regexp_count
function counts the number of places where a POSIX regular expression pattern matches a string. It has the syntax regexp_count
(string
, pattern
[, start
[, flags
]]). pattern
is searched for in string
, normally from the beginning of the string, but if the start
parameter is provided then beginning from that character index. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. For example, including i
in flags
specifies case-insensitive matching. Supported flags are described in Table 9.24.
Some examples:
regexp_count('ABCABCAXYaxy', 'A.') 3 regexp_count('ABCABCAXYaxy', 'A.', 1, 'i') 4
The regexp_instr
function returns the starting or ending position of the N
‘th match of a POSIX regular expression pattern to a string, or zero if there is no such match. It has the syntax regexp_instr
(string
, pattern
[, start
[, N
[, endoption
[, flags
[, subexpr
]]]]]). pattern
is searched for in string
, normally from the beginning of the string, but if the start
parameter is provided then beginning from that character index. If N
is specified then the N
‘th match of the pattern is located, otherwise the first match is located. If the endoption
parameter is omitted or specified as zero, the function returns the position of the first character of the match. Otherwise, endoption
must be one, and the function returns the position of the character following the match. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. For a pattern containing parenthesized subexpressions, subexpr
is an integer indicating which subexpression is of interest: the result identifies the position of the substring matching that subexpression. Subexpressions are numbered in the order of their leading parentheses. When subexpr
is omitted or zero, the result identifies the position of the whole match regardless of parenthesized subexpressions.
Some examples:
regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2) 23 regexp_instr('ABCDEFGHI', '(c..)(...)', 1, 1, 0, 'i', 2) 6
The regexp_like
function checks whether a match of a POSIX regular expression pattern occurs within a string, returning boolean true or false. It has the syntax regexp_like
(string
, pattern
[, flags
]). The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. This function has the same results as the ~
operator if no flags are specified. If only the i
flag is specified, it has the same results as the ~*
operator.
Some examples:
regexp_like('Hello World', 'world') false regexp_like('Hello World', 'world', 'i') true
The regexp_match
function returns a text array of matching substring(s) within the first match of a POSIX regular expression pattern to a string. It has the syntax regexp_match
(string
, pattern
[, flags
]). If there is no match, the result is NULL
. If a match is found, and the pattern
contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern
contains parenthesized subexpressions, then the result is a text array whose n
‘th element is the substring matching the n
‘th parenthesized subexpression of the pattern
(not counting “non-capturing” parentheses; see below for details). The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24.
Some examples:
SELECT regexp_match('foobarbequebaz', 'bar.*que'); regexp_match -------------- {barbeque} (1 row) SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); regexp_match -------------- {bar,beque} (1 row)
Tip
In the common case where you just want the whole matching substring or NULL
for no match, the best solution is to use regexp_substr()
. However, regexp_substr()
only exists in PostgreSQL version 15 and up. When working in older versions, you can extract the first element of regexp_match()
‘s result, for example:
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1]; regexp_match -------------- barbeque (1 row)
The regexp_matches
function returns a set of text arrays of matching substring(s) within matches of a POSIX regular expression pattern to a string. It has the same syntax as regexp_match
. This function returns no rows if there is no match, one row if there is a match and the g
flag is not given, or N
rows if there are N
matches and the g
flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern
, just as described above for regexp_match
. regexp_matches
accepts all the flags shown in Table 9.24, plus the g
flag which commands it to return all matches, not just the first one.
Some examples:
SELECT regexp_matches('foo', 'not there'); regexp_matches ---------------- (0 rows) SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g'); regexp_matches ---------------- {bar,beque} {bazil,barf} (2 rows)
Tip
In most cases regexp_matches()
should be used with the g
flag, since if you only want the first match, it’s easier and more efficient to use regexp_match()
. However, regexp_match()
only exists in PostgreSQL version 10 and up. When working in older versions, a common trick is to place a regexp_matches()
call in a sub-select, for example:
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
This produces a text array if there’s a match, or NULL
if not, the same as regexp_match()
would do. Without the sub-select, this query would produce no output at all for table rows without a match, which is typically not the desired behavior.
The regexp_replace
function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace
(source
, pattern
, replacement
[, start
[, N
]] [, flags
]). (Notice that N
cannot be specified unless start
is, but flags
can be given in any case.) The source
string is returned unchanged if there is no match to the pattern
. If there is a match, the source
string is returned with the replacement
string substituted for the matching substring. The replacement
string can contain n
, where n
is 1 through 9, to indicate that the source substring matching the n
‘th parenthesized subexpression of the pattern should be inserted, and it can contain &
to indicate that the substring matching the entire pattern should be inserted. Write \
if you need to put a literal backslash in the replacement text. pattern
is searched for in string
, normally from the beginning of the string, but if the start
parameter is provided then beginning from that character index. By default, only the first match of the pattern is replaced. If N
is specified and is greater than zero, then the N
‘th match of the pattern is replaced. If the g
flag is given, or if N
is specified and is zero, then all matches at or after the start
position are replaced. (The g
flag is ignored when N
is specified.) The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags (though not g
) are described in Table 9.24.
Some examples:
regexp_replace('foobarbaz', 'b..', 'X') fooXbaz regexp_replace('foobarbaz', 'b..', 'X', 'g') fooXX regexp_replace('foobarbaz', 'b(..)', 'X1Y', 'g') fooXarYXazY regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i') X PXstgrXSQL fXnctXXn regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 3, 'i') A PostgrXSQL function
The regexp_split_to_table
function splits a string using a POSIX regular expression pattern as a delimiter. It has the syntax regexp_split_to_table
(string
, pattern
[, flags
]). If there is no match to the pattern
, the function returns the string
. If there is at least one match, for each match it returns the text from the end of the last match (or the beginning of the string) to the beginning of the match. When there are no more matches, it returns the text from the end of the last match to the end of the string. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. regexp_split_to_table
supports the flags described in Table 9.24.
The regexp_split_to_array
function behaves the same as regexp_split_to_table
, except that regexp_split_to_array
returns its result as an array of text
. It has the syntax regexp_split_to_array
(string
, pattern
[, flags
]). The parameters are the same as for regexp_split_to_table
.
Some examples:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', 's+') AS foo; foo ------- the quick brown fox jumps over the lazy dog (9 rows) SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', 's+'); regexp_split_to_array ----------------------------------------------- {the,quick,brown,fox,jumps,over,the,lazy,dog} (1 row) SELECT foo FROM regexp_split_to_table('the quick brown fox', 's*') AS foo; foo ----- t h e q u i c k b r o w n f o x (16 rows)
As the last example demonstrates, the regexp split functions ignore zero-length matches that occur at the start or end of the string or immediately after a previous match. This is contrary to the strict definition of regexp matching that is implemented by the other regexp functions, but is usually the most convenient behavior in practice. Other software systems such as Perl use similar definitions.
The regexp_substr
function returns the substring that matches a POSIX regular expression pattern, or NULL
if there is no match. It has the syntax regexp_substr
(string
, pattern
[, start
[, N
[, flags
[, subexpr
]]]]). pattern
is searched for in string
, normally from the beginning of the string, but if the start
parameter is provided then beginning from that character index. If N
is specified then the N
‘th match of the pattern is returned, otherwise the first match is returned. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. For a pattern containing parenthesized subexpressions, subexpr
is an integer indicating which subexpression is of interest: the result is the substring matching that subexpression. Subexpressions are numbered in the order of their leading parentheses. When subexpr
is omitted or zero, the result is the whole match regardless of parenthesized subexpressions.
Some examples:
regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2) town zip regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2) FGH
9.7.3.1. Regular Expression Details
PostgreSQL‘s regular expressions are implemented using a software package written by Henry Spencer. Much of the description of regular expressions below is copied verbatim from his manual.
Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep
), and basic REs or BREs (roughly those of ed
). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
Note
PostgreSQL always initially presumes that a regular expression follows the ARE rules. However, the more limited ERE or BRE rules can be chosen by prepending an embedded option to the RE pattern, as described in Section 9.7.3.4. This can be useful for compatibility with applications that expect exactly the POSIX 1003.2 rules.
A regular expression is defined as one or more branches, separated by |
. It matches anything that matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc.; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in Table 9.17. The possible quantifiers and their meanings are shown in Table 9.18.
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in Table 9.19; some more constraints are described later.
Table 9.17. Regular Expression Atoms
Atom | Description |
---|---|
( re ) |
(where re is any regular expression) matches a match for re , with the match noted for possible reporting |
(?: re ) |
as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) (AREs only) |
. |
matches any single character |
[ chars ] |
a bracket expression, matching any one of the chars (see Section 9.7.3.2 for more detail) |
k |
(where k is a non-alphanumeric character) matches that character taken as an ordinary character, e.g., \ matches a backslash character |
c |
where c is alphanumeric (possibly followed by other characters) is an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this matches c ) |
{ |
when followed by a character other than a digit, matches the left-brace character { ; when followed by a digit, it is the beginning of a bound (see below) |
x |
where x is a single character with no other significance, matches that character |
An RE cannot end with a backslash ().
Table 9.18. Regular Expression Quantifiers
Quantifier | Matches |
---|---|
* |
a sequence of 0 or more matches of the atom |
+ |
a sequence of 1 or more matches of the atom |
? |
a sequence of 0 or 1 matches of the atom |
{ m } |
a sequence of exactly m matches of the atom |
{ m ,} |
a sequence of m or more matches of the atom |
{ m , n } |
a sequence of m through n (inclusive) matches of the atom; m cannot exceed n |
*? |
non-greedy version of * |
+? |
non-greedy version of + |
?? |
non-greedy version of ? |
{ m }? |
non-greedy version of { m } |
{ m ,}? |
non-greedy version of { m ,} |
{ m , n }? |
non-greedy version of { m , n } |
The forms using {
...
}
are known as bounds. The numbers m
and n
within a bound are unsigned decimal integers with permissible values from 0 to 255 inclusive.
Non-greedy quantifiers (available in AREs only) match the same possibilities as their corresponding normal (greedy) counterparts, but prefer the smallest number rather than the largest number of matches. See Section 9.7.3.5 for more detail.
Note
A quantifier cannot immediately follow another quantifier, e.g., **
is invalid. A quantifier cannot begin an expression or subexpression or follow ^
or |
.
Table 9.19. Regular Expression Constraints
Constraint | Description |
---|---|
^ |
matches at the beginning of the string |
$ |
matches at the end of the string |
(?= re ) |
positive lookahead matches at any point where a substring matching re begins (AREs only) |
(?! re ) |
negative lookahead matches at any point where no substring matching re begins (AREs only) |
(?<= re ) |
positive lookbehind matches at any point where a substring matching re ends (AREs only) |
(?<! re ) |
negative lookbehind matches at any point where no substring matching re ends (AREs only) |
Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within them are considered non-capturing.
9.7.3.2. Bracket Expressions
A bracket expression is a list of characters enclosed in []
. It normally matches any single character from the list (but see below). If the list begins with ^
, it matches any single character not from the rest of the list. If two characters in the list are separated by -
, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g., [0-9]
in ASCII matches any decimal digit. It is illegal for two ranges to share an endpoint, e.g., a-c-e
. Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them.
To include a literal ]
in the list, make it the first character (after ^
, if that is used). To include a literal -
, make it the first or last character, or the second endpoint of a range. To use a literal -
as the first endpoint of a range, enclose it in [.
and .]
to make it a collating element (see below). With the exception of these characters, some combinations using [
(see next paragraphs), and escapes (AREs only), all other special characters lose their special significance within a bracket expression. In particular, is not special when following ERE or BRE rules, though it is special (as introducing an escape) in AREs.
Within a bracket expression, a collating element (a character, a multiple-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in [.
and .]
stands for the sequence of characters of that collating element. The sequence is treated as a single element of the bracket expression’s list. This allows a bracket expression containing a multiple-character collating element to match more than one character, e.g., if the collating sequence includes a ch
collating element, then the RE [[.ch.]]*c
matches the first five characters of chchcc
.
Note
PostgreSQL currently does not support multi-character collating elements. This information describes possible future behavior.
Within a bracket expression, a collating element enclosed in [=
and =]
is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were [.
and .]
.) For example, if o
and ^
are the members of an equivalence class, then [[=o=]]
, [[=^=]]
, and [o^]
are all synonymous. An equivalence class cannot be an endpoint of a range.
Within a bracket expression, the name of a character class enclosed in [:
and :]
stands for the list of all characters belonging to that class. A character class cannot be used as an endpoint of a range. The POSIX standard defines these character class names: alnum
(letters and numeric digits), alpha
(letters), blank
(space and tab), cntrl
(control characters), digit
(numeric digits), graph
(printable characters except space), lower
(lower-case letters), print
(printable characters including space), punct
(punctuation), space
(any white space), upper
(upper-case letters), and xdigit
(hexadecimal digits). The behavior of these standard character classes is generally consistent across platforms for characters in the 7-bit ASCII set. Whether a given non-ASCII character is considered to belong to one of these classes depends on the collation that is used for the regular-expression function or operator (see Section 24.2), or by default on the database’s LC_CTYPE
locale setting (see Section 24.1). The classification of non-ASCII characters can vary across platforms even in similarly-named locales. (But the C
locale never considers any non-ASCII characters to belong to any of these classes.) In addition to these standard character classes, PostgreSQL defines the word
character class, which is the same as alnum
plus the underscore (_
) character, and the ascii
character class, which contains exactly the 7-bit ASCII set.
There are two special cases of bracket expressions: the bracket expressions [[:<:]]
and [[:>:]]
are constraints, matching empty strings at the beginning and end of a word respectively. A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is any character belonging to the word
character class, that is, any letter, digit, or underscore. This is an extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be portable to other systems. The constraint escapes described below are usually preferable; they are no more standard, but are easier to type.
9.7.3.3. Regular Expression Escapes
Escapes are special sequences beginning with followed by an alphanumeric character. Escapes come in several varieties: character entry, class shorthands, constraint escapes, and back references. A
followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In EREs, there are no escapes: outside a bracket expression, a
followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression,
is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient characters in REs. They are shown in Table 9.20.
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are shown in Table 9.21.
A constraint escape is a constraint, matching the empty string if specific conditions are met, written as an escape. They are shown in Table 9.22.
A back reference (n
) matches the same string matched by the previous parenthesized subexpression specified by the number n
(see Table 9.23). For example, ([bc])1
matches bb
or cc
but not bc
or cb
. The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions. The back reference considers only the string characters matched by the referenced subexpression, not any constraints contained in it. For example, (^d)1
will match 22
.
Table 9.20. Regular Expression Character-Entry Escapes
Escape | Description |
---|---|
a |
alert (bell) character, as in C |
b |
backspace, as in C |
B |
synonym for backslash ( ) to help reduce the need for backslash doubling |
c X |
(where X is any character) the character whose low-order 5 bits are the same as those of X , and whose other bits are all zero |
e |
the character whose collating-sequence name is ESC , or failing that, the character with octal value 033 |
f |
form feed, as in C |
n |
newline, as in C |
r |
carriage return, as in C |
t |
horizontal tab, as in C |
u wxyz |
(where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0x wxyz |
U stuvwxyz |
(where stuvwxyz is exactly eight hexadecimal digits) the character whose hexadecimal value is 0x stuvwxyz |
v |
vertical tab, as in C |
x hhh |
(where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0x hhh (a single character no matter how many hexadecimal digits are used) |
|
the character whose value is 0 (the null byte) |
xy |
(where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0 xy |
xyz |
(where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0 xyz |
Hexadecimal digits are 0
—9
, a
—f
, and A
—F
. Octal digits are 0
—7
.
Numeric character-entry escapes specifying values outside the ASCII range (0–127) have meanings dependent on the database encoding. When the encoding is UTF-8, escape values are equivalent to Unicode code points, for example u1234
means the character U+1234
. For other multibyte encodings, character-entry escapes usually just specify the concatenation of the byte values for the character. If the escape value does not correspond to any legal character in the database encoding, no error will be raised, but it will never match any data.
The character-entry escapes are always taken as ordinary characters. For example, 135
is ]
in ASCII, but 135
does not terminate a bracket expression.
Table 9.21. Regular Expression Class-Shorthand Escapes
Escape | Description |
---|---|
d |
matches any digit, like [[:digit:]] |
s |
matches any whitespace character, like [[:space:]] |
w |
matches any word character, like [[:word:]] |
D |
matches any non-digit, like [^[:digit:]] |
S |
matches any non-whitespace character, like [^[:space:]] |
W |
matches any non-word character, like [^[:word:]] |
The class-shorthand escapes also work within bracket expressions, although the definitions shown above are not quite syntactically valid in that context. For example, [a-cd]
is equivalent to [a-c[:digit:]]
.
Table 9.22. Regular Expression Constraint Escapes
Escape | Description |
---|---|
A |
matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^ ) |
m |
matches only at the beginning of a word |
M |
matches only at the end of a word |
y |
matches only at the beginning or end of a word |
Y |
matches only at a point that is not the beginning or end of a word |
Z |
matches only at the end of the string (see Section 9.7.3.5 for how this differs from $ ) |
A word is defined as in the specification of [[:<:]]
and [[:>:]]
above. Constraint escapes are illegal within bracket expressions.
Table 9.23. Regular Expression Back References
Escape | Description |
---|---|
m |
(where m is a nonzero digit) a back reference to the m ‘th subexpression |
mnn |
(where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference to the mnn ‘th subexpression |
Note
There is an inherent ambiguity between octal character-entry escapes and back references, which is resolved by the following heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e., the number is in the legal range for a back reference), and otherwise is taken as octal.
9.7.3.5. Regular Expression Matching Rules
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
Whether an RE is greedy or not is determined by the following rules:
-
Most atoms, and all constraints, have no greediness attribute (because they cannot match variable amounts of text anyway).
-
Adding parentheses around an RE does not change its greediness.
-
A quantified atom with a fixed-repetition quantifier (
{
m
}
or{
m
}?
) has the same greediness (possibly none) as the atom itself. -
A quantified atom with other normal quantifiers (including
{
m
,
n
}
withm
equal ton
) is greedy (prefers longest match). -
A quantified atom with a non-greedy quantifier (including
{
m
,
n
}?
withm
equal ton
) is non-greedy (prefers shortest match). -
A branch — that is, an RE that has no top-level
|
operator — has the same greediness as the first quantified atom in it that has a greediness attribute. -
An RE consisting of two or more branches connected by the
|
operator is always greedy.
The above rules associate greediness attributes not only with individual quantified atoms, but with branches and entire REs that contain quantified atoms. What that means is that the matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})'); Result:123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})'); Result:1
In the first case, the RE as a whole is greedy because Y*
is greedy. It can match beginning at the Y
, and it matches the longest possible string starting there, i.e., Y123
. The output is the parenthesized part of that, or 123
. In the second case, the RE as a whole is non-greedy because Y*?
is non-greedy. It can match beginning at the Y
, and it matches the shortest possible string starting there, i.e., Y1
. The subexpression [0-9]{1,3}
is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1
.
In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to “eat” relative to each other.
The quantifiers {1,1}
and {1,1}?
can be used to force greediness or non-greediness, respectively, on a subexpression or a whole RE. This is useful when you need the whole RE to have a greediness attribute different from what’s deduced from its elements. As an example, suppose that we are trying to separate a string containing some digits into the digits and the parts before and after them. We might try to do that like this:
SELECT regexp_match('abc01234xyz', '(.*)(d+)(.*)');
Result: {abc0123,4,xyz}
That didn’t work: the first .*
is greedy so it “eats” as much as it can, leaving the d+
to match at the last possible place, the last digit. We might try to fix that by making it non-greedy:
SELECT regexp_match('abc01234xyz', '(.*?)(d+)(.*)');
Result: {abc,0,""}
That didn’t work either, because now the RE as a whole is non-greedy and so it ends the overall match as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:
SELECT regexp_match('abc01234xyz', '(?:(.*?)(d+)(.*)){1,1}');
Result: {abc,01234,xyz}
Controlling the RE’s overall greediness separately from its components’ greediness allows great flexibility in handling variable-length patterns.
When deciding what is a longer or shorter match, match lengths are measured in characters, not collating elements. An empty string is considered longer than no match at all. For example: bb*
matches the three middle characters of abbbc
; (week|wee)(night|knights)
matches all ten characters of weeknights
; when (.*).*
is matched against abc
the parenthesized subexpression matches all three characters; and when (a*)*
is matched against bc
both the whole RE and the parenthesized subexpression match an empty string.
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g., x
becomes [xX]
. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, e.g., [x]
becomes [xX]
and [^x]
becomes [^xX]
.
If newline-sensitive matching is specified, .
and bracket expressions using ^
will never match the newline character (so that matches will not cross lines unless the RE explicitly includes a newline) and ^
and $
will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. But the ARE escapes A
and Z
continue to match beginning or end of string only. Also, the character class shorthands D
and W
will match a newline regardless of this mode. (Before PostgreSQL 14, they did not match newlines when in newline-sensitive mode. Write [^[:digit:]]
or [^[:word:]]
to get the old behavior.)
If partial newline-sensitive matching is specified, this affects .
and bracket expressions as with newline-sensitive matching, but not ^
and $
.
If inverse partial newline-sensitive matching is specified, this affects ^
and $
as with newline-sensitive matching, but not .
and bracket expressions. This isn’t very useful but is provided for symmetry.
9.7.3.6. Limits and Compatibility
No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.
The only feature of AREs that is actually incompatible with POSIX EREs is that does not lose its special significance inside bracket expressions. All other ARE features use syntax which is illegal or has undefined or unspecified effects in POSIX EREs; the
***
syntax of directors likewise is outside the POSIX syntax for both BREs and EREs.
Many of the ARE extensions are borrowed from Perl, but some have been changed to clean them up, and a few Perl extensions are not present. Incompatibilities of note include b
, B
, the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, the restrictions on parentheses and back references in lookahead/lookbehind constraints, and the longest/shortest-match (rather than first-match) matching semantics.
9.7.3.7. Basic Regular Expressions
BREs differ from EREs in several respects. In BREs, |
, +
, and ?
are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are {
and }
, with {
and }
by themselves ordinary characters. The parentheses for nested subexpressions are (
and )
, with (
and )
by themselves ordinary characters. ^
is an ordinary character except at the beginning of the RE or the beginning of a parenthesized subexpression, $
is an ordinary character except at the end of the RE or the end of a parenthesized subexpression, and *
is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading ^
). Finally, single-digit back references are available, and <
and >
are synonyms for [[:<:]]
and [[:>:]]
respectively; no other escapes are available in BREs.
9.7.3.8. Differences from SQL Standard and XQuery
Since SQL:2008, the SQL standard includes regular expression operators and functions that performs pattern matching according to the XQuery regular expression standard:
-
LIKE_REGEX
-
OCCURRENCES_REGEX
-
POSITION_REGEX
-
SUBSTRING_REGEX
-
TRANSLATE_REGEX
PostgreSQL does not currently implement these operators and functions. You can get approximately equivalent functionality in each case as shown in Table 9.25. (Various optional clauses on both sides have been omitted in this table.)
Table 9.25. Regular Expression Functions Equivalencies
SQL standard | PostgreSQL |
---|---|
|
regexp_like( or
|
OCCURRENCES_REGEX( |
regexp_count( |
POSITION_REGEX( |
regexp_instr( |
SUBSTRING_REGEX( |
regexp_substr( |
TRANSLATE_REGEX( |
regexp_replace( |
Regular expression functions similar to those provided by PostgreSQL are also available in a number of other SQL implementations, whereas the SQL-standard functions are not as widely implemented. Some of the details of the regular expression syntax will likely differ in each implementation.
The SQL-standard operators and functions use XQuery regular expressions, which are quite close to the ARE syntax described above. Notable differences between the existing POSIX-based regular-expression feature and XQuery regular expressions include:
-
XQuery character class subtraction is not supported. An example of this feature is using the following to match only English consonants:
[a-z-[aeiou]]
. -
XQuery character class shorthands
c
,C
,i
, andI
are not supported. -
XQuery character class elements using
p{UnicodeProperty}
or the inverseP{UnicodeProperty}
are not supported. -
POSIX interprets character classes such as
w
(see Table 9.21) according to the prevailing locale (which you can control by attaching aCOLLATE
clause to the operator or function). XQuery specifies these classes by reference to Unicode character properties, so equivalent behavior is obtained only with a locale that follows the Unicode rules. -
The SQL standard (not XQuery itself) attempts to cater for more variants of “newline” than POSIX does. The newline-sensitive matching options described above consider only ASCII NL (
n
) to be a newline, but SQL would have us treat CR (r
), CRLF (rn
) (a Windows-style newline), and some Unicode-only characters like LINE SEPARATOR (U+2028) as newlines as well. Notably,.
ands
should countrn
as one character not two according to SQL. -
Of the character-entry escapes described in Table 9.20, XQuery supports only
n
,r
, andt
. -
XQuery does not support the
[:
syntax for character classes within bracket expressions.name
:] -
XQuery does not have lookahead or lookbehind constraints, nor any of the constraint escapes described in Table 9.22.
-
The metasyntax forms described in Section 9.7.3.4 do not exist in XQuery.
-
The regular expression flag letters defined by XQuery are related to but not the same as the option letters for POSIX (Table 9.24). While the
i
andq
options behave the same, others do not:-
XQuery’s
s
(allow dot to match newline) andm
(allow^
and$
to match at newlines) flags provide access to the same behaviors as POSIX’sn
,p
andw
flags, but they do not match the behavior of POSIX’ss
andm
flags. Note in particular that dot-matches-newline is the default behavior in POSIX but not XQuery. -
XQuery’s
x
(ignore whitespace in pattern) flag is noticeably different from POSIX’s expanded-mode flag. POSIX’sx
flag also allows#
to begin a comment in the pattern, and POSIX will not ignore a whitespace character after a backslash.
-
text
^@
text
→ boolean
Returns true if the first string starts with the second string (equivalent to the starts_with()
function).
'alphabet' ^@ 'alph'
→ t
ascii
( text
) → integer
Returns the numeric code of the first character of the argument. In UTF8 encoding, returns the Unicode code point of the character. In other multibyte encodings, the argument must be an ASCII character.
ascii('x')
→ 120
btrim
( string
text
[, characters
text
] ) → text
Removes the longest string containing only characters in characters
(a space by default) from the start and end of string
.
btrim('xyxtrimyyx', 'xyz')
→ trim
chr
( integer
) → text
Returns the character with the given code. In UTF8 encoding the argument is treated as a Unicode code point. In other multibyte encodings the argument must designate an ASCII character. chr(0)
is disallowed because text data types cannot store that character.
chr(65)
→ A
concat
( val1
"any"
[, val2
"any"
[, …] ] ) → text
Concatenates the text representations of all the arguments. NULL arguments are ignored.
concat('abcde', 2, NULL, 22)
→ abcde222
concat_ws
( sep
text
, val1
"any"
[, val2
"any"
[, …] ] ) → text
Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
concat_ws(',', 'abcde', 2, NULL, 22)
→ abcde,2,22
format
( formatstr
text
[, formatarg
"any"
[, …] ] ) → text
Formats arguments according to a format string; see Section 9.4.1. This function is similar to the C function sprintf
.
format('Hello %s, %1$s', 'World')
→ Hello World, World
initcap
( text
) → text
Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
initcap('hi THOMAS')
→ Hi Thomas
left
( string
text
, n
integer
) → text
Returns first n
characters in the string, or when n
is negative, returns all but last |n
| characters.
left('abcde', 2)
→ ab
length
( text
) → integer
Returns the number of characters in the string.
length('jose')
→ 4
lpad
( string
text
, length
integer
[, fill
text
] ) → text
Extends the string
to length length
by prepending the characters fill
(a space by default). If the string
is already longer than length
then it is truncated (on the right).
lpad('hi', 5, 'xy')
→ xyxhi
ltrim
( string
text
[, characters
text
] ) → text
Removes the longest string containing only characters in characters
(a space by default) from the start of string
.
ltrim('zzzytest', 'xyz')
→ test
md5
( text
) → text
Computes the MD5 hash of the argument, with the result written in hexadecimal.
md5('abc')
→ 900150983cd24fb0d6963f7d28e17f72
parse_ident
( qualified_identifier
text
[, strict_mode
boolean
DEFAULT
true
] ) → text[]
Splits qualified_identifier
into an array of identifiers, removing any quoting of individual identifiers. By default, extra characters after the last identifier are considered an error; but if the second parameter is false
, then such extra characters are ignored. (This behavior is useful for parsing names for objects like functions.) Note that this function does not truncate over-length identifiers. If you want truncation you can cast the result to name[]
.
parse_ident('"SomeSchema".someTable')
→ {SomeSchema,sometable}
pg_client_encoding
( ) → name
Returns current client encoding name.
pg_client_encoding()
→ UTF8
quote_ident
( text
) → text
Returns the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would be case-folded). Embedded quotes are properly doubled. See also Example 43.1.
quote_ident('Foo bar')
→ "Foo bar"
quote_literal
( text
) → text
Returns the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. Note that quote_literal
returns null on null input; if the argument might be null, quote_nullable
is often more suitable. See also Example 43.1.
quote_literal(E'O'Reilly')
→ 'O''Reilly'
quote_literal
( anyelement
) → text
Converts the given value to text and then quotes it as a literal. Embedded single-quotes and backslashes are properly doubled.
quote_literal(42.5)
→ '42.5'
quote_nullable
( text
) → text
Returns the given string suitably quoted to be used as a string literal in an SQL statement string; or, if the argument is null, returns NULL
. Embedded single-quotes and backslashes are properly doubled. See also Example 43.1.
quote_nullable(NULL)
→ NULL
quote_nullable
( anyelement
) → text
Converts the given value to text and then quotes it as a literal; or, if the argument is null, returns NULL
. Embedded single-quotes and backslashes are properly doubled.
quote_nullable(42.5)
→ '42.5'
regexp_count
( string
text
, pattern
text
[, start
integer
[, flags
text
] ] ) → integer
Returns the number of times the POSIX regular expression pattern
matches in the string
; see Section 9.7.3.
regexp_count('123456789012', 'ddd', 2)
→ 3
regexp_instr
( string
text
, pattern
text
[, start
integer
[, N
integer
[, endoption
integer
[, flags
text
[, subexpr
integer
] ] ] ] ] ) → integer
Returns the position within string
where the N
‘th match of the POSIX regular expression pattern
occurs, or zero if there is no such match; see Section 9.7.3.
regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i')
→ 3
regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i', 2)
→ 5
regexp_like
( string
text
, pattern
text
[, flags
text
] ) → boolean
Checks whether a match of the POSIX regular expression pattern
occurs within string
; see Section 9.7.3.
regexp_like('Hello World', 'world$', 'i')
→ t
regexp_match
( string
text
, pattern
text
[, flags
text
] ) → text[]
Returns substrings within the first match of the POSIX regular expression pattern
to the string
; see Section 9.7.3.
regexp_match('foobarbequebaz', '(bar)(beque)')
→ {bar,beque}
regexp_matches
( string
text
, pattern
text
[, flags
text
] ) → setof text[]
Returns substrings within the first match of the POSIX regular expression pattern
to the string
, or substrings within all such matches if the g
flag is used; see Section 9.7.3.
regexp_matches('foobarbequebaz', 'ba.', 'g')
→
{bar} {baz}
regexp_replace
( string
text
, pattern
text
, replacement
text
[, start
integer
] [, flags
text
] ) → text
Replaces the substring that is the first match to the POSIX regular expression pattern
, or all such matches if the g
flag is used; see Section 9.7.3.
regexp_replace('Thomas', '.[mN]a.', 'M')
→ ThM
regexp_replace
( string
text
, pattern
text
, replacement
text
, start
integer
, N
integer
[, flags
text
] ) → text
Replaces the substring that is the N
‘th match to the POSIX regular expression pattern
, or all such matches if N
is zero; see Section 9.7.3.
regexp_replace('Thomas', '.', 'X', 3, 2)
→ ThoXas
regexp_split_to_array
( string
text
, pattern
text
[, flags
text
] ) → text[]
Splits string
using a POSIX regular expression as the delimiter, producing an array of results; see Section 9.7.3.
regexp_split_to_array('hello world', 's+')
→ {hello,world}
regexp_split_to_table
( string
text
, pattern
text
[, flags
text
] ) → setof text
Splits string
using a POSIX regular expression as the delimiter, producing a set of results; see Section 9.7.3.
regexp_split_to_table('hello world', 's+')
→
hello world
regexp_substr
( string
text
, pattern
text
[, start
integer
[, N
integer
[, flags
text
[, subexpr
integer
] ] ] ] ) → text
Returns the substring within string
that matches the N
‘th occurrence of the POSIX regular expression pattern
, or NULL
if there is no such match; see Section 9.7.3.
regexp_substr('ABCDEF', 'c(.)(..)', 1, 1, 'i')
→ CDEF
regexp_substr('ABCDEF', 'c(.)(..)', 1, 1, 'i', 2)
→ EF
repeat
( string
text
, number
integer
) → text
Repeats string
the specified number
of times.
repeat('Pg', 4)
→ PgPgPgPg
replace
( string
text
, from
text
, to
text
) → text
Replaces all occurrences in string
of substring from
with substring to
.
replace('abcdefabcdef', 'cd', 'XX')
→ abXXefabXXef
reverse
( text
) → text
Reverses the order of the characters in the string.
reverse('abcde')
→ edcba
right
( string
text
, n
integer
) → text
Returns last n
characters in the string, or when n
is negative, returns all but first |n
| characters.
right('abcde', 2)
→ de
rpad
( string
text
, length
integer
[, fill
text
] ) → text
Extends the string
to length length
by appending the characters fill
(a space by default). If the string
is already longer than length
then it is truncated.
rpad('hi', 5, 'xy')
→ hixyx
rtrim
( string
text
[, characters
text
] ) → text
Removes the longest string containing only characters in characters
(a space by default) from the end of string
.
rtrim('testxxzx', 'xyz')
→ test
split_part
( string
text
, delimiter
text
, n
integer
) → text
Splits string
at occurrences of delimiter
and returns the n
‘th field (counting from one), or when n
is negative, returns the |n
|’th-from-last field.
split_part('abc~@~def~@~ghi', '~@~', 2)
→ def
split_part('abc,def,ghi,jkl', ',', -2)
→ ghi
starts_with
( string
text
, prefix
text
) → boolean
Returns true if string
starts with prefix
.
starts_with('alphabet', 'alph')
→ t
string_to_array
( string
text
, delimiter
text
[, null_string
text
] ) → text[]
Splits the string
at occurrences of delimiter
and forms the resulting fields into a text
array. If delimiter
is NULL
, each character in the string
will become a separate element in the array. If delimiter
is an empty string, then the string
is treated as a single field. If null_string
is supplied and is not NULL
, fields matching that string are replaced by NULL
. See also array_to_string
.
string_to_array('xx~~yy~~zz', '~~', 'yy')
→ {xx,NULL,zz}
string_to_table
( string
text
, delimiter
text
[, null_string
text
] ) → setof text
Splits the string
at occurrences of delimiter
and returns the resulting fields as a set of text
rows. If delimiter
is NULL
, each character in the string
will become a separate row of the result. If delimiter
is an empty string, then the string
is treated as a single field. If null_string
is supplied and is not NULL
, fields matching that string are replaced by NULL
.
string_to_table('xx~^~yy~^~zz', '~^~', 'yy')
→
xx NULL zz
strpos
( string
text
, substring
text
) → integer
Returns first starting index of the specified substring
within string
, or zero if it’s not present. (Same as position(
, but note the reversed argument order.)substring
in string
)
strpos('high', 'ig')
→ 2
substr
( string
text
, start
integer
[, count
integer
] ) → text
Extracts the substring of string
starting at the start
‘th character, and extending for count
characters if that is specified. (Same as substring(
.)string
from start
for count
)
substr('alphabet', 3)
→ phabet
substr('alphabet', 3, 2)
→ ph
to_ascii
( string
text
) → text
to_ascii
( string
text
, encoding
name
) → text
to_ascii
( string
text
, encoding
integer
) → text
Converts string
to ASCII from another encoding, which may be identified by name or number. If encoding
is omitted the database encoding is assumed (which in practice is the only useful case). The conversion consists primarily of dropping accents. Conversion is only supported from LATIN1
, LATIN2
, LATIN9
, and WIN1250
encodings. (See the unaccent module for another, more flexible solution.)
to_ascii('Karél')
→ Karel
to_hex
( integer
) → text
to_hex
( bigint
) → text
Converts the number to its equivalent hexadecimal representation.
to_hex(2147483647)
→ 7fffffff
translate
( string
text
, from
text
, to
text
) → text
Replaces each character in string
that matches a character in the from
set with the corresponding character in the to
set. If from
is longer than to
, occurrences of the extra characters in from
are deleted.
translate('12345', '143', 'ax')
→ a2x5
unistr
( text
) → text
Evaluate escaped Unicode characters in the argument. Unicode characters can be specified as
(4 hexadecimal digits), XXXX
+
(6 hexadecimal digits), XXXXXX
u
(4 hexadecimal digits), or XXXX
U
(8 hexadecimal digits). To specify a backslash, write two backslashes. All other characters are taken literally.XXXXXXXX
If the server encoding is not UTF-8, the Unicode code point identified by one of these escape sequences is converted to the actual server encoding; an error is reported if that’s not possible.
This function provides a (non-standard) alternative to string constants with Unicode escapes (see Section 4.1.2.3).
unistr('d061t+000061')
→ data
unistr('du0061tU00000061')
→ data
В этом учебном пособии вы узнаете, как использовать PostgreSQL условие LIKE с синтаксисом и примерами.
Описание
PostgreSQL условие LIKE позволяет использовать подстановочные символы (метасимволы) в предложении WHERE оператора SELECT, INSERT, UPDATE или DELETE. Это позволяет выполнять сопоставление с pattern.
Синтаксис
Синтаксис для условия LIKE в PostgreSQL:
expression LIKE pattern [ ESCAPE ‘escape_character’ ]
Параметры или аргументы
- expression
- Символьное выражение, такое как поле или столбец.
- pattern
- Символьное выражение, которое содержит сопоставляемый шаблон. Шаблоны, которые вы можете выбрать из:
Подстановочный символ Пояснение % Соответствует любой строке любой длины (в том числе нулевой длины) _ Соответствует одному символу - escape_character
- Необязательный. Это позволяет вам проверять наличие буквенных символов, таких как % или _. Если вы не предоставите escape_character, PostgreSQL предполагает, что
является escape_character.
Пример — использование подстановочного символа % (символ знак процента)
Первый пример PostgreSQL LIKE, который мы рассмотрим, включает использование подстановочного символа % (символ знак процента).
Давайте разберем, как подстановочный символ % работает в PostgreSQL условии LIKE. Мы хотим найти всех employees, last_name начинается с ‘Jo’.
SELECT * FROM employees WHERE first_name LIKE ‘Jo%’; |
Вы также можете использовать подстановочный символ % несколько раз в одной строке. Например,
SELECT * FROM employees WHERE first_name LIKE ‘%od%’; |
В этом PostgreSQL примере условия LIKE мы ищем всех employees, чье first_name содержит символы ‘od’.
Пример — использование подстановочного символа _ (символ подчеркивание)
Далее, давайте рассмотрим, как подстановочный символ _ (символ подчеркивания) работает в PostgreSQL условии LIKE. Помните, что подстановочный символ _ ищет только один символ.
Например:
SELECT first_name, last_name FROM employees WHERE first_name LIKE ‘Yoh_n’; |
Этот пример PostgreSQL условия LIKE вернул бы всех suppliers, чье supplier_name имеет длину 5 символов, где первые три символа — «Yoh», а последний — «n». Например, он может возвращать записи таблицы employees, у которых first_name — «Yohan», «Yohen», «Yohin», «Yohon» и т.д.
Вот еще один пример:
SELECT * FROM employees WHERE employee_number LIKE ‘98765_’; |
Вы можете обнаружить, что ищете номер счета, но у вас есть только 5 из 6 цифр. В приведенном выше примере можно было бы получить обратно 10 записей (где отсутствующее значение может быть равно 0–9). Например, он может вернуть записи таблицы employees с employee_number:
987650, 987651, 987652, 987653, 987654, 987655, 987656, 987657, 987658, 987659
Пример — использование оператора NOT
Теперь давайте посмотрим, как вы можете использовать оператор NOT с подстановочными символами.
Давайте использовать подстановочный символ % с оператором NOT. Вы также можете использовать PostgreSQL условие LIKE, чтобы найти записи таблицы employees, чье last_name не начинается с ‘J’.
Например:
SELECT first_name, last_name FROM employees WHERE last_name NOT LIKE ‘J%’; |
Поместив оператор NOT перед PostgreSQL условием LIKE, вы сможете получить все записи employees, чье last_name не начинается с ‘J’.
Пример — использование Escape-символов
Важно понимать, как «экранировать символы» когда pattern соответствует. Эти примеры имеют дело с экранированием символов в PostgreSQL.
Допустим, вы хотели найти символ % или _ в условии PostgreSQL LIKE. Вы можете сделать это с помощью символа Escape.
Обратите внимание, что вы можете определить только escape-символ как один символ (длина 1).
Например:
SELECT * FROM employees WHERE last_name LIKE ‘G%’; |
Поскольку мы не указали escape-символ, PostgreSQL предполагает, что это escape-символ. Затем PostgreSQL предполагает, что escape-символ является в результате PostgreSQL рассматривает символ % как литерал вместо подстановочного символа. Этот оператор затем возвращает все записи из employees, чье last_name равно G%.
Мы можем переопределить escape-символ по умолчанию в PostgreSQL, предоставив модификатор ESCAPE следующим образом:
SELECT * FROM employees WHERE last_name LIKE ‘G!%’ ESCAPE ‘!’; |
Этот пример PostgreSQL условия LIKE определяет ! как escape-символ. ! escape-символ приведет к тому, что PostgreSQL будет рассматривать символ % как литерал. В результате этот оператор также вернет все записи из таблицы employees, last_name которых равно G%.
Вот еще один более сложный пример использования escape-символов в PostgreSQL условии LIKE.
SELECT * FROM employees WHERE last_name LIKE ‘M%%’; |
Этот пример PostgreSQL условия LIKE возвращает всех employees, чье last_name начинается с ‘M’ и заканчивается на ‘%’. Например, он вернул бы значение, такое как «Mathison%». Поскольку в условии LIKE мы не указывали escape-символ, PostgreSQL предполагает, что escape-символ является что приводит к тому, что PostgreSQL рассматривает второй символ % как литерал вместо подстановочного символа.
Мы могли бы изменить это условие LIKE, указав escape-символ следующим образом:
SELECT * FROM employees WHERE last_name LIKE ‘M%!%’ ESCAPE ‘!’; |
Этот пример PostgreSQL условия LIKE возвращает все записи из employees, чье last_name начинается с ‘M’ и заканчивается литеральным ‘%’. Например, он будет возвращать значение, такое как «Mathison%».
Вы также можете использовать escape-символ с символом _ в условии PostgreSQL LIKE.
Например:
SELECT * FROM employees WHERE last_name LIKE ‘M%_’; |
Опять же, поскольку модификатор ESCAPE не предоставлен, PostgreSQL использует как escape-символ, приводящий к символу _, который будет трактоваться как литерал вместо подстановочного символа. В этом примере будут возвращаться все поля из employees, чье last_name начинается с ‘M’ и заканчивается на ‘_’. Например, он вернул бы значение, такое как ‘Mathison_’.
Function | Return Type | Description | Example | Result |
---|---|---|---|---|
|
int |
ASCII code of the first character of the argument. For UTF8 returns the Unicode code point of the character. For other multibyte encodings, the argument must be an ASCII character. | ascii('x') |
120 |
|
text |
Remove the longest string consisting only of characters in characters (a space by default) from the start and end of string |
btrim('xyxtrimyyx', 'xyz') |
trim |
|
text |
Character with the given code. For UTF8 the argument is treated as a Unicode code point. For other multibyte encodings the argument must designate an ASCII character. The NULL (0) character is not allowed because text data types cannot store such bytes. | chr(65) |
A |
|
text |
Concatenate the text representations of all the arguments. NULL arguments are ignored. | concat('abcde', 2, NULL, 22) |
abcde222 |
|
text |
Concatenate all but the first argument with separators. The first argument is used as the separator string. NULL arguments are ignored. | concat_ws(',', 'abcde', 2, NULL, 22) |
abcde,2,22 |
|
bytea |
Convert string to dest_encoding . The original encoding is specified by src_encoding . The string must be valid in this encoding. Conversions can be defined by CREATE CONVERSION . Also there are some predefined conversions. See Table 9.10 for available conversions. |
convert('text_in_utf8', 'UTF8', 'LATIN1') |
text_in_utf8 represented in Latin-1 encoding (ISO 8859-1) |
|
text |
Convert string to the database encoding. The original encoding is specified by src_encoding . The string must be valid in this encoding. |
convert_from('text_in_utf8', 'UTF8') |
text_in_utf8 represented in the current database encoding |
|
bytea |
Convert string to dest_encoding . |
convert_to('some text', 'UTF8') |
some text represented in the UTF8 encoding |
|
bytea |
Decode binary data from textual representation in string . Options for format are same as in encode . |
decode('MTIzAAE=', 'base64') |
x3132330001 |
|
text |
Encode binary data into a textual representation. Supported formats are: base64 , hex , escape . escape converts zero bytes and high-bit-set bytes to octal sequences (
nnn ) and doubles backslashes. |
encode('1230001', 'base64') |
MTIzAAE= |
|
text |
Format arguments according to a format string. This function is similar to the C function sprintf . See Section 9.4.1. |
format('Hello %s, %1$s', 'World') |
Hello World, World |
|
text |
Convert the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters. | initcap('hi THOMAS') |
Hi Thomas |
|
text |
Return first n characters in the string. When n is negative, return all but last |n | characters. |
left('abcde', 2) |
ab |
|
int |
Number of characters in string |
length('jose') |
4 |
|
int |
Number of characters in string in the given encoding . The string must be valid in this encoding. |
length('jose', 'UTF8') |
4 |
|
text |
Fill up the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right). |
lpad('hi', 5, 'xy') |
xyxhi |
|
text |
Remove the longest string containing only characters from characters (a space by default) from the start of string |
ltrim('zzzytest', 'xyz') |
test |
|
text |
Calculates the MD5 hash of string , returning the result in hexadecimal |
md5('abc') |
900150983cd24fb0 d6963f7d28e17f72 |
|
text[] |
Split qualified_identifier into an array of identifiers, removing any quoting of individual identifiers. By default, extra characters after the last identifier are considered an error; but if the second parameter is false , then such extra characters are ignored. (This behavior is useful for parsing names for objects like functions.) Note that this function does not truncate over-length identifiers. If you want truncation you can cast the result to name[] . |
parse_ident('"SomeSchema".someTable') |
{SomeSchema,sometable} |
|
name |
Current client encoding name | pg_client_encoding() |
SQL_ASCII |
|
text |
Return the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would be case-folded). Embedded quotes are properly doubled. See also Example 43.1. | quote_ident('Foo bar') |
"Foo bar" |
|
text |
Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. Note that quote_literal returns null on null input; if the argument might be null, quote_nullable is often more suitable. See also Example 43.1. |
quote_literal(E'O'Reilly') |
'O''Reilly' |
|
text |
Coerce the given value to text and then quote it as a literal. Embedded single-quotes and backslashes are properly doubled. | quote_literal(42.5) |
'42.5' |
|
text |
Return the given string suitably quoted to be used as a string literal in an SQL statement string; or, if the argument is null, return NULL . Embedded single-quotes and backslashes are properly doubled. See also Example 43.1. |
quote_nullable(NULL) |
NULL |
|
text |
Coerce the given value to text and then quote it as a literal; or, if the argument is null, return NULL . Embedded single-quotes and backslashes are properly doubled. |
quote_nullable(42.5) |
'42.5' |
|
text[] |
Return captured substring(s) resulting from the first match of a POSIX regular expression to the string . See Section 9.7.3 for more information. |
regexp_match('foobarbequebaz', '(bar)(beque)') |
{bar,beque} |
|
setof text[] |
Return captured substring(s) resulting from matching a POSIX regular expression to the string . See Section 9.7.3 for more information. |
regexp_matches('foobarbequebaz', 'ba.', 'g') |
{bar}
(2 rows) |
|
text |
Replace substring(s) matching a POSIX regular expression. See Section 9.7.3 for more information. | regexp_replace('Thomas', '.[mN]a.', 'M') |
ThM |
|
text[] |
Split string using a POSIX regular expression as the delimiter. See Section 9.7.3 for more information. |
regexp_split_to_array('hello world', 's+') |
{hello,world} |
|
setof text |
Split string using a POSIX regular expression as the delimiter. See Section 9.7.3 for more information. |
regexp_split_to_table('hello world', 's+') |
hello
(2 rows) |
|
text |
Repeat string the specified number of times |
repeat('Pg', 4) |
PgPgPgPg |
|
text |
Replace all occurrences in string of substring from with substring to |
replace('abcdefabcdef', 'cd', 'XX') |
abXXefabXXef |
|
text |
Return reversed string. | reverse('abcde') |
edcba |
|
text |
Return last n characters in the string. When n is negative, return all but first |n | characters. |
right('abcde', 2) |
de |
|
text |
Fill up the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated. |
rpad('hi', 5, 'xy') |
hixyx |
|
text |
Remove the longest string containing only characters from characters (a space by default) from the end of string |
rtrim('testxxzx', 'xyz') |
test |
|
text |
Split string on delimiter and return the given field (counting from one) |
split_part('abc~@~def~@~ghi', '~@~', 2) |
def |
|
int |
Location of specified substring (same as position( , but note the reversed argument order) |
strpos('high', 'ig') |
2 |
|
text |
Extract substring (same as substring( ) |
substr('alphabet', 3, 2) |
ph |
|
bool |
Returns true if string starts with prefix . |
starts_with('alphabet', 'alph') |
t |
|
text |
Convert string to ASCII from another encoding (only supports conversion from LATIN1 , LATIN2 , LATIN9 , and WIN1250 encodings) |
to_ascii('Karel') |
Karel |
|
text |
Convert number to its equivalent hexadecimal representation |
to_hex(2147483647) |
7fffffff |
|
text |
Any character in string that matches a character in the from set is replaced by the corresponding character in the to set. If from is longer than to , occurrences of the extra characters in from are removed. |
translate('12345', '143', 'ax') |
a2x5 |
- Use the
like
Operator toSELECT
if String Contains a Substring Match in PostgreSQL - Use the
position()
Function toSELECT
if String Contains a Substring Match in PostgreSQL - Use the
similar to
Regular Expression toSELECT
if String Contains a Substring Match in PostgreSQL - Use the
substring()
Function toSELECT
if String Contains a Substring Match in PostgreSQL - Use the
Posix
Regular Operators toSELECT
if String Contains a Substring Match in PostgreSQL
Today, we will learn how to find a value in the PostgreSQL string and select it if it matches a specific condition.
Suppose you have a string carabc
and want to see if it contains the value car
or not. Thus, you would try to use a function that would tell you if such a substring exists in a string or not.
PostgreSQL provides three methods: like
, similar to
, and the Posix
operator. We will discuss these in detail.
Use the like
Operator to SELECT
if String Contains a Substring Match in PostgreSQL
The like
expression will return True if the string contains the specific sub-string; else, False. First, we create a table and insert some values into it:
create table strings(
str TEXT PRIMARY KEY
);
insert into strings values ('abcd'), ('abgh'), ('carabc');
Now, let’s select the abc
values from this table.
select * from strings where str like 'abc';
When we run this, nothing happens. An empty table is displayed.
Why didn’t it return abcd
and carabc
from the table? Take a look at the statement below and try to see what happens:
select * from strings where str like 'abcd';
Output:
Thus, now you see that calling like
tends to return results that are the exact copy of the string we are trying to match.
To match it to a substring, you need to add a %
to the substring you are trying to match.
The following provides a better example of how to do that:
select * from strings where str like '%abc%';
Output:
The %x
tells the query that we need to find the x
in the string after some other values are located behind this x
. If we wanted to find carabc
, we would say %abc
, and it would return carabc
as car
is appended before the abc
.
But if we wanted to return abcd
, we would use x%
as d
is appended after the x
. Try this out yourself and see the results to get a better understanding!
Because we wanted to check if a string contains abc
that could be either at the left, right or the middle, we used %abc%
to return all such strings.
The like
operator can also be written as ~~
and if not like
, then use !~~
.
Another expression is known as Ilike
helps match strings without any case sensitivity. Meaning that if we use the statement:
select * from strings where str Ilike '%Abc%';
It would return the previous results even though abc
is not equal to Abc
as the a
is capitalized in the substring.
Like the like
operator, you can write Ilike
as ~~*
and not Ilike
as !~~*
.
Suppose you want to run a query to check the records in a table that match a particular string. If you want to compare abcde
to the records in the table and return the row to which it matches, you can use:
select * from strings where 'abcd' like '%' || str || '%'
This would append the rows inside the str
column to the syntax of %_str_%
, and then we can compare each value inside and see which ones match.
Output:
To match a sub-string with some rogue character such as escape
, %
, or backslash /,
, you can use,
select * from strings where 'abcd' like '%/abc'
This would return nothing as the function fails to eliminate the value %
. This takes us to our next point, using the position
function.
Use the position()
Function to SELECT
if String Contains a Substring Match in PostgreSQL
The position()
function is a better alternative for checking if a substring exists within a string. This is defined under the operations for the string in the PostgreSQL documentation.
It returns the index of the found substring inside the main string. So if we were to find car
in carabc
, it would return 1.
In this way, we can see if any sub-string is present inside a string or not by checking the value of the returned index. If it is greater than 0, the sub-string exists; else, it doesn’t.
select * from strings where position(str in 'abcde') > 0
The above statement would return two values, a
and abcd
, as they both are present in abcde
. You can manipulate this as needed.
Also, if you have any other character inside your sub-string such as %
, it would skip that in the check and return accurate results, making it far better than the like
expression.
select * from strings where position(str in '% abcde') > 0
Running the above will give the same results.
A possible replacement for the position
expression can be strpos
, which similarly works.
Use the similar to
Regular Expression to SELECT
if String Contains a Substring Match in PostgreSQL
The only difference between like
and similar to
is that the latter is a standard SQL definition used in various other DMBS.
select * from strings where position(str in 'abcde') > 0
This will return the same result as the like
expression.
To use alternative matches, you can use:
select * from strings where str similar to '%(abc|a)%'
This will return all strings matching abc
or a
. When we run this query, we are returned with all strings in the table as each contains an a
.
In case you want to disable any meta-characters in your matching substring, you can use a backslash to disable what we tend to call rogue characters inside a string.
Use the substring()
Function to SELECT
if String Contains a Substring Match in PostgreSQL
Another manipulation can be done to the substring()
function as follows:
select * from strings where str ~~ substring(str similar '%abc%' escape '#')
The substring()
returns the strings similar to abc
in our case or contains abc
. We then match the returned results to the str
using the ~~
operator, short for like
, and if they match, we select the results from the table.
This straightforward function even helps in dividing the string into separate parts can be seen in the syntax provided:
substring(string similar pattern escape escape-character)
or
substring(string from a pattern for escape-character)
or
substring(string, pattern, escape-character)
The escape-character
tends to divide our string to be matched into different parts if it contains the escape-character
at various points.
So if we run the statement:
select * from strings where str ~~ substring(str similar '#"abcd#"%' escape '#')
The '#"abcd#"%'
will be divided into abcd
, enclosed between two #
characters. Thus, we can also find the matching string abcd
.
Use the Posix
Regular Operators to SELECT
if String Contains a Substring Match in PostgreSQL
The above has been taken from a table described in the PostgreSQL documentation for Posix
operators that perform the same match functions.
You can use the statement as follows to see if a string contains a sub-string or not:
select * from strings where str ~ 'abc'
This will return the values carabc
and abcd
.
You can even use the regexp_match(string, pattern [, flags])
, which returns a null if no match is found. If a match is found, it will return an array with all the substrings matching a pattern.
To understand this, look at the following query:
select regexp_match('abdfabc', 'abd')
Output:
Using the another query,
select regexp_match('abdfabc', 'abf')
Output:
Now, you see how this expression finds a pattern and returns it. You can use this expression in a function, call the function for the select
operation and then return all matching strings.