RCSB PDB Help

Sequence Motif Search

Introduction

How Sequence Motif Search Works

Documentation

● Search Options

○ Simple Mode

○ PROSITE Mode

○ Regex Mode

● Sequence Type

Search Results

Examples

Extended (advanced) information

● PROSITE Mode details

○ Terminology

○ Nonstandard but allowed in RCSB

○ RCSB-specific rules

○ Formal grammar

● Regex Mode details

○ Supported and non-supported constructs

○ Queries that will be rejected

○ Tips to simplify queries

Introduction

Sequence motifs are short, conserved segments found in protein or nucleic acid sequences. They appear across many proteins or genes and are thought to play specific functional roles. Identifying a specific sequence motif within a protein or nucleic acid can suggest that the molecule carries the function associated with that motif. In this way, sequence motifs can help predict functional properties.

In some motifs, every position in the sequence is conserved and necessary for the function. In others, only certain positions are conserved, and those specific residues or bases are critical for the motif’s functional activity.

RCSB.org provides a Sequence Motif search tool to search for sequence motifs within protein and nucleic acid sequences. Sequence motif searches differ from similarity-based sequence searches in two important ways:

Motifs are short. Because the defining sequence is very small, traditional similarity searches are often ineffective.
Motifs may include variable or unconserved positions. Some parts of a motif can vary or be non-conserved, so the search must allow for alternative residues and specify the conserved positions that may be non-contiguous within the sequence.

How Sequence Motif Search Works

The Sequence Motif search allows you to query sequences in the PDB archive (and, optionally, available Computed Structure Models, or CSMs) using a motif expression. Motifs can be defined using one of three supported formats:

Simple Mode
PROSITE Mode
Regex Mode

Each format lets you describe conserved positions, allowed residue variations, and flexible or unconstrained regions within a motif. The search engine scans all selected sequences and returns locations where the motif pattern matches the sequence according to the rules of the chosen syntax.

Documentation

You can access the Sequence Motif search by opening Advanced Search and clicking on (+) Sequence Motif from the list of available search tools, or go directly to the search using this link: Sequence Motif Search.

Search Options

Simple Mode

Input a sequence of one or more one-letter codes. Ambiguous nucleotide codes are supported, and the wildcard symbol (X) can be used to represent any amino acid or nucleotide. Use < and > to match the N- and C-termini, respectively.

Examples

XPPXP (protein): SH3 domains (any → proline → proline → any → proline)
YYY (DNA): 3× cytosine/thymine
<SSS: any sequence that starts with 3× serine
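
As a rough illustration of what the protein examples above mean, the sketch below translates a Simple Mode motif into a Python regular expression and scans a toy sequence. The helper name simple_to_regex and the sequence are hypothetical, nucleotide ambiguity codes are not handled, and this is not the RCSB implementation.

```python
import re

def simple_to_regex(motif: str) -> str:
    """Translate a protein Simple Mode motif into a Python regex (illustrative only)."""
    motif = motif.upper()              # queries are case-insensitive
    parts = []
    for i, ch in enumerate(motif):
        if ch == "<" and i == 0:
            parts.append("^")          # N-terminus marker
        elif ch == ">" and i == len(motif) - 1:
            parts.append("$")          # C-terminus marker
        elif ch == "X":
            parts.append(".")          # wildcard: any single residue
        elif ch.isalpha():
            parts.append(ch)           # literal one-letter code
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return "".join(parts)

toy_sequence = "MSSSAPPLPKR"           # made-up sequence, not a PDB entry
pattern = simple_to_regex("xppxp")
# Report 1-based start positions together with the matched text.
hits = [(m.start() + 1, m.group()) for m in re.finditer(pattern, toy_sequence)]
print(pattern, hits)
```

Here the motif xppxp becomes the regex .PP.P, and <SSS would become ^SSS, which does not match the toy sequence because it starts with M.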

PROSITE Mode

Complex queries can be expressed using PROSITE patterns. A PROSITE pattern is composed of one or more atoms, optionally separated by hyphens (-). The pattern may optionally end with a period (.).

X can be used to stand in for any amino acid or nucleotide type, and ambiguous nucleotide codes (e.g., B) are supported.

Note that this syntax is a superset of classic PROSITE: the search supports some patterns that may not be accepted by other tools, such as EXPASY ScanProsite. For complete information, refer to PROSITE Mode details under Extended (advanced) information.

Atom types

Each atom is one of seven types:

Literal: A one-letter code (e.g., A). This matches exactly 1 residue.
Wildcard (X): The letter X. This matches exactly 1 residue of any type.
Any-of ([]): One or more codes enclosed in [], such as [ATC]. This matches exactly 1 residue whose code is listed.
None-of ({}): One or more codes enclosed in {}, such as {ATC}. This matches exactly 1 residue whose code is not listed.
N-terminus (<): An N-terminal marker, <, indicating the start of the sequence. If included, this must be the first element.
C-terminus (>): A C-terminal marker, >, indicating the end of the sequence. If included, this must be the last element.
Any-of / C-terminus (e.g., [A>]): A variable C-terminal element, such as [>AC], [A>C], or [AC>] (equivalently). This matches either the end of the sequence or exactly 1 residue among those listed (but not both).

Quantifiers

Each literal, wildcard, any-of, and none-of element may be followed by a quantifier to match the preceding element some number of times. The quantifier is enclosed in () and can be Exact, Minimum, or Range:

Exact: A(2) matches exactly AA.
Minimum: A(2,) matches AA, AAA, and so on.
Range: A(2,4) matches AA, AAA, and AAAA.
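
To make the atom and quantifier rules above concrete, here is a minimal sketch that rewrites a PROSITE-style pattern as a Python regular expression. The function name prosite_to_regex is hypothetical, validation is deliberately thin, nucleotide ambiguity codes are not expanded, and the real service may parse patterns differently.

```python
import re

# One term = an atom plus an optional quantifier: (n), (m,), or (m,n).
_TERM = re.compile(
    r"(?P<atom><|>|[A-Z]|\[[A-Z>]+\]|\{[A-Z]+\})"
    r"(?:\((?P<min>\d+)(?P<comma>,)?(?P<max>\d+)?\))?"
)

def prosite_to_regex(pattern: str) -> str:
    """Rewrite a PROSITE-style pattern as a Python regex (illustrative only)."""
    text = pattern.upper().replace("-", "").replace(" ", "")
    if text.endswith("."):
        text = text[:-1]                      # optional terminating period
    out, pos = [], 0
    while pos < len(text):
        m = _TERM.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at offset {pos}")
        atom = m.group("atom")
        if atom == "<":
            piece = "^"                       # N-terminus
        elif atom == ">":
            piece = "$"                       # C-terminus
        elif atom == "X":
            piece = "."                       # wildcard
        elif atom.startswith("{"):
            piece = "[^" + atom[1:-1] + "]"   # none-of
        elif atom.startswith("[") and ">" in atom:
            residues = atom[1:-1].replace(">", "")
            if not residues:
                raise ValueError("[>] needs at least one residue code")
            piece = "(?:[" + residues + "]|$)"  # any-of / C-terminus
        else:
            piece = atom                      # literal or plain any-of
        if m.group("min") is not None:
            if m.group("comma"):
                piece += "{" + m.group("min") + "," + (m.group("max") or "") + "}"
            else:
                piece += "{" + m.group("min") + "}"
        out.append(piece)
        pos = m.end()
    return "".join(out)

print(prosite_to_regex("A(2)"), prosite_to_regex("A(2,)"), prosite_to_regex("A(2,4)"))
```

With this sketch, the three quantifier forms map onto the regex quantifiers {2}, {2,}, and {2,4}, which is one way to read the Exact, Minimum, and Range examples above.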

Regex Mode

Regular expressions (regex) are also supported. This option is more powerful than PROSITE and may be familiar to programmers. Note that the service may refuse to process some queries.

A regex pattern contains one or more atoms, each with an optional quantifier. | denotes a logical or, and () collects atoms into groups.

Ambiguous nucleotide codes are not supported, nor is X. Use . instead of X, and use [CGT] (for DNA) or [CGU] (for RNA) instead of B.

Examples

W.{7}G.{20}L matches tryptophan → 7×any → glycine → 20×any → leucine.
C.{2,4}C.{12}H.{3,5}H matches the zinc finger motif that binds Zn in a DNA-binding domain.
^H+$ matches N-terminus → 1+ histidine → C-terminus.
[AG].{4}GK[ST] matches the Walker (P loop) motif that binds ATP or GTP.
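
The patterns above can be tried locally before submitting a query. The sketch below uses Python's built-in re module as a stand-in; it is not the engine used by the service, so accepted syntax and performance limits may differ, and the short sequences are illustrative only.

```python
import re

walker = re.compile(r"[AG].{4}GK[ST]")      # Walker (P loop) motif from the list above
toy_sequence = "MTEYKLVVVGAGGVGKSALTIQ"     # short illustrative sequence

# Collect 1-based, inclusive residue ranges for every match.
matches = [(m.start() + 1, m.end(), m.group()) for m in walker.finditer(toy_sequence)]
print(matches)

# ^H+$ only matches sequences made up entirely of histidine.
all_his = re.search(r"^H+$", "HHHHHH") is not None
not_all_his = re.search(r"^H+$", "MHHHHHH") is not None
print(all_his, not_all_his)
```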

Sequence Type

In all three Modes, amino acid residue (or nucleotide) types are specified using one-letter codes, which are defined by IUPAC. For example: for amino acid sequences, R is arginine; for RNA sequences, U is uracil. Nucleotide sequences also support so-called ambiguous codes; for example, S is either cytosine or guanine. Only Simple and PROSITE Modes support ambiguous codes. Below is a full reference of one-letter codes.

Queries are case-insensitive for all three Modes: ATGC and atgc are identical. (This also applies to X and x in Simple and PROSITE Modes.)

➤ Tables of one-letter codes

Nucleotide Codes
code	meaning
`A`	adenine
`C`	cytosine
`G`	guanine
`T` ¹	thymine
`U` ¹	uracil
`B` ²	`C`/`G`/`T`/`U`
`D` ²	`A`/`G`/`T`/`U`
`H` ²	`A`/`C`/`T`/`U`
`K` ²	`G`/`T`
`M` ²	`A`/`C`
`R` ²	`A`/`G`
`S` ²	`C`/`G`
`V` ²	`A`/`C`/`G`
`W` ²	`A`/`T`/`U`
`Y` ²	`C`/`T`/`U`
`N` ²	any base

¹ T is restricted to DNA; U is restricted to RNA

² Termed ambiguous; only supported in Simple and PROSITE Modes.
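
Because Regex Mode does not accept ambiguous codes, a motif written with them has to be expanded into character classes first. The sketch below does this for DNA using the table above; the names AMBIGUOUS_DNA and expand_dna_motif are hypothetical, and RNA would need U in place of T.

```python
# Ambiguous DNA codes and the bases they stand for (taken from the table above).
AMBIGUOUS_DNA = {
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT", "M": "AC", "R": "AG",
    "S": "CG", "V": "ACG", "W": "AT", "Y": "CT", "N": "ACGT",
}

def expand_dna_motif(motif: str) -> str:
    """Rewrite a DNA motif with ambiguous codes as a regex using character classes."""
    out = []
    for ch in motif.upper():
        if ch in AMBIGUOUS_DNA:
            out.append("[" + AMBIGUOUS_DNA[ch] + "]")
        elif ch in "ACGT":
            out.append(ch)
        else:
            raise ValueError(f"{ch!r} is not a DNA one-letter code")
    return "".join(out)

print(expand_dna_motif("YYY"))
print(expand_dna_motif("TATAWAW"))
```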

Amino Acid Codes
code	meaning
`A`	alanine
`C`	cysteine
`D`	aspartic acid
`E`	glutamic acid
`F`	phenylalanine
`G`	glycine
`H`	histidine
`I`	isoleucine
`K`	lysine
`L`	leucine
`M`	methionine
`N`	asparagine
`P`	proline
`Q`	glutamine
`R`	arginine
`S`	serine
`T`	threonine
`V`	valine
`W`	tryptophan
`Y`	tyrosine

Search Results

The search results show the numbering of the matched sequence region, corresponding to the numbering in the PDBx/mmCIF file.

For each match, click the Explore in 3D button to view the structure interactively. The matched region can be examined in detail within the 3D viewer.

Figure: Part of the query results page for a sequence motif search, with the regions of the polymer entity that match the query sequence motif shown in a red box. Clicking on the 3D view marked with red arrows opens the structure in Mol*.

Examples

Query for SH3 domains – use the Simple Mode query XPPXP, where X is any residue and P is proline.
Query for a specific pattern of sequence – use the PROSITE Mode query [AC]-x-V-x(4)-{ED}, which translates to [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}.
Query for the Walker (P loop) motif that binds ATP or GTP – use the Regex Mode query [AG]....GK[ST], where A or G is followed by 4 variable residues, then G and K, and finally S or T.
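
For readers who want to sanity-check the example queries above, the following sketch expresses two of them as Python regular expressions and runs them on made-up sequences. The hand translation of the PROSITE query is an assumption based on the atom rules in this document, and Python's re module is only a stand-in for the service.

```python
import re

# Hand translation of the PROSITE query [AC]-x-V-x(4)-{ED} (an assumption, see above).
prosite_by_hand = re.compile(r"[AC].V.{4}[^ED]")

# The Walker motif written with four dots and with a counted quantifier.
walker_dots = re.compile(r"[AG]....GK[ST]")
walker_counted = re.compile(r"[AG].{4}GK[ST]")

hit = prosite_by_hand.search("GSAKVLLLLKE")    # last position is K, so it matches
miss = prosite_by_hand.search("GSAKVLLLLEE")   # last position is E, which is excluded
same = walker_dots.findall("GAGGVGKS") == walker_counted.findall("GAGGVGKS")

print(hit.group() if hit else None, miss, same)
```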

Extended (advanced) information

PROSITE Mode details

Terminology

This documentation uses the following definitions.

Atom: A PROSITE syntax item to match against 1 residue or nucleotide: a literal (e.g., A), gap (x), any-of (e.g., [CG]), none-of (e.g., {AT}), N-terminus (<), C-terminus (>), or any-of / C-terminus (e.g., [>AT])
Term: An atom with its quantifier, if any

Nonstandard but allowed in RCSB

RCSB PROSITE is more forgiving than standard PROSITE; it differs in the following ways:

Case is ignored. A and a are the same, as are X and x.
Range quantifiers ((x,y)) may be used for all atoms, not just gaps (x). For example, A(1,4) matches between 1 and 4 alanines. In contrast, standard PROSITE only permits, for example, x(1,4).
Hyphens (-) may be omitted, even with one-letter nucleotide codes, such as B. Hyphens are ignored as long as they are used in valid positions.

RCSB-specific rules

Some parts of the PROSITE specification can be interpreted in multiple ways. RCSB PROSITE has decided on these rules:

Spaces (characters with Unicode category Zs) are ignored when in reasonable positions. For example, A T(1, 3) is allowed.
The query must contain at least 1 atom. (<, >, <>, and the empty string are forbidden.)
Any-of matches ([]) require at least 1 character.
None-of matches ({}) cannot include every one-letter code. {ATGC} is invalid for DNA sequences (and could never match a sequence).
An exact quantifier (n) is allowed if and only if n ≥ 1.
A range quantifier (m, n) is allowed if and only if n ≥ m and m > 0.
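
The numeric and none-of rules above can be written as small checks. This is a minimal sketch with hypothetical function names; it encodes only the rules as stated in this list and does not reproduce the service's actual validation.

```python
def exact_ok(n: int) -> bool:
    """An exact quantifier (n) is allowed if and only if n >= 1."""
    return n >= 1

def range_ok(m: int, n: int) -> bool:
    """A range quantifier (m, n) is allowed if and only if n >= m and m > 0."""
    return m > 0 and n >= m

def none_of_ok(codes: str, alphabet: str) -> bool:
    """A none-of atom needs at least 1 code and must not list the whole alphabet."""
    listed = set(codes.upper())
    return len(listed) > 0 and listed < set(alphabet.upper())

print(exact_ok(3), range_ok(1, 4), range_ok(0, 4))
print(none_of_ok("ATGC", "ACGT"), none_of_ok("AT", "ACGT"))
```

The proper-subset comparison in none_of_ok rejects {ATGC} for DNA, matching the example in the list.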

Formal grammar

This grammar uses RFC 5234 ABNF.

query            = start *(['-'] term) ['-' end] ['.']
                 ;  ^          ^           ^
                 ; required    0 or more   optional

start            = term / (nterm non-gap-term) / (nterm gap)
                 ; EITHER: A term (1+ elements) without an N-term
                 ; OR: N-term with non-gap term (1+ elements)
                 ; OR: single gap (1 element; non-repeated)
end              = term / (non-gap-term cterm) / (gap cterm)

term             = element [count / range]
element          = code / any-of / none-of / gap
non-gap-term     = non-gap-element [count / range]
non-gap-element  = code / any-of / none-of

aa               = "a one-letter code"
gap              = 'x'
                 ; Matches any single residue
any-of           = '[' 1*aa ']'
                 ; Matches any single residue included in []
                 ; For example, [ACE] matches A, C, or E
none-of          = '{' 1*aa '}'
                 ; Matches any single residue NOT included in {}

count            = '(' natural ')'
                 ; An exact number of times to repeat the preceding element
                 ; For example, [AW](3) is equivalent to [AW][AW][AW]
range            = '(' number ',' natural ')'
                 ; A min and max number of times to repeat the preceding element
                 ; For example, A(1,3) matches A, AA, and AAA
                 ; Note: min must be less than max

nterm            = '<'
                 ; Matches the sequence start (N-terminus)
cterm            = cterm-literal / cterm-or-any-of
cterm-literal    = '>'
                 ; Matches the sequence end (C-terminus)
cterm-or-any-of  = '[' ((1*aa '>' *aa) / (*aa '>' 1*aa)) ']'
                 ; Matches either the sequence end (C-terminus),
                 ; OR an aa included in [] / an aa not included in {}
                 ; For example, [A>] matches either the sequence end or A.
                 ; Valid examples: [A>], [>A], [A>C], [ACDE>]
                 ; Invalid examples: [A>>], [A>C>], [>], []

number           = 1*DIGIT
natural          = NONZERO *DIGIT
NONZERO          = %x31-39

Regex Mode details

Supported and non-supported constructs

The query syntax is IEEE POSIX Extended Regular Expressions. Nearly all of the standard is supported, including advanced constructs like lookarounds and backreferences.

However, a few things are not supported. Most notably, characters that are not one-letter codes are not allowed in literals or character classes. For example, Z and [A-Z] would result in an error. Named character classes, such as \s and \p{Alpha}, are also not supported.

Queries that will be rejected

In addition, the service will not allow expressions that could seriously degrade performance. Specifically, these are expressions with non-polynomial worst-case runtime or space complexity. The service will reject:

patterns that use non-possessive, inexact quantifiers on groups that can match a variable number of characters n, where n > 1;
patterns that use quantifiers on groups in certain other ways;
patterns that use lazy, inexact quantifiers excessively or in certain ways;
patterns that use lookarounds excessively or in certain ways;
patterns with non-polynomial worst-case runtime or memory requirements; and
patterns with excessive total complexity.

API users should also note rare failure types: service-wide limits (HTTP 503), excessive query duration (504), and excessive querying (429).
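
One way a client might react to these status codes is sketched below. Everything here is an assumption for illustration: the function name run_with_retry, the choice to retry only 429 and 503, and the backoff values are not taken from RCSB documentation, and send stands for whatever function actually submits the query and returns a status code and body.

```python
import time
from typing import Callable, Tuple

# Assumed policy: back off and retry on rate or capacity limits (429, 503).
# A 504 means the query itself ran too long, so it is returned to the caller,
# who would typically simplify the pattern rather than resubmit it.
RETRY_LATER = {429, 503}

def run_with_retry(
    send: Callable[[], Tuple[int, str]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() up to `attempts` times, doubling the wait after each retryable failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status not in RETRY_LATER:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body

# Demonstration with a fake sender, so no network access is needed.
fake_responses = iter([(429, ""), (503, ""), (200, "ok")])
waits = []
final = run_with_retry(lambda: next(fake_responses), sleep=waits.append)
print(final, waits)
```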

Tips to simplify queries

Follow these guidelines to avoid a query being rejected.

Do not use lazy quantifiers.
Avoid lookarounds.
When applying quantifiers to groups, make sure the group is simple and only use either greedy ? or (preferably) a possessive quantifier.
Use possessive quantifiers where possible.
Do not begin or end a pattern with .*, ^.*, .*$, or similar.

Note that you can replace most uses of a lazy quantifier with one or two greedy quantifiers.
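
As an example of that replacement, a lazy gap that stops at the first occurrence of a single residue can be rewritten as a greedy negated character class. The sketch below checks the equivalence on a made-up sequence using Python's re module, which is only a stand-in for the service; the rewrite shown applies when the terminating element is a single residue.

```python
import re

toy_sequence = "MCAAHKKCHGH"           # made-up sequence

lazy = re.compile(r"C.*?H")            # lazy: C, then as little as possible, then H
rewritten = re.compile(r"C[^H]*H")     # greedy: C, then any run of non-H, then H
naive_greedy = re.compile(r"C.*H")     # plain greedy: runs on to the last H

print(lazy.findall(toy_sequence))
print(rewritten.findall(toy_sequence))
print(naive_greedy.findall(toy_sequence))
```

The first two patterns find the same matches, while simply dropping the ? changes the result, which is why the negated class is needed.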


Last updated: 12/9/2025