package html
Import Path
golang.org/x/net/html (on go.dev)
Dependency Relation
imports 9 packages, and is imported by 4 packages
Involved Source Files
const.go
doc.go
Package html implements an HTML5-compliant tokenizer and parser.
Tokenization is done by creating a Tokenizer for an io.Reader r. It is the
caller's responsibility to ensure that r provides UTF-8 encoded HTML.
    z := html.NewTokenizer(r)
Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(),
which parses the next token and returns its type, or an error:
    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            // ...
            return ...
        }
        // Process the current token.
    }
There are two APIs for retrieving the current token. The high-level API is to
call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs
allow optionally calling Raw after Next but before Token, Text, TagName, or
TagAttr. In EBNF notation, the valid call sequence per token is:
    Next {Raw} [ Token | Text | TagName {TagAttr} ]
Token returns an independent data structure that completely describes a token.
Entities (such as "&lt;") are unescaped, tag names and attribute keys are
lower-cased, and attributes are collected into a []Attribute. For example:
    for {
        if z.Next() == html.ErrorToken {
            // Returning io.EOF indicates success.
            return z.Err()
        }
        emitToken(z.Token())
    }
The low-level API performs fewer allocations and copies, but the contents of
the []byte values returned by Text, TagName and TagAttr may change on the next
call to Next. For example, to extract an HTML page's anchor text:
    depth := 0
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            return z.Err()
        case html.TextToken:
            if depth > 0 {
                // emitBytes should copy the []byte it receives,
                // if it doesn't process it immediately.
                emitBytes(z.Text())
            }
        case html.StartTagToken, html.EndTagToken:
            tn, _ := z.TagName()
            if len(tn) == 1 && tn[0] == 'a' {
                if tt == html.StartTagToken {
                    depth++
                } else {
                    depth--
                }
            }
        }
    }
Parsing is done by calling Parse with an io.Reader, which returns the root of
the parse tree (the document element) as a *Node. It is the caller's
responsibility to ensure that the Reader provides UTF-8 encoded HTML. For
example, to process each anchor node in depth-first order:
    doc, err := html.Parse(r)
    if err != nil {
        // ...
    }
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            // Do something with n...
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)
The relevant specifications include:
https://html.spec.whatwg.org/multipage/syntax.html and
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
doctype.go
entity.go
escape.go
foreign.go
node.go
parse.go
render.go
token.go
Code Examples
package main

import (
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
    doc, err := html.Parse(strings.NewReader(s))
    if err != nil {
        log.Fatal(err)
    }
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    fmt.Println(a.Val)
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)
}
Package-Level Type Names (total 14, in which 7 are exported)
An Attribute is an attribute namespace-key-value triple. Namespace is
non-empty for foreign attributes like xlink, Key is alphabetic (and hence
does not contain escapable characters like '&', '<' or '>'), and Val is
unescaped (it looks like "a<b" rather than "a&lt;b").
Namespace is only used by the parser, not the tokenizer.
Key string
Namespace string
Val string
func github.com/microcosm-cc/bluemonday.(*Policy).sanitizeAttrs(elementName string, attrs []Attribute, aps map[string]bluemonday.attrPolicy) []Attribute
func adjustAttributeNames(aa []Attribute, nameMap map[string]string)
func adjustForeignAttributes(aa []Attribute)
A Node consists of a NodeType and some Data (tag name for element nodes,
content for text) and is part of a tree of Nodes. Element nodes may also
have a Namespace and contain a slice of Attributes. Data is unescaped, so
that it looks like "a<b" rather than "a&lt;b". For element nodes, DataAtom
is the atom for Data, or zero if Data is not a known tag name.
An empty Namespace implies a "http://www.w3.org/1999/xhtml" namespace.
Similarly, "math" is short for "http://www.w3.org/1998/Math/MathML", and
"svg" is short for "http://www.w3.org/2000/svg".
Attr []Attribute
Data string
DataAtom atom.Atom
FirstChild *Node
LastChild *Node
Namespace string
NextSibling *Node
Parent *Node
PrevSibling *Node
Type NodeType
AppendChild adds a node c as a child of n.
It will panic if c already has a parent or siblings.
InsertBefore inserts newChild as a child of n, immediately before oldChild
in the sequence of n's children. oldChild may be nil, in which case newChild
is appended to the end of n's children.
It will panic if newChild already has a parent or siblings.
RemoveChild removes a node c that is a child of n. Afterwards, c will have
no parent and no siblings.
It will panic if c's parent is not n.
clone returns a new node with the same type, data and attributes.
The clone has no parent, no siblings and no children.
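For illustration, a minimal sketch (not from the package documentation) that
builds and edits a small tree with these methods; the elements and text are
arbitrary:

    package main

    import (
        "os"

        "golang.org/x/net/html"
        "golang.org/x/net/html/atom"
    )

    func main() {
        // Build <ul><li>one</li></ul> by hand.
        ul := &html.Node{Type: html.ElementNode, DataAtom: atom.Ul, Data: "ul"}
        li := &html.Node{Type: html.ElementNode, DataAtom: atom.Li, Data: "li"}
        li.AppendChild(&html.Node{Type: html.TextNode, Data: "one"})
        ul.AppendChild(li)

        // Insert a second item before the first, then remove the original.
        li2 := &html.Node{Type: html.ElementNode, DataAtom: atom.Li, Data: "li"}
        li2.AppendChild(&html.Node{Type: html.TextNode, Data: "zero"})
        ul.InsertBefore(li2, li)
        ul.RemoveChild(li)

        // Prints <ul><li>zero</li></ul>.
        html.Render(os.Stdout, ul)
    }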
func Parse(r io.Reader) (*Node, error)
func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)
func github.com/andybalholm/cascadia.Filter(nodes []*Node, m cascadia.Matcher) (result []*Node)
func github.com/andybalholm/cascadia.Query(n *Node, m cascadia.Matcher) *Node
func github.com/andybalholm/cascadia.QueryAll(n *Node, m cascadia.Matcher) []*Node
func github.com/andybalholm/cascadia.Selector.Filter(nodes []*Node) (result []*Node)
func github.com/andybalholm/cascadia.Selector.MatchAll(n *Node) []*Node
func github.com/andybalholm/cascadia.Selector.MatchFirst(n *Node) *Node
func parseDoctype(s string) (n *Node, quirks bool)
func (*Node).clone() *Node
func golang.org/x/pkgsite/internal/testing/htmlcheck.allMatching(n *Node, sel cascadia.Sel) []*Node
func github.com/andybalholm/cascadia.queryInto(n *Node, m cascadia.Matcher, storage []*Node) []*Node
func github.com/andybalholm/cascadia.Selector.matchAllInto(n *Node, storage []*Node) []*Node
func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
func Render(w io.Writer, n *Node) error
func (*Node).AppendChild(c *Node)
func (*Node).InsertBefore(newChild, oldChild *Node)
func (*Node).RemoveChild(c *Node)
func github.com/andybalholm/cascadia.Filter(nodes []*Node, m cascadia.Matcher) (result []*Node)
func github.com/andybalholm/cascadia.Query(n *Node, m cascadia.Matcher) *Node
func github.com/andybalholm/cascadia.QueryAll(n *Node, m cascadia.Matcher) []*Node
func github.com/andybalholm/cascadia.Matcher.Match(n *Node) bool
func github.com/andybalholm/cascadia.Sel.Match(n *Node) bool
func github.com/andybalholm/cascadia.Selector.Filter(nodes []*Node) (result []*Node)
func github.com/andybalholm/cascadia.Selector.Match(n *Node) bool
func github.com/andybalholm/cascadia.Selector.MatchAll(n *Node) []*Node
func github.com/andybalholm/cascadia.Selector.MatchFirst(n *Node) *Node
func github.com/andybalholm/cascadia.SelectorGroup.Match(n *Node) bool
func copyAttributes(dst *Node, src Token)
func htmlIntegrationPoint(n *Node) bool
func isSpecialElement(element *Node) bool
func mathMLTextIntegrationPoint(n *Node) bool
func render(w writer, n *Node) error
func render1(w writer, n *Node) error
func reparentChildren(dst, src *Node)
func golang.org/x/pkgsite/internal/frontend.walkHTML(n *Node, info *source.Info, readme *internal.Readme) bool
func golang.org/x/pkgsite/internal/testing/htmlcheck.allMatching(n *Node, sel cascadia.Sel) []*Node
func golang.org/x/pkgsite/internal/testing/htmlcheck.check(n *Node, Checkers []htmlcheck.Checker) error
func golang.org/x/pkgsite/internal/testing/htmlcheck.dump(n *Node, depth int)
func golang.org/x/pkgsite/internal/testing/htmlcheck.nodeText(n *Node, b *strings.Builder)
func github.com/andybalholm/cascadia.attributeDashMatch(key, val string, n *Node) bool
func github.com/andybalholm/cascadia.attributeNotEqualMatch(key, val string, n *Node) bool
func github.com/andybalholm/cascadia.attributePrefixMatch(key, val string, n *Node) bool
func github.com/andybalholm/cascadia.attributeRegexMatch(key string, rx *regexp.Regexp, n *Node) bool
func github.com/andybalholm/cascadia.attributeSubstringMatch(key, val string, n *Node) bool
func github.com/andybalholm/cascadia.attributeSuffixMatch(key, val string, n *Node) bool
func github.com/andybalholm/cascadia.childMatch(a, d cascadia.Matcher, n *Node) bool
func github.com/andybalholm/cascadia.descendantMatch(a, d cascadia.Matcher, n *Node) bool
func github.com/andybalholm/cascadia.hasChildMatch(n *Node, a cascadia.Matcher) bool
func github.com/andybalholm/cascadia.hasDescendantMatch(n *Node, a cascadia.Matcher) bool
func github.com/andybalholm/cascadia.matchAttribute(n *Node, key string, f func(string) bool) bool
func github.com/andybalholm/cascadia.nodeOwnText(n *Node) string
func github.com/andybalholm/cascadia.nodeText(n *Node) string
func github.com/andybalholm/cascadia.nthChildMatch(a, b int, last, ofType bool, n *Node) bool
func github.com/andybalholm/cascadia.queryInto(n *Node, m cascadia.Matcher, storage []*Node) []*Node
func github.com/andybalholm/cascadia.siblingMatch(s1, s2 cascadia.Matcher, adjacent bool, n *Node) bool
func github.com/andybalholm/cascadia.simpleNthChildMatch(b int, ofType bool, n *Node) bool
func github.com/andybalholm/cascadia.simpleNthLastChildMatch(b int, ofType bool, n *Node) bool
func github.com/andybalholm/cascadia.writeNodeText(n *Node, b *bytes.Buffer)
func github.com/andybalholm/cascadia.Selector.matchAllInto(n *Node, storage []*Node) []*Node
var scopeMarker
A NodeType is the type of a Node.
const CommentNode
const DoctypeNode
const DocumentNode
const ElementNode
const ErrorNode
const RawNode
const TextNode
const scopeMarkerNode
ParseOption configures a parser.
func ParseOptionEnableScripting(enable bool) ParseOption
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)
A Token consists of a TokenType and some Data (tag name for start and end
tags, content for text, comments and doctypes). A tag Token may also contain
a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b"
rather than "a&lt;b"). For tag Tokens, DataAtom is the atom for Data, or
zero if Data is not a known tag name.
Attr []Attribute
Data string
DataAtom atom.Atom
Type TokenType
String returns a string representation of the Token.
tagString returns a string representation of a tag Token's Data and Attr.
T : expvar.Var
T : fmt.Stringer
T : context.stringer
T : runtime.stringer
func (*Tokenizer).Token() Token
func copyAttributes(dst *Node, src Token)
A Tokenizer returns a stream of HTML Tokens.
allowCDATA is whether CDATA sections are allowed in the current context.
attr [][2]span
buf []byte
convertNUL is whether NUL bytes in the current token's data should
be converted into \ufffd replacement characters.
buf[data.start:data.end] holds the raw bytes of the current token's data:
a text token's text, a tag token's tag name, etc.
err is the first error encountered during tokenization. It is possible
for tt != Error && err != nil to hold: this means that Next returned a
valid token but the subsequent Next call will return an error token.
For example, if the HTML text input was just "plain", then the first
Next call would set z.err to io.EOF but return a TextToken, and all
subsequent Next calls would return an ErrorToken.
err is never reset. Once it becomes non-nil, it stays non-nil.
maxBuf limits the data buffered in buf. A value of 0 means unlimited.
nAttrReturned int
pendingAttr is the attribute key and value currently being tokenized.
When complete, pendingAttr is pushed onto attr. nAttrReturned is
incremented on each call to TagAttr.
r is the source of the HTML text.
buf[raw.start:raw.end] holds the raw bytes of the current token.
buf[raw.end:] is buffered input that will yield future tokens.
rawTag is the "script" in "</script>" that closes the next token. If
non-empty, the subsequent call to Next will return a raw or RCDATA text
token: one that treats "<p>" as text instead of an element.
rawTag's contents are lower-cased.
readErr is the error returned by the io.Reader r. It is separate from
err because it is valid for an io.Reader to return (n int, err1 error)
such that n > 0 && err1 != nil, and callers should always process the
n > 0 bytes before considering the error err1.
textIsRaw is whether the current text token's data is not escaped.
tt is the TokenType of the current token.
AllowCDATA sets whether or not the tokenizer recognizes <![CDATA[foo]]> as
the text "foo". The default value is false, which means to recognize it as
a bogus comment "<!-- [CDATA[foo]] -->" instead.
Strictly speaking, an HTML5 compliant tokenizer should allow CDATA if and
only if tokenizing foreign content, such as MathML and SVG. However,
tracking foreign-contentness is difficult to do purely in the tokenizer,
as opposed to the parser, due to HTML integration points: an <svg> element
can contain a <foreignObject> that is foreign-to-SVG but not foreign-to-HTML.
For strict compliance with the HTML5 tokenization algorithm, it is the
responsibility of the user of a tokenizer to call AllowCDATA as appropriate.
In practice, if using the tokenizer without caring whether MathML or SVG
CDATA is text or comments, such as tokenizing HTML to find all the anchor
text, it is acceptable to ignore this responsibility.
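For illustration, a minimal sketch (the input string is illustrative) of
enabling CDATA when the caller knows it is tokenizing foreign content:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        z := html.NewTokenizer(strings.NewReader(`<svg><![CDATA[a < b]]></svg>`))
        // The caller knows the CDATA section is inside <svg>, so it opts in.
        z.AllowCDATA(true)
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return
            }
            if tt == html.TextToken {
                fmt.Printf("text: %q\n", z.Text()) // text: "a < b"
            }
        }
    }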
Buffered returns a slice containing data buffered but not yet tokenized.
Err returns the error associated with the most recent ErrorToken token.
This is typically io.EOF, meaning the end of tokenization.
Next scans the next token and returns its type.
NextIsNotRawText instructs the tokenizer that the next token should not be
considered as 'raw text'. Some elements, such as script and title elements,
normally require the next token after the opening tag to be 'raw text' that
has no child elements. For example, tokenizing "<title>a<b>c</b>d</title>"
yields a start tag token for "<title>", a text token for "a<b>c</b>d", and
an end tag token for "</title>". There are no distinct start tag or end tag
tokens for the "<b>" and "</b>".
This tokenizer implementation will generally look for raw text at the right
times. Strictly speaking, an HTML5 compliant tokenizer should not look for
raw text if in foreign content: <title> generally needs raw text, but a
<title> inside an <svg> does not. Another example is that a <textarea>
generally needs raw text, but a <textarea> is not allowed as an immediate
child of a <select>; in normal parsing, a <textarea> implies </select>, but
one cannot close the implicit element when parsing a <select>'s InnerHTML.
Similarly to AllowCDATA, tracking the correct moment to override
raw-text-ness is difficult to do purely in the tokenizer, as opposed to the parser.
For strict compliance with the HTML5 tokenization algorithm, it is the
responsibility of the user of a tokenizer to call NextIsNotRawText as
appropriate. In practice, like AllowCDATA, it is acceptable to ignore this
responsibility for basic usage.
Note that this 'raw text' concept is different from the one offered by the
Tokenizer.Raw method.
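A sketch of one way a caller might track foreign content and apply this
method; the input and the depth-counting scheme are illustrative:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        z := html.NewTokenizer(strings.NewReader(`<svg><title>a<b>c</b></title></svg>`))
        svgDepth := 0 // crude tracking of whether we are inside <svg>
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return
            }
            if tt == html.StartTagToken || tt == html.EndTagToken {
                name, _ := z.TagName()
                switch {
                case string(name) == "svg" && tt == html.StartTagToken:
                    svgDepth++
                case string(name) == "svg" && tt == html.EndTagToken:
                    svgDepth--
                case string(name) == "title" && tt == html.StartTagToken && svgDepth > 0:
                    // An SVG <title> may contain child elements, so the
                    // following "a<b>c</b>" should not be raw text.
                    z.NextIsNotRawText()
                }
                fmt.Println(tt, string(name))
            }
        }
    }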
Raw returns the unmodified text of the current token. Calling Next, Token,
Text, TagName or TagAttr may change the contents of the returned slice.
The token stream's raw bytes partition the byte stream (up until an
ErrorToken). There are no overlaps or gaps between two consecutive tokens'
raw bytes. One implication is that the byte offset of the current token is
the sum of the lengths of all previous tokens' raw bytes.
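For example, a sketch that uses this property to report each token's byte
offset; the input is illustrative:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        z := html.NewTokenizer(strings.NewReader(`<p>Hi<br>there</p>`))
        offset := 0
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return
            }
            raw := z.Raw()
            fmt.Printf("offset %3d: %v %q\n", offset, tt, raw)
            offset += len(raw) // raw bytes partition the input, so offsets accumulate
        }
    }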
SetMaxBuf sets a limit on the amount of data buffered during tokenization.
A value of 0 means unlimited.
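A sketch of bounding memory use when tokenizing untrusted input; the 64 KiB
limit here is arbitrary:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        z := html.NewTokenizer(strings.NewReader(strings.Repeat("a", 1<<20)))
        z.SetMaxBuf(64 << 10)
        for {
            if z.Next() == html.ErrorToken {
                if z.Err() == html.ErrBufferExceeded {
                    fmt.Println("token exceeded the buffer limit")
                }
                return
            }
        }
    }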
TagAttr returns the lower-cased key and unescaped value of the next unparsed
attribute for the current tag token and whether there are more attributes.
The contents of the returned slices may change on the next call to Next.
TagName returns the lower-cased name of a tag token (the `img` out of
`<IMG SRC="foo">`) and whether the tag has attributes.
The contents of the returned slice may change on the next call to Next.
Text returns the unescaped text of a text, comment or doctype token. The
contents of the returned slice may change on the next call to Next.
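A sketch of walking a tag's attributes with the low-level API; the returned
slices are consumed immediately, before the next call to Next, and
string(name) copies the bytes it keeps:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        z := html.NewTokenizer(strings.NewReader(`<a HREF="/x" Title="t">x</a>`))
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return
            }
            if tt != html.StartTagToken {
                continue
            }
            name, hasAttr := z.TagName()
            fmt.Println("tag:", string(name))
            for hasAttr {
                var key, val []byte
                key, val, hasAttr = z.TagAttr()
                fmt.Printf("  %s=%q\n", key, val) // keys come back lower-cased
            }
        }
    }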
Token returns the current Token. The result's Data and Attr values remain
valid after subsequent Next calls.
readByte returns the next byte from the input stream, doing a buffered read
from z.r into z.buf if necessary. z.buf[z.raw.start:z.raw.end] remains a contiguous byte
slice that holds all the bytes read so far for the current token.
It sets z.err if the underlying reader returns an error.
Pre-condition: z.err == nil.
readCDATA attempts to read a CDATA section and returns true if
successful. The opening "<!" has already been consumed.
readComment reads the next comment token starting with "<!--". The opening
"<!--" has already been consumed.
readDoctype attempts to read a doctype declaration and returns true if
successful. The opening "<!" has already been consumed.
readMarkupDeclaration reads the next token starting with "<!". It might be
a "<!--comment-->", a "<!DOCTYPE foo>", a "<![CDATA[section]]>" or
"<!a bogus comment". The opening "<!" has already been consumed.
readRawEndTag attempts to read a tag like "</foo>", where "foo" is z.rawTag.
If it succeeds, it backs up the input position to reconsume the tag and
returns true. Otherwise it returns false. The opening "</" has already been
consumed.
readRawOrRCDATA reads until the next "</foo>", where "foo" is z.rawTag and
is typically something like "script" or "textarea".
readScript reads until the next </script> tag, following the byzantine
rules for escaping/hiding the closing tag.
readStartTag reads the next start tag token. The opening "<a" has already
been consumed, where 'a' means anything in [A-Za-z].
readTag reads the next tag token and its attributes. If saveAttr, those
attributes are saved in z.attr, otherwise z.attr is set to an empty slice.
The opening "<a" or "</a" has already been consumed, where 'a' means anything
in [A-Za-z].
readTagAttrKey sets z.pendingAttr[0] to the "k" in "<div k=v>".
Precondition: z.err == nil.
readTagAttrVal sets z.pendingAttr[1] to the "v" in "<div k=v>".
readTagName sets z.data to the "div" in "<div k=v>". The reader (z.raw.end)
is positioned such that the first byte of the tag name (the "d" in "<div")
has already been consumed.
readUntilCloseAngle reads until the next ">".
skipWhiteSpace skips past any white space.
startTagIn returns whether the start tag in z.buf[z.data.start:z.data.end]
case-insensitively matches any element of ss.
func NewTokenizer(r io.Reader) *Tokenizer
func NewTokenizerFragment(r io.Reader, contextTag string) *Tokenizer
A TokenType is the type of a Token.
String returns a string representation of the TokenType.
T : expvar.Var
T : fmt.Stringer
T : context.stringer
T : runtime.stringer
func (*Tokenizer).Next() TokenType
func (*Tokenizer).readMarkupDeclaration() TokenType
func (*Tokenizer).readStartTag() TokenType
const CommentToken
const DoctypeToken
const EndTagToken
const ErrorToken
const SelfClosingTagToken
const StartTagToken
const TextToken
Package-Level Functions (total 50, in which 10 are exported)
EscapeString escapes special characters like "<" to become "&lt;". It
escapes only five such characters: <, >, &, ' and ".
UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
always true.
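A quick illustration of both functions:

    package main

    import (
        "fmt"

        "golang.org/x/net/html"
    )

    func main() {
        fmt.Println(html.EscapeString(`a<b & "c"`)) // a&lt;b &amp; &#34;c&#34;
        s := `a<b`
        fmt.Println(html.UnescapeString(html.EscapeString(s)) == s) // true
    }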
NewTokenizer returns a new HTML Tokenizer for the given Reader.
The input is assumed to be UTF-8 encoded.
NewTokenizerFragment returns a new HTML Tokenizer for the given Reader, for
tokenizing an existing element's InnerHTML fragment. contextTag is that
element's tag, such as "div" or "iframe".
For example, how the InnerHTML "a<b" is tokenized depends on whether it is
for a <p> tag or a <script> tag.
The input is assumed to be UTF-8 encoded.
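A sketch contrasting the two contexts mentioned above; the input is
illustrative:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        for _, ctx := range []string{"p", "script"} {
            z := html.NewTokenizerFragment(strings.NewReader("a<b"), ctx)
            z.Next()
            // In a <p> context, "a" is text and "<b" begins a tag; in a
            // <script> context, the entire input is raw text.
            fmt.Printf("%s context: first token %q\n", ctx, z.Raw())
        }
    }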
Parse returns the parse tree for the HTML from the given Reader.
It implements the HTML5 parsing algorithm
(https://html.spec.whatwg.org/multipage/syntax.html#tree-construction),
which is very complicated. The resultant tree can contain implicitly created
nodes that have no explicit <tag> listed in r's data, and nodes' parents can
differ from the nesting implied by a naive processing of start and end
<tag>s. Conversely, explicit <tag>s in r's data can be silently dropped,
with no corresponding node in the resulting tree.
The input is assumed to be UTF-8 encoded.
ParseFragment parses a fragment of HTML and returns the nodes that were
found. If the fragment is the InnerHTML for an existing element, pass that
element in context.
It has the same intricacies as Parse.
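A minimal sketch of parsing an InnerHTML fragment in the context of a <body>
element; the fragment is illustrative:

    package main

    import (
        "fmt"
        "log"
        "strings"

        "golang.org/x/net/html"
        "golang.org/x/net/html/atom"
    )

    func main() {
        // The context node stands in for the element whose InnerHTML this is.
        ctx := &html.Node{
            Type:     html.ElementNode,
            DataAtom: atom.Body,
            Data:     "body",
        }
        nodes, err := html.ParseFragment(strings.NewReader(`<p>Hi</p><p>there</p>`), ctx)
        if err != nil {
            log.Fatal(err)
        }
        for _, n := range nodes {
            fmt.Println(n.Data) // p, p
        }
    }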
ParseFragmentWithOptions is like ParseFragment, with options.
ParseOptionEnableScripting configures the scripting flag.
https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting
By default, scripting is enabled.
ParseWithOptions is like Parse, with options.
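For example, a sketch of parsing as a scripting-disabled user agent, which
changes how <noscript> content is handled:

    package main

    import (
        "log"
        "os"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        // With scripting disabled, the contents of <noscript> are parsed as
        // elements rather than as raw text.
        doc, err := html.ParseWithOptions(
            strings.NewReader(`<noscript><p>no JS</p></noscript>`),
            html.ParseOptionEnableScripting(false),
        )
        if err != nil {
            log.Fatal(err)
        }
        html.Render(os.Stdout, doc)
    }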
Render renders the parse tree n to the given writer.
Rendering is done on a 'best effort' basis: calling Parse on the output of
Render will always result in something similar to the original tree, but it
is not necessarily an exact clone unless the original tree was 'well-formed'.
'Well-formed' is not easily specified; the HTML5 specification is
complicated.
Calling Parse on arbitrary input typically results in a 'well-formed' parse
tree. However, it is possible for Parse to yield a 'badly-formed' parse tree.
For example, in a 'well-formed' parse tree, no <a> element is a child of
another <a> element: parsing "<a><a>" results in two sibling elements.
Similarly, in a 'well-formed' parse tree, no <a> element is a child of a
<table> element: parsing "<p><table><a>" results in a <p> with two sibling
children; the <a> is reparented to the <table>'s parent. However, calling
Parse on "<a><table><a>" does not return an error, but the result has an <a>
element with an <a> child, and is therefore not 'well-formed'.
Programmatically constructed trees are typically also 'well-formed', but it
is possible to construct a tree that looks innocuous but, when rendered and
re-parsed, results in a different tree. A simple example is that a solitary
text node would become a tree containing <html>, <head> and <body> elements.
Another example is that the programmatic equivalent of "a<head>b</head>c"
becomes "<html><head><head/><body>abc</body></html>".
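A minimal round-trip sketch:

    package main

    import (
        "log"
        "os"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        doc, err := html.Parse(strings.NewReader(`<p>Hello`))
        if err != nil {
            log.Fatal(err)
        }
        // The parser supplies the implied elements, so this prints
        // <html><head></head><body><p>Hello</p></body></html>
        if err := html.Render(os.Stdout, doc); err != nil {
            log.Fatal(err)
        }
    }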
UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a
larger range of entities than EscapeString escapes. For example, "&aacute;"
unescapes to "á", as does "&#225;" and "&#xE1;".
UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
always true.
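For instance:

    package main

    import (
        "fmt"

        "golang.org/x/net/html"
    )

    func main() {
        // Named, decimal and hexadecimal references decode to the same rune.
        fmt.Println(html.UnescapeString("&aacute; &#225; &#xE1;")) // á á á
    }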
Package-Level Variables (total 16, in which 1 are exported)
ErrBufferExceeded means that the buffering limit was exceeded.
Package-Level Constants (total 26, in which 14 are exported)
const CommentNode NodeType = 4
A CommentToken looks like <!--x-->.
const DoctypeNode NodeType = 5
A DoctypeToken looks like <!DOCTYPE x>.
const DocumentNode NodeType = 2
const ElementNode NodeType = 3
An EndTagToken looks like </a>.
const ErrorNode NodeType = 0
ErrorToken means that an error occurred during tokenization.
RawNode nodes are not returned by the parser, but can be part of the
Node tree passed to func Render to insert raw HTML (without escaping).
If so, this package makes no guarantee that the rendered HTML is secure
(from e.g. Cross Site Scripting attacks) or well-formed.
A SelfClosingTagToken tag looks like <br/>.
A StartTagToken looks like <a>.
const TextNode NodeType = 1
TextToken means a text node.