Involved Source Files
Package match defines matching algorithms and support code for the license checker.
regexp.gorematch.goresyntax.go
Package-Level Type Names (total 20, in which 8 are exported)
/* sort exporteds by: | */
A Dict maps words to integer indexes in a word list, of type WordID.
The zero Dict is an empty dictionary ready for use.
Lookup and Words are read-only operations,
safe for any number of concurrent calls from multiple goroutines.
Insert is a write operation; it must not run concurrently with
any other call, whether to Insert, Lookup, or Words.
// dict maps word to index in list
// list of known words
Insert adds the word w to the word list, returning its index.
If w is already in the word list, it is not added again; Insert returns the existing index.
InsertSplit splits text into a sequence of lowercase words,
inserting any new words in the dictionary.
Lookup looks for the word w in the word list and returns its index.
If w is not in the word list, Lookup returns BadWord.
Split splits text into a sequence of lowercase words.
It does not add any new words to the dictionary.
Unrecognized words are reported as having ID = BadWord.
Words returns the current word list.
The list is not a copy; the caller can read but must not modify the list.
(*T) split(text string, insert bool) []Word
func (*LRE).Dict() *Dict
func (*MultiLRE).Dict() *Dict
func ParseLRE(d *Dict, file, s string) (*LRE, error)
func reParse(d *Dict, s string, strict bool) (*reSyntax, error)
func rePrint(b *bytes.Buffer, re *reSyntax, d *Dict)
An LRE is a compiled license regular expression.
TODO: Move this comment somewhere non-internal later.
A license regular expression (LRE) is a pattern syntax intended for
describing large English texts such as software licenses, with minor
allowed variations. The pattern syntax and the matching are word-based
and case-insensitive; punctuation is ignored in the pattern and in
the matched text.
The valid LRE patterns are:
word - a single case-insensitive word
__N__ - any sequence of up to N words
expr1 expr2 - concatenation
expr1 || expr2 - alternation
(( expr )) - grouping
expr?? - zero or one instances of expr
//** text **// - a comment
To make patterns harder to misread in large texts:
- || must only appear inside (( ))
- ?? must only follow (( ))
- (( must be at the start of a line, preceded only by spaces
- )) must be at the end of a line, followed only by spaces and ??.
For example:
//** https://en.wikipedia.org/wiki/Filler_text **//
Now is
((not))??
the time for all good
((men || women || people))
to come to the aid of their __1__.
dfareDFAdict*DictfilestringonceDFAsync.OnceprogreProgsyntax*reSyntax
Dict returns the Dict used by the LRE.
File returns the file name passed to ParseLRE.
compile initializes lre.dfa.
It is invoked lazily (in Match) because most LREs end up only
being inputs to a MultiLRE; we never need their DFAs directly.
Match reports whether text matches the license regexp.
func ParseLRE(d *Dict, file, s string) (*LRE, error)
func NewMultiLRE(list []*LRE) (_ *MultiLRE, err error)
A Match records the position of a single match in a text.
// word index of end of match
// index of LRE in list passed to NewMultiLRE
// word index of start of match
A Matches is a collection of all leftmost-longest, non-overlapping matches in text.
// the matches
// the entire text
// the text, split into Words
func (*MultiLRE).Match(text string) *Matches
A MultiLRE matches multiple LREs simultaneously against a text.
It is more efficient than matching each LRE in sequence against the text.
// compiled DFA for all LREs
// dict shared by all LREs
start contains the two-word phrases
where a match can validly start,
to allow for faster scans over non-license text.
Dict returns the Dict used by the MultiLRE.
Match reports all leftmost-longest, non-overlapping matches in text.
It always returns a non-nil *Matches, in order to return the split text.
Check len(matches.List) to see whether any matches were found.
func NewMultiLRE(list []*LRE) (_ *MultiLRE, err error)
A dfaBuilder holds state for building a DFA from a reProg.
// DFA so far
// encoding buffer
// map from encoded NFA state to dfa array offset
// program being processed
add returns the offset of the NFA state s in the DFA b.dfa,
adding it to the end of the DFA if needed.
An nfaState represents the state of the NFA - all possible instruction locations -
after reading a particular input.
add adds pc and other states reachable from it
to the set of possible instruction locations in *s.
appendEncoding appends a byte encoding of the state s to enc and returns the result.
match returns the smallest match value of matches reached in state s,
or -1 if there is no match.
next returns the new state that results from reading word w in state s,
and whether a match has been belatedly detected just before w.
trim canonicalizes *s by sorting it and removing unnecessary states.
All that must be preserved between input tokens are the instruction
locations that advance the input (instWord and instAny) or that
report a match (instMatch).
words returns the list of distinct words that can
lead the NFA out of state s and into a new state.
The returned list is sorted in increasing order.
If the state can match any word (using instAny),
the word ID AnyWord is first in the list.
func nfaStart(prog reProg) nfaState
A phrase is a phrase of up to two words.
The zero-word phrase is phrase{NoWord, NoWord}.
A single-word phrase w is phrase{w, NoWord}.
reCompile holds compilation state for a single regexp.
cut[]reCut
// compiling the end of the pattern
// first problem found; report delayed until end of compile
// program being constructed
compile appends the compiled program for re to c.prog.
compileCut emits an instCut instruction for cut.
compileCuts emits instCut instructions for all pending cuts.
See comment at top of file for information about cuts.
mergeCut merges the two cut lists cut1 and cut2 into a single cut list.
Cuts with the same start but different triggers are merged into a
single entry with the larger of the two triggers.
reduceCut records that a new literal word has been matched,
reducing the triggers in c.cut by 1 and emitting any triggered cuts.
reCut holds the information about a pending cut.
// cut off the alt at pc = start
// ... after trigger more literal word matches
A reDFA is an encoded DFA over word IDs.
The encoded DFA is a sequence of encoded DFA states, packed together.
Each DFA state is identified by the index where it starts in the slice.
The initial DFA state is at the start of the slice, index 0.
Each DFA state records whether reaching that state counts as matching
the input, which of multiple regexps matched, and then the transition
table for the possible words that lead to new states. (If a word is found
that is not in the current state's transition table, the DFA stops immediately
with no match.)
The encoding of this state information is:
- a one-word header M | N<<1, where M is 0 for a non-match, 1 for a match,
and N is the number of words in the table.
This header is conveniently also the number of words that follow in the encoding.
- if M == 1, a one-word value V that is the match value to report,
identifying which of a set of regexps has been matched.
- N two-word pairs W:NEXT indicating that if word W is seen, the DFA should
move to the state at offset NEXT. The pairs are sorted by W. An entry for W == AnyWord
is treated as matching any input word; an exact match later in the list takes priority.
The list is sorted by W, so AnyWord is always first if present.
match looks for a match of DFA at the start of words,
which are the result of dict.Split(text) or a subslice of it.
match returns the match ID of the longest match, as well as
the index in words immediately following the last matched word.
If there is no match, match returns -1, 0.
stateAt returns (partly) decoded information about the
DFA state at the given offset.
If the state is a matching state, stateAt returns match >= 0 specifies the match ID.
If the state is not a matching state, stateAt returns match == -1.
Either way, stateAt also returns the outgoing transition list
interlaced in the delta slice. The caller can iterate over delta using:
for i := 0; i < len(delta); i += 2 {
dw, dnext := WordID(delta[i]), delta[i+1]
if currentWord == dw {
off = dnext
}
}
string returns a textual listing of the DFA.
The dictionary d supplies the actual words for the listing.
func reCompileDFA(prog reProg) reDFA
A reInst is a regexp instruction: an opcode and a numeric argument
argint32opinstOp
A reParser is the regexp parser state.
dict*Dictstack[]*reSyntax
alternate replaces the top of the stack (above the topmost '((') with its alternation.
collapse returns the result of applying op to sub.
If sub contains op nodes, they all get hoisted up
so that there is never a concat of a concat or an
alternate of an alternate.
concat replaces the top of the stack (above the topmost '||' or '((') with its concatenation.
push pushes the regexp re onto the parse stack and returns the regexp.
quest replaces the top stack element with itself made optional.
rightParen handles a )) in the input.
If the top of the stack is an element followed by an opVerticalBar
swapVerticalBar swaps the two and returns true.
Otherwise it returns false.
verticalBar handles a || in the input.
words handles a block of words in the input.
A reProg is a regexp program: an instruction list.
string returns a textual listing of the given program.
The dictionary d supplies the actual words for the listing.
func reCompileMulti(list []reProg) reProg
func nfaStart(prog reProg) nfaState
func reCompileDFA(prog reProg) reDFA
func reCompileMulti(list []reProg) reProg
A reSyntax is a regexp syntax tree.
// wildcard count (opWild)
// opcode
// subexpressions (opConcat, opAlternate, opWild, opQuest)
// words (opWords)
compile appends a program for the regular expression re to init and returns the result.
A successful match of the program for re will report the match value m.
leadingPhrases returns the set of possible initial phrases
in any match of the given re syntax.
string returns a text form for the regexp syntax.
The dictionary d supplies the word literals.
func reParse(d *Dict, s string, strict bool) (*reSyntax, error)
func canMatchEmpty(re *reSyntax) bool
func rePrint(b *bytes.Buffer, re *reSyntax, d *Dict)
Package-Level Functions (total 25, in which 2 are exported)
NewMultiLRE returns a MultiLRE looking for the given LREs.
All the LREs must have been parsed using the same Dict;
if not, NewMultiLRE panics.
ParseLRE parses the string s as a license regexp.
The file name is used in error messages if non-empty.
appendFoldRune appends foldRune(r) to buf and returns the updated buffer.
atBOL reports whether i is at the beginning of a line (ignoring spaces) in s.
atEOL reports whether i is at the end of a line (ignoring spaces) in s.
canMatchEmpty reports whether re can match an empty text.
canMisspell reports whether want can be misspelled as have.
Both words have been converted to lowercase already
(want by the Dict, have by the caller).
canMisspellJoin reports whether want can be misspelled as the word pair have1, have2.
All three words have been converted to lowercase already
(want by the Dict, have1, have2 by the caller).
foldRune returns the folded rune r.
It returns -1 if the rune r should be omitted entirely.
Folding can be any canonicalizing transformation we want.
For now folding means:
- fold to consistent case (unicode.SimpleFold, but moving to lower-case afterward)
- return -1 for (drop) combining grave and acute U+0300, U+0301
- strip pre-combined graves and acutes on vowels:
é to e, etc. (for Canadian or European licenses
mentioning Québec or Commissariat à l'Energie Atomique)
If necessary we could do a full Unicode-based conversion,
but that will require more thought about exactly what to do
and doing it efficiently. For now, the accents are enough.
htmlEntitySize returns the length of the HTML entity expression at the start of t, or else 0.
htmlTagSize returns the length of the HTML tag at the start of t, or else 0.
isWordContinue reports whether r can appear in a word, after the start.
isWordStart reports whether r can appear at the start of a word.
markdownAnchorSize returns the length of the Markdown anchor at the start of t, or else 0.
(like {#head})
markdownLinkSize returns the length of the Markdown link target at the start of t, or else 0.
Instead of fully parsing Markdown, this looks for ](http:// or ](https://.
nfaStart returns the start state for the NFA executing prog.
nl guarantees b ends with a complete, non-empty line with no trailing spaces
or has no lines at all.
reCompileDFA compiles prog into a DFA.
reCompileMulti returns a program that matches any of the listed regexps.
The regexp list[i] returns match value i when it matches.
reParse parses a license regexp s
and returns a reSyntax parse tree.
reParse adds words to the dictionary d,
so it is not safe to call reParse from concurrent goroutines
using the same dictionary.
If strict is false, the rules about operators at the start or end of line are ignored,
to make trivial test expressions easier to write.
rePrint prints re to b, using d for words.
reSyntaxError returns a *SyntaxError with context.
Package-Level Variables (total 4, in which 1 are exported)
TraceDFA controls whether DFA execution prints debug tracing when stuck.
If TraceDFA > 0 and the DFA has followed a path of at least TraceDFA symbols
since the last matching state but hits a dead end, it prints out information
about the dead end.
canonicalRewrites is a list of pairs that are canonicalized during word splittting.
The words on the right are parsed as if they were the words on the left.
This happens during dictionary splitting, so canMisspell will never see any
of the words on the right.
The pages are generated with Goldsv0.3.2-preview. (GOOS=darwin GOARCH=amd64)
Golds is a Go 101 project developed by Tapir Liu.
PR and bug reports are welcome and can be submitted to the issue list.
Please follow @Go100and1 (reachable from the left QR code) to get the latest news of Golds.