Package: github.com/google/licensecheck/internal/match

package match

Import Path
	github.com/google/licensecheck/internal/match

Dependency Relation
	imports 10 packages, and is imported by one package

Involved Source Files

	   d➜ dict.go
		Package match defines matching algorithms and support code for the license checker.

	      regexp.go
	      rematch.go
	      resyntax.go
Package-Level Type Names (total 20, in which 8 are exported)

	 type Dict (struct)
		A Dict maps words to integer indexes in a word list, of type WordID.
		The zero Dict is an empty dictionary ready for use.

		Lookup and Words are read-only operations,
		safe for any number of concurrent calls from multiple goroutines.
		Insert is a write operation; it must not run concurrently with
		any other call, whether to Insert, Lookup, or Words.

		Fields (total 2, neither is exported)
			/* 2 unexporteds: */
			dict map[string]WordID
				// dict maps word to index in list

			list []string
				// list of known words

		Methods (total 6, in which 5 are exported)
			(*T) Insert(w string) WordID
				Insert adds the word w to the word list, returning its index.
				If w is already in the word list, it is not added again; Insert returns the existing index.

			(*T) InsertSplit(text string) []Word
				InsertSplit splits text into a sequence of lowercase words,
				inserting any new words in the dictionary.

			(*T) Lookup(w string) WordID
				Lookup looks for the word w in the word list and returns its index.
				If w is not in the word list, Lookup returns BadWord.

			(*T) Split(text string) []Word
				Split splits text into a sequence of lowercase words.
				It does not add any new words to the dictionary.
				Unrecognized words are reported as having ID = BadWord.

			(*T) Words() []string
				Words returns the current word list.
				The list is not a copy; the caller can read but must not modify the list.

			/* one unexported: */
			(*T) split(text string, insert bool) []Word
		As Outputs Of (at least 2, both are exported)
			func (*LRE).Dict() *Dict
			func (*MultiLRE).Dict() *Dict
		As Inputs Of (at least 3, in which 1 is exported)
			func ParseLRE(d *Dict, file, s string) (*LRE, error)
			/* 2+ unexporteds: */
			func reParse(d *Dict, s string, strict bool) (*reSyntax, error)
			func rePrint(b *bytes.Buffer, re *reSyntax, d *Dict)
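
		The Dict contract described above (a map from word to index plus a
		word list, usable from its zero value) can be illustrated with a
		minimal sketch. The type miniDict below is a hypothetical stand-in
		written for this note, not the package's implementation; only the
		field names, WordID, and BadWord follow the listing.

```go
package main

import "fmt"

// WordID mirrors the documented type: the index of a word in a dictionary.
type WordID int32

// BadWord mirrors the documented constant for a word not in the dictionary.
const BadWord WordID = -1

// miniDict is a hypothetical stand-in for Dict with the two documented fields.
type miniDict struct {
	dict map[string]WordID // maps word to index in list
	list []string          // list of known words
}

// Insert adds w if it is new and returns its index either way.
func (d *miniDict) Insert(w string) WordID {
	if id, ok := d.dict[w]; ok { // reading a nil map is safe
		return id
	}
	if d.dict == nil {
		d.dict = make(map[string]WordID) // allocate lazily so the zero value works
	}
	id := WordID(len(d.list))
	d.dict[w] = id
	d.list = append(d.list, w)
	return id
}

// Lookup returns the index of w, or BadWord if w is unknown.
func (d *miniDict) Lookup(w string) WordID {
	if id, ok := d.dict[w]; ok {
		return id
	}
	return BadWord
}

func main() {
	var d miniDict // the zero value is ready for use
	fmt.Println(d.Insert("copyright"), d.Insert("notice"), d.Insert("copyright"))
	fmt.Println(d.Lookup("notice"), d.Lookup("warranty"))
}
```

		As with the documented Dict, this sketch's Insert mutates shared
		state, so it would need external synchronization if called
		concurrently with Lookup.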

	 type LRE (struct)
		An LRE is a compiled license regular expression.

		TODO: Move this comment somewhere non-internal later.

		A license regular expression (LRE) is a pattern syntax intended for
		describing large English texts such as software licenses, with minor
		allowed variations. The pattern syntax and the matching are word-based
		and case-insensitive; punctuation is ignored in the pattern and in
		the matched text.

		The valid LRE patterns are:

			word            - a single case-insensitive word
			__N__           - any sequence of up to N words
			expr1 expr2     - concatenation
			expr1 || expr2  - alternation
			(( expr ))      - grouping
			expr??          - zero or one instances of expr
			//** text **//  - a comment

		To make patterns harder to misread in large texts:

			- || must only appear inside (( ))
			- ?? must only follow (( ))
			- (( must be at the start of a line, preceded only by spaces
			- )) must be at the end of a line, followed only by spaces and ??.

		For example:

			//** https://en.wikipedia.org/wiki/Filler_text **//
			Now is
			((not))??
			the time for all good
			((men || women || people))
			to come to the aid of their __1__.
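
		To make the semantics of the example concrete, here is a small
		backtracking matcher over words. It is only a sketch: the package
		compiles patterns to a DFA, whereas this code walks a hand-built
		pattern tree. The node type, the match and matches functions, and
		the word splitter are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// node is a hypothetical pattern tree; it is not the package's reSyntax.
type node struct {
	kind  string   // "words", "wild", "concat", "alt", or "quest"
	words []string // literal words (kind "words")
	n     int      // maximum wildcard length (kind "wild")
	sub   []*node  // subexpressions (kinds "concat", "alt", "quest")
}

// match reports whether p matches in[i:j] for some j accepted by k.
func match(p *node, in []string, i int, k func(int) bool) bool {
	switch p.kind {
	case "words":
		for _, w := range p.words {
			if i >= len(in) || in[i] != w {
				return false
			}
			i++
		}
		return k(i)
	case "wild": // __N__: any sequence of up to N words
		for j := 0; j <= p.n && i+j <= len(in); j++ {
			if k(i + j) {
				return true
			}
		}
		return false
	case "concat":
		return matchSeq(p.sub, in, i, k)
	case "alt":
		for _, s := range p.sub {
			if match(s, in, i, k) {
				return true
			}
		}
		return false
	case "quest": // zero or one instances
		return k(i) || match(p.sub[0], in, i, k)
	}
	return false
}

// matchSeq matches each pattern in seq in turn, then calls k.
func matchSeq(seq []*node, in []string, i int, k func(int) bool) bool {
	if len(seq) == 0 {
		return k(i)
	}
	return match(seq[0], in, i, func(j int) bool { return matchSeq(seq[1:], in, j, k) })
}

// split lowercases text and splits it on anything but letters and digits.
func split(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func words(s string) *node { return &node{kind: "words", words: split(s)} }

// example is the filler-text pattern above, built by hand rather than parsed.
var example = &node{kind: "concat", sub: []*node{
	words("Now is"),
	{kind: "quest", sub: []*node{words("not")}},
	words("the time for all good"),
	{kind: "alt", sub: []*node{words("men"), words("women"), words("people")}},
	words("to come to the aid of their"),
	{kind: "wild", n: 1},
}}

// matches reports whether the whole text matches the example pattern.
func matches(text string) bool {
	in := split(text)
	return match(example, in, 0, func(j int) bool { return j == len(in) })
}

func main() {
	fmt.Println(matches("Now is the time for all good women to come to the aid of their country."))
	fmt.Println(matches("NOW IS NOT THE TIME for all good dogs to come to the aid of their party"))
}
```

		Because case and punctuation are discarded before matching, the
		first input matches; the second fails on "dogs", which is not one
		of the three alternatives.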

		Fields (total 6, none are exported)
			/* 6 unexporteds: */
			dfa reDFA
			dict *Dict
			file string
			onceDFA sync.Once
			prog reProg
			syntax *reSyntax
		Methods (total 4, in which 2 are exported)
			(*T) Dict() *Dict
				Dict returns the Dict used by the LRE.

			(*T) File() string
				File returns the file name passed to ParseLRE.

			/* 2 unexporteds: */
			(*T) compile()
				compile initializes lre.dfa.
				It is invoked lazily (in match) because most LREs end up only
				being inputs to a MultiLRE; we never need their DFAs directly.

			(*T) match(text string) bool
				match reports whether text matches the license regexp.

		As Outputs Of (at least one exported)
			func ParseLRE(d *Dict, file, s string) (*LRE, error)
		As Inputs Of (at least one exported)
			func NewMultiLRE(list []*LRE) (_ *MultiLRE, err error)

	 type Match (struct)
		A Match records the position of a single match in a text.

		Fields (total 3, all are exported)
			End int
				// word index of end of match

			ID int
				// index of LRE in list passed to NewMultiLRE

			Start int
				// word index of start of match


	 type Matches (struct)
		A Matches is a collection of all leftmost-longest, non-overlapping matches in text.

		Fields (total 3, all are exported)
			List []Match
				// the matches

			Text string
				// the entire text

			Words []Word
				// the text, split into Words

		As Outputs Of (at least one exported)
			func (*MultiLRE).Match(text string) *Matches

	 type MultiLRE (struct)
		A MultiLRE matches multiple LREs simultaneously against a text.
		It is more efficient than matching each LRE in sequence against the text.

		Fields (total 3, none are exported)
			/* 3 unexporteds: */
			dfa reDFA
				// compiled DFA for all LREs

			dict *Dict
				// dict shared by all LREs

			start map[phrase]struct{}
				start contains the two-word phrases
				where a match can validly start,
				to allow for faster scans over non-license text.

		Methods (total 2, both are exported)
			(*T) Dict() *Dict
				Dict returns the Dict used by the MultiLRE.

			(*T) Match(text string) *Matches
				Match reports all leftmost-longest, non-overlapping matches in text.
				It always returns a non-nil *Matches, in order to return the split text.
				Check len(matches.List) to see whether any matches were found.

		As Outputs Of (at least one exported)
			func NewMultiLRE(list []*LRE) (_ *MultiLRE, err error)
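
		"Leftmost-longest, non-overlapping" can be read as a scan that, at
		each word position, takes the longest match starting there and then
		resumes after it. The sketch below shows that scan with plain
		phrases instead of LREs. The names match, longestAt, and scan are
		hypothetical, and treating End as an exclusive word index is an
		assumption, not something the listing states.

```go
package main

import "fmt"

// match is a hypothetical analogue of Match: phrase ID plus word range.
type match struct{ ID, Start, End int }

// equal reports whether a and b hold the same words.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// longestAt returns the index and length of the longest phrase in list
// that is a prefix of words, or -1, 0 if none is.
func longestAt(list [][]string, words []string) (id, n int) {
	id = -1
	for i, p := range list {
		if len(p) > n && len(p) <= len(words) && equal(p, words[:len(p)]) {
			id, n = i, len(p)
		}
	}
	return id, n
}

// scan reports leftmost-longest, non-overlapping phrase matches in words.
func scan(list [][]string, words []string) []match {
	var out []match
	for i := 0; i < len(words); {
		id, n := longestAt(list, words[i:])
		if id < 0 {
			i++ // no match starts here; try the next word
			continue
		}
		out = append(out, match{id, i, i + n})
		i += n // resume after the match so matches never overlap
	}
	return out
}

func main() {
	list := [][]string{{"mit"}, {"mit", "license"}, {"license", "text"}}
	words := []string{"the", "mit", "license", "text", "and", "mit"}
	fmt.Println(scan(list, words))
}
```

		In this run the phrase "license text" is never reported, because
		the longer match "mit license" has already consumed the word
		"license".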

	 type SyntaxError (struct)
		A SyntaxError reports a syntax error during parsing.

		Fields (total 4, all are exported)
			Context string
			Err string
			File string
			Offset int
		Methods (only one, which is exported)
			(*T) Error() string
		Implements (at least one exported)
			*T : error

	 type Word (struct)
		A Word represents a single word found in a text.

		Fields (total 3, all are exported)
			Hi int32
			ID WordID
			Lo int32
				// Word appears at text[Lo:Hi].

		As Outputs Of (at least 3, in which 2 are exported)
			func (*Dict).InsertSplit(text string) []Word
			func (*Dict).Split(text string) []Word
			/* at least one unexported: */
			func (*Dict).split(text string, insert bool) []Word
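
		The Lo and Hi fields are byte offsets into the original text, so a
		caller can recover the exact source spelling of each word. A rough
		sketch of offset-preserving splitting follows. The word type and
		splitWords function are hypothetical, and the sketch treats any run
		of letters or digits as a word, which is simpler than the package's
		own rules (see isWordStart, isWordContinue, and the HTML and
		Markdown helpers).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// word is a hypothetical stand-in for Word without the ID field.
type word struct {
	Lo, Hi int32 // the word appears at text[Lo:Hi]
}

// splitWords returns the byte ranges of maximal letter/digit runs in text.
func splitWords(text string) []word {
	var out []word
	start := -1 // byte offset of the current word, or -1 between words
	for i, r := range text {
		isWord := unicode.IsLetter(r) || unicode.IsDigit(r)
		if isWord && start < 0 {
			start = i
		}
		if !isWord && start >= 0 {
			out = append(out, word{int32(start), int32(i)})
			start = -1
		}
	}
	if start >= 0 { // text ended in the middle of a word
		out = append(out, word{int32(start), int32(len(text))})
	}
	return out
}

func main() {
	text := "Permission is hereby GRANTED, free of charge."
	for _, w := range splitWords(text) {
		fmt.Println(w.Lo, w.Hi, strings.ToLower(text[w.Lo:w.Hi]))
	}
}
```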

	 type WordID int32 (basic type)
		A WordID is the index of a word in a dictionary.

		As Outputs Of (at least 2, both are exported)
			func (*Dict).Insert(w string) WordID
			func (*Dict).Lookup(w string) WordID
		As Inputs Of (at least one unexported)
			/* at least one unexported: */
			func sortWordIDs(x []WordID)
		As Types Of (total 2, both are exported)
			const AnyWord
			const BadWord

	/* 12 unexporteds: */
	 type dfaBuilder (struct)
		A dfaBuilder holds state for building a DFA from a reProg.

		Fields (total 4, none are exported)
			/* 4 unexporteds: */
			dfa reDFA
				// DFA so far

			enc []byte
				// encoding buffer

			have map[string]int
				// map from encoded NFA state to dfa array offset

			prog reProg
				// program being processed

		Methods (only one, which is unexported)
			/* one unexported: */
			(*T) add(s nfaState) int32
				add returns the offset of the NFA state s in the DFA b.dfa,
				adding it to the end of the DFA if needed.


	 type instOp int32 (basic type)
		An instOp is the opcode for a regexp instruction.

		As Types Of (total 7, none are exported)
			/* 7 unexporteds: */
			const instAlt
			const instAny
			const instCut
			const instInvalid
			const instJump
			const instMatch
			const instWord

	 type nfaState ([]T)
		An nfaState represents the state of the NFA - all possible instruction locations -
		after reading a particular input.

		Methods (total 6, none are exported)
			/* 6 unexporteds: */
			(*T) add(prog reProg, pc int32)
				add adds pc and other states reachable from it
				to the set of possible instruction locations in *s.

			( T) appendEncoding(enc []byte) []byte
				appendEncoding appends a byte encoding of the state s to enc and returns the result.

			( T) match(prog reProg) int32
				match returns the smallest match value of matches reached in state s,
				or -1 if there is no match.

			( T) next(prog reProg, w WordID) nfaState
				next returns the new state that results from reading word w in state s,
				and whether a match has been belatedly detected just before w.

			(*T) trim(prog reProg)
				trim canonicalizes *s by sorting it and removing unnecessary states.
				All that must be preserved between input tokens are the instruction
				locations that advance the input (instWord and instAny) or that
				report a match (instMatch).

			( T) words(prog reProg) []WordID
				words returns the list of distinct words that can
				lead the NFA out of state s and into a new state.
				The returned list is sorted in increasing order.
				If the state can match any word (using instAny),
				the word ID AnyWord is first in the list.

		As Outputs Of (at least one unexported)
			/* at least one unexported: */
			func nfaStart(prog reProg) nfaState
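
		The add and trim descriptions above amount to computing the set of
		instruction locations reachable without consuming input, keeping
		only those that wait for a word or report a match. The sketch below
		does that for a hand-assembled program, using the jump rules given
		for instAlt and instJump in the constants list. The inst type,
		closure function, and exampleProg are hypothetical; instCut is left
		out, and the meaning of arg for instWord (a word ID) is an
		assumption.

```go
package main

import (
	"fmt"
	"sort"
)

type instOp int32

// Opcode values copied from the constants list; instCut is omitted here.
const (
	instWord  instOp = 1 // match specific word
	instAny   instOp = 2 // match any word
	instAlt   instOp = 3 // jump to both pc+1 and pc+1+arg
	instJump  instOp = 4 // jump to pc+1+arg
	instMatch instOp = 5 // completed match identified by arg
)

// inst is a hypothetical instruction: an opcode and a numeric argument.
type inst struct {
	op  instOp
	arg int32
}

// closure returns, in increasing order, the instruction locations reachable
// from pc without reading input, following instAlt and instJump.
func closure(prog []inst, pc int32) []int32 {
	seen := make(map[int32]bool)
	var out []int32
	var visit func(pc int32)
	visit = func(pc int32) {
		if seen[pc] {
			return
		}
		seen[pc] = true
		switch prog[pc].op {
		case instAlt:
			visit(pc + 1)
			visit(pc + 1 + prog[pc].arg)
		case instJump:
			visit(pc + 1 + prog[pc].arg)
		default: // instWord, instAny, instMatch: execution waits here
			out = append(out, pc)
		}
	}
	visit(pc)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// exampleProg is a hand-assembled program for: a ((b || c)),
// assuming word IDs a=0, b=1, c=2.
var exampleProg = []inst{
	{instWord, 0},  // 0: a
	{instAlt, 2},   // 1: try pc 2 and pc 4
	{instWord, 1},  // 2: b
	{instJump, 1},  // 3: skip to pc 5
	{instWord, 2},  // 4: c
	{instMatch, 0}, // 5: report match value 0
}

func main() {
	fmt.Println(closure(exampleProg, 1)) // locations waiting for b or c
	fmt.Println(closure(exampleProg, 3)) // the jump leads to the match
}
```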

	 type phrase ([...]T)
		A phrase is a phrase of up to two words.
		The zero-word phrase is phrase{NoWord, NoWord}.
		A single-word phrase w is phrase{w, NoWord}.


	 type reCompile (struct)
		reCompile holds compilation state for a single regexp.

		Fields (total 4, none are exported)
			/* 4 unexporteds: */
			cut []reCut
			endPattern bool
				// compiling the end of the pattern

			err error
				// first problem found; report delayed until end of compile

			prog reProg
				// program being constructed

		Methods (total 5, none are exported)
			/* 5 unexporteds: */
			(*T) compile(re *reSyntax)
				compile appends the compiled program for re to c.prog.

			(*T) compileCut(cut reCut)
				compileCut emits an instCut instruction for cut.

			(*T) compileCuts()
				compileCuts emits instCut instructions for all pending cuts.
				See comment at top of file for information about cuts.

			(*T) mergeCut(cut1, cut2 []reCut) []reCut
				mergeCut merges the two cut lists cut1 and cut2 into a single cut list.
				Cuts with the same start but different triggers are merged into a
				single entry with the larger of the two triggers.

			(*T) reduceCut()
				reduceCut records that a new literal word has been matched,
				reducing the triggers in c.cut by 1 and emitting any triggered cuts.


	 type reCut (struct)
		reCut holds the information about a pending cut.

		Fields (total 2, neither is exported)
			/* 2 unexporteds: */
			start int
				// cut off the alt at pc = start

			trigger int
				// ... after trigger more literal word matches


	 type reDFA ([]T)
		A reDFA is an encoded DFA over word IDs.

		The encoded DFA is a sequence of encoded DFA states, packed together.
		Each DFA state is identified by the index where it starts in the slice.
		The initial DFA state is at the start of the slice, index 0.

		Each DFA state records whether reaching that state counts as matching
		the input, which of multiple regexps matched, and then the transition
		table for the possible words that lead to new states. (If a word is found
		that is not in the current state's transition table, the DFA stops immediately
		with no match.)

		The encoding of this state information is:

			-  a one-word header M | N<<1, where M is 0 for a non-match, 1 for a match,
			   and N is the number of words in the table.
			   This header is conveniently also the number of words that follow in the encoding.

			- if M == 1, a one-word value V that is the match value to report,
			  identifying which of a set of regexps has been matched.

			- N two-word pairs W:NEXT indicating that if word W is seen, the DFA should
			  move to the state at offset NEXT. An entry for W == AnyWord is treated as
			  matching any input word; an exact match later in the list takes priority.
			  The pairs are sorted by W, so AnyWord is always first if present.
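
		A worked instance of this encoding may help. The sketch below
		hand-encodes a four-state DFA as a flat []int32 and decodes it
		following the three rules above. The stateAt and run functions here
		are illustrative rewrites, not the package's code, and run is
		simplified: it requires every input word to be consumed rather than
		tracking the longest match.

```go
package main

import "fmt"

const anyWord = -2 // mirrors the documented AnyWord value

// dfa encodes four states for word IDs 0 and 1 (assumed for this example).
var dfa = []int32{
	2, 0, 3, // offset 0: header 0|1<<1, pair: word 0 -> offset 3
	4, anyWord, 8, 1, 10, // offset 3: header 0|2<<1, pairs: AnyWord -> 8, word 1 -> 10
	1, 0, // offset 8: header 1|0<<1, match value 0, no pairs
	1, 1, // offset 10: header 1|0<<1, match value 1, no pairs
}

// stateAt decodes the state starting at dfa[off]: its match value
// (-1 if not a matching state) and its interlaced transition pairs.
func stateAt(dfa []int32, off int32) (match int32, delta []int32) {
	hdr := dfa[off]
	off++
	match = -1
	if hdr&1 != 0 { // M == 1: a match value follows the header
		match = dfa[off]
		off++
	}
	n := hdr >> 1 // N: number of W:NEXT pairs
	return match, dfa[off : off+2*n]
}

// run feeds words to the DFA from offset 0 and returns the match value of
// the state reached, or -1 on a dead end or a non-matching final state.
func run(dfa []int32, words []int32) int32 {
	off := int32(0)
	for _, w := range words {
		_, delta := stateAt(dfa, off)
		next := int32(-1)
		for i := 0; i < len(delta); i += 2 {
			// AnyWord sorts first, so a later exact entry overrides it.
			if delta[i] == anyWord || delta[i] == w {
				next = delta[i+1]
			}
		}
		if next < 0 {
			return -1 // word not in the transition table
		}
		off = next
	}
	match, _ := stateAt(dfa, off)
	return match
}

func main() {
	fmt.Println(run(dfa, []int32{0, 1})) // exact entry for word 1 wins
	fmt.Println(run(dfa, []int32{0, 5})) // falls back to the AnyWord entry
	fmt.Println(run(dfa, []int32{3}))    // dead end in the first state
}
```

		Note how each header value (2, 4, 1, 1) equals the number of
		int32 values that follow it in that state, as the encoding
		description points out.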

		Methods (total 3, none are exported)
			/* 3 unexporteds: */
			( T) match(dict *Dict, text string, words []Word) (match int32, end int)
				match looks for a match of DFA at the start of words,
				which are the result of dict.Split(text) or a subslice of it.
				match returns the match ID of the longest match, as well as
				the index in words immediately following the last matched word.
				If there is no match, match returns -1, 0.

			( T) stateAt(off int32) (match int32, delta []int32)
				stateAt returns (partly) decoded information about the
				DFA state at the given offset.
				If the state is a matching state, stateAt returns match >= 0, which specifies the match ID.
				If the state is not a matching state, stateAt returns match == -1.
				Either way, stateAt also returns the outgoing transition list
				interlaced in the delta slice. The caller can iterate over delta using:

					for i := 0; i < len(delta); i += 2 {
						dw, dnext := WordID(delta[i]), delta[i+1]
						if currentWord == dw {
							off = dnext
						}
					}

			( T) string(d *Dict) string
				string returns a textual listing of the DFA.
				The dictionary d supplies the actual words for the listing.

		As Outputs Of (at least one unexported)
			/* at least one unexported: */
			func reCompileDFA(prog reProg) reDFA

	 type reInst (struct)
		A reInst is a regexp instruction: an opcode and a numeric argument.

		Fields (total 2, neither is exported)
			/* 2 unexporteds: */
			arg int32
			op instOp

	 type reOp int (basic type)
		A reOp is the opcode for a regexp syntax tree node.

		As Types Of (total 10, none are exported)
			/* 10 unexporteds: */
			const opAlternate
			const opConcat
			const opEmpty
			const opLeftParen
			const opNone
			const opPseudo
			const opQuest
			const opVerticalBar
			const opWild
			const opWords

	 type reParser (struct)
		A reParser is the regexp parser state.

		Fields (total 2, neither is exported)
			/* 2 unexporteds: */
			dict *Dict
			stack []*reSyntax
		Methods (total 9, none are exported)
			/* 9 unexporteds: */
			(*T) alternate() *reSyntax
				alternate replaces the top of the stack (above the topmost '((') with its alternation.

			(*T) collapse(op reOp, subs []*reSyntax) *reSyntax
				collapse returns the result of applying op to subs.
				If subs contains op nodes, they all get hoisted up
				so that there is never a concat of a concat or an
				alternate of an alternate.

			(*T) concat() *reSyntax
				concat replaces the top of the stack (above the topmost '||' or '((') with its concatenation.

			(*T) push(re *reSyntax) *reSyntax
				push pushes the regexp re onto the parse stack and returns the regexp.

			(*T) quest() error
				quest replaces the top stack element with itself made optional.

			(*T) rightParen() error
				rightParen handles a )) in the input.

			(*T) swapVerticalBar() bool
				If the top of the stack is an element followed by an opVerticalBar,
				swapVerticalBar swaps the two and returns true.
				Otherwise it returns false.

			(*T) verticalBar() error
				verticalBar handles a || in the input.

			(*T) words(text, next string)
				words handles a block of words in the input.
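
		The hoisting that collapse describes, where a concat of a concat or
		an alternate of an alternate is flattened into one node, can be
		sketched as follows. The tree type and this collapse function are
		hypothetical simplifications: ops are strings rather than reOp
		values, and the sketch assumes each sub was itself built by
		collapse, so hoisting one level is enough.

```go
package main

import "fmt"

// tree is a hypothetical syntax node; it is not the package's reSyntax.
type tree struct {
	op   string  // "concat", "alt", or "word"
	word string  // literal (op "word")
	sub  []*tree // subexpressions (ops "concat" and "alt")
}

// collapse applies op to subs, hoisting the children of any sub that
// already has the same op so that op never nests directly inside itself.
func collapse(op string, subs []*tree) *tree {
	out := &tree{op: op}
	for _, s := range subs {
		if s.op == op {
			out.sub = append(out.sub, s.sub...) // hoist grandchildren
		} else {
			out.sub = append(out.sub, s)
		}
	}
	return out
}

func main() {
	a := &tree{op: "word", word: "a"}
	b := &tree{op: "word", word: "b"}
	c := &tree{op: "word", word: "c"}

	inner := collapse("concat", []*tree{a, b})
	flat := collapse("concat", []*tree{inner, c})  // a b c in one node
	nested := collapse("alt", []*tree{inner, c})   // different op: kept nested
	fmt.Println(len(flat.sub), len(nested.sub))
}
```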


	 type reProg ([]T)
		A reProg is a regexp program: an instruction list.

		Methods (only one, which is unexported)
			/* one unexported: */
			( T) string(d *Dict) string
				string returns a textual listing of the given program.
				The dictionary d supplies the actual words for the listing.

		As Outputs Of (at least one unexported)
			/* at least one unexported: */
			func reCompileMulti(list []reProg) reProg
		As Inputs Of (at least 3, none are exported)
			/* 3+ unexporteds: */
			func nfaStart(prog reProg) nfaState
			func reCompileDFA(prog reProg) reDFA
			func reCompileMulti(list []reProg) reProg

	 type reSyntax (struct)
		A reSyntax is a regexp syntax tree.

		Fields (total 4, none are exported)
			/* 4 unexporteds: */
			n int32
				// wildcard count (opWild)

			op reOp
				// opcode

			sub []*reSyntax
				// subexpressions (opConcat, opAlternate, opWild, opQuest)

			w []WordID
				// words (opWords)

		Methods (total 3, none are exported)
			/* 3 unexporteds: */
			(*T) compile(init reProg, m int32) (reProg, error)
				compile appends a program for the regular expression re to init and returns the result.
				A successful match of the program for re will report the match value m.

			(*T) leadingPhrases() []phrase
				leadingPhrases returns the set of possible initial phrases
				in any match of the given re syntax.

			(*T) string(d *Dict) string
				string returns a text form for the regexp syntax.
				The dictionary d supplies the word literals.

		As Outputs Of (at least one unexported)
			/* at least one unexported: */
			func reParse(d *Dict, s string, strict bool) (*reSyntax, error)
		As Inputs Of (at least 2, neither is exported)
			/* 2+ unexporteds: */
			func canMatchEmpty(re *reSyntax) bool
			func rePrint(b *bytes.Buffer, re *reSyntax, d *Dict)


Package-Level Functions (total 25, in which 2 are exported)

	 func NewMultiLRE(list []*LRE) (_ *MultiLRE, err error)
		NewMultiLRE returns a MultiLRE looking for the given LREs.
		All the LREs must have been parsed using the same Dict;
		if not, NewMultiLRE panics.

	 func ParseLRE(d *Dict, file, s string) (*LRE, error)
		ParseLRE parses the string s as a license regexp.
		The file name is used in error messages if non-empty.

	/* 23 unexporteds: */
	 func appendFoldRune(buf []byte, r rune) []byte
		appendFoldRune appends foldRune(r) to buf and returns the updated buffer.

	 func atBOL(s string, i int) bool
		atBOL reports whether i is at the beginning of a line (ignoring spaces) in s.

	 func atEOL(s string, i int) bool
		atEOL reports whether i is at the end of a line (ignoring spaces) in s.

	 func canMatchEmpty(re *reSyntax) bool
		canMatchEmpty reports whether re can match an empty text.

	 func canMisspell(want, have string) bool
		canMisspell reports whether want can be misspelled as have.
		Both words have been converted to lowercase already
		(want by the Dict, have by the caller).

	 func canMisspellJoin(want, have1, have2 string) bool
		canMisspellJoin reports whether want can be misspelled as the word pair have1, have2.
		All three words have been converted to lowercase already
		(want by the Dict, have1, have2 by the caller).

	 func foldRune(r rune) rune
		foldRune returns the folded rune r.
		It returns -1 if the rune r should be omitted entirely.

		Folding can be any canonicalizing transformation we want.
		For now folding means:
			- fold to consistent case (unicode.SimpleFold, but moving to lower-case afterward)
			- return -1 for (drop) combining grave and acute U+0300, U+0301
			- strip pre-combined graves and acutes on vowels:
				é to e, etc. (for Canadian or European licenses
				mentioning Québec or Commissariat à l'Energie Atomique)

		If necessary we could do a full Unicode-based conversion,
		but that will require more thought about exactly what to do
		and doing it efficiently. For now, the accents are enough.
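
		The three folding rules can be approximated in a few lines. The
		sketch below is not the package's foldRune: it uses unicode.ToLower
		where the description mentions unicode.SimpleFold, and the set of
		accented vowels it handles is an assumption. The names
		sketchFoldRune and sketchFold are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sketchFoldRune approximates the folding rules listed above.
// It returns -1 for runes that should be dropped.
func sketchFoldRune(r rune) rune {
	switch r {
	case '\u0300', '\u0301': // combining grave and acute accents
		return -1
	}
	r = unicode.ToLower(r)
	switch r {
	case '\u00e0', '\u00e1': // a with grave, a with acute
		return 'a'
	case '\u00e8', '\u00e9': // e with grave, e with acute
		return 'e'
	case '\u00ec', '\u00ed': // i with grave, i with acute
		return 'i'
	case '\u00f2', '\u00f3': // o with grave, o with acute
		return 'o'
	case '\u00f9', '\u00fa': // u with grave, u with acute
		return 'u'
	}
	return r
}

// sketchFold folds every rune of s; strings.Map drops runes mapped to
// a negative value, which is how the combining accents disappear.
func sketchFold(s string) string {
	return strings.Map(sketchFoldRune, s)
}

func main() {
	fmt.Println(sketchFold("Qu\u00e9bec"))                        // precomposed accent
	fmt.Println(sketchFold("Que\u0301bec"))                       // combining accent
	fmt.Println(sketchFold("Commissariat \u00c0 l'Energie Atomique"))
}
```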

	 func htmlEntitySize(t string) int
		htmlEntitySize returns the length of the HTML entity expression at the start of t, or else 0.

	 func htmlTagSize(t string) int
		htmlTagSize returns the length of the HTML tag at the start of t, or else 0.

	 func isWordContinue(r rune) bool
		isWordContinue reports whether r can appear in a word, after the start.

	 func isWordStart(r rune) bool
		isWordStart reports whether r can appear at the start of a word.

	 func markdownAnchorSize(t string) int
		markdownAnchorSize returns the length of the Markdown anchor at the start of t, or else 0.
		(like {#head})

	 func markdownLinkSize(t string) int
		markdownLinkSize returns the length of the Markdown link target at the start of t, or else 0.
		Instead of fully parsing Markdown, this looks for ](http:// or ](https://.

	 func nfaStart(prog reProg) nfaState
		nfaStart returns the start state for the NFA executing prog.

	 func nl(b *bytes.Buffer)
		nl guarantees b ends with a complete, non-empty line with no trailing spaces
		or has no lines at all.

	 func reCompileDFA(prog reProg) reDFA
		reCompileDFA compiles prog into a DFA.

	 func reCompileMulti(list []reProg) reProg
		reCompileMulti returns a program that matches any of the listed regexps.
		The regexp list[i] returns match value i when it matches.

	 func reParse(d *Dict, s string, strict bool) (*reSyntax, error)
		reParse parses a license regexp s
		and returns a reSyntax parse tree.
		reParse adds words to the dictionary d,
		so it is not safe to call reParse from concurrent goroutines
		using the same dictionary.
		If strict is false, the rules about operators at the start or end of line are ignored,
		to make trivial test expressions easier to write.

	 func rePrint(b *bytes.Buffer, re *reSyntax, d *Dict)
		rePrint prints re to b, using d for words.

	 func reSyntaxError(s string, i int, err error) error
		reSyntaxError returns a *SyntaxError with context.

	 func sortInt32s(x []int32)
	 func sortWordIDs(x []WordID)
	 func toFold(s string) string
		toFold converts s to folded form.


Package-Level Variables (total 4, in which 1 is exported)

	  var TraceDFA int
		TraceDFA controls whether DFA execution prints debug tracing when stuck.
		If TraceDFA > 0 and the DFA has followed a path of at least TraceDFA symbols
		since the last matching state but hits a dead end, it prints out information
		about the dead end.

	/* 3 unexporteds: */
	  var canonicalRewrites []struct{x string; y string}
		canonicalRewrites is a list of pairs that are canonicalized during word splitting.
		The words on the right are parsed as if they were the words on the left.
		This happens during dictionary splitting, so canMisspell will never see any
		of the words on the right.

	  var copyright []byte
		© is rewritten to this text.

	  var markdownLinkPrefixes []string
Package-Level Constants (total 19, in which 2 are exported)

	const AnyWord WordID = -2
		AnyWord represents a wildcard matching any word.

	const BadWord WordID = -1
		BadWord represents a word not present in the dictionary.

	/* 17 unexporteds: */
	const instAlt instOp = 3 // jump to both pc+1 and pc+1+arg
	const instAny instOp = 2 // match any word
	const instCut instOp = 6 // cut off the instAlt range starting at pc+1+arg
	const instInvalid instOp = 0
	const instJump instOp = 4 // jump to pc+1+arg
	const instMatch instOp = 5 // completed match identified by arg
	const instWord instOp = 1 // match specific word
	const opAlternate reOp = 5
	const opConcat reOp = 4
	const opEmpty reOp = 2
	const opLeftParen reOp = 9
	const opNone reOp = 1
	const opPseudo reOp = 8
		pseudo-ops during parsing

	const opQuest reOp = 7
	const opVerticalBar reOp = 10
	const opWild reOp = 6
	const opWords reOp = 3


The pages are generated with Golds v0.3.2-preview. (GOOS=darwin GOARCH=amd64)
Golds is a Go 101 project developed by Tapir Liu.