In this chapter, we develop a top-down, LL parser that can recognize context-free grammars.
A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process.
Parser[+A]
type Parser[+A] =
Location => XorErrors[String, (A, Int)]
Developed from the same functional design concepts that gave us:
type Rand[A] = RNG => (A, RNG)
Parser[+A]
case class Location(
  input: String, offset: Int) {
...
}
A Parser[+A] will scan its input from a given offset.
Given Location("foobar", 3), a Parser will look at "bar".
This contrasts with an alternative approach that makes offset unnecessary: cutting the consumed characters off the input string and passing along the tail. With that approach the Parser's feedback is less informative, because the first part of the string is lost if an error occurs down the line.
Parser[+A]
Left: Parser rejects input
Right: Parser accepts input
XorErrors[String, (A, Int)]
// equivalent to
Xor[NonEmptyList[String], (A, Int)]
Inside Left: NonEmptyList[String] accumulates errors. These Strings should explain why Parser[+A] rejected its input.
Inside Right: (A, Int) holds a token of type A and the number of characters consumed to create this token.
We will develop various types of token.
Primitive Parser
Here, the token is a String.
def string(detect: String): Parser[String] =
  (loc: Location) => {
    val matches: Boolean =
      detect.regionMatches(0, loc.input,
        loc.offset, detect.length())
    if (matches)
      Right((detect, detect.length()))
    else
      Left(NonEmptyList(
        s"$detect not in ${loc.input} at offset ${loc.offset}"))
  }
run
Our Parsers require a Location as input. run wraps the input String in a Location.
def run[A](parserA: Parser[A])(input: String):
    XorErrors[String, (A, Int)] =
  parserA(Location(input, 0))
val detectFoo: Parser[String] = string("foo")
val document = "foobar"
val resultFoo:
XorErrors[String, (String, Int)] =
run(detectFoo)(document)
// `parserResultToString`
// formats the output `Xor` nicely
println(parserResultToString(resultFoo))
[info] Running slideCode.lecture9.SimpleParser...
document
foobar
detect 'foo'
Right: Accept. token tree = foo |
chars consumed = 3
We need an "end of field" Parser, equivalent to the $ or \z regular expression.
Regex type
scala> "\\z".r
res1: scala.util.matching.Regex = \z
scala> "\\z".r.regex
res2: String = \z
Regex type
// input example: "\\z".r
// match r at the current offset (delegating to string(r.regex)
// would match the pattern text literally)
def regex(r: Regex): Parser[String] =
  loc => r.findPrefixOf(loc.input.drop(loc.offset)) match {
    case Some(m) => Right((m, m.length))
    case None => Left(NonEmptyList(
      s"${r.regex} not in ${loc.input} at offset ${loc.offset}"))
  }
regex constructs a Parser[String] that accepts input matching a given Scala Regex.
val eof: Parser[String] =
regex("\\z".r)
Now we have an "end of field" Parser, equivalent to "$"
How do we combine detectFoo and eof?
val detectFoo: Parser[String] = string("foo")
val eof: Parser[String] = regex("\\z".r)
product
product puts two Parsers in sequence.
def product[A,B](parserA: Parser[A],
parserB: => Parser[B]):
Parser[(A,B)] = {
def f(a: A, b: => B): (A,B) = (a,b)
map2(parserA, parserB)(f)
}
If both parsers accept their input, they will produce tokens A and B.
Parser[A] consumes as much as it needs to, then sets the offset for Parser[B].
map2 and its dependencies will be shown later.
val detectFooEOF = product(detectFoo, eof)
val resultFooEOF = run(detectFooEOF)(document)
println(parserResultToString(resultFooEOF))
document
foobar
detect 'foo' with end-of-field
Left: Reject.
Parse errors: \z not in foobar at offset 3
def flatMap[A,B](parserA: Parser[A])
  (aParserB: A => Parser[B]): Parser[B] = ...

def map[A,B](parserA: Parser[A])
  (f: A => B): Parser[B] = ...

def map2[A,B,C](parserA: Parser[A],
  parserB: => Parser[B])
  (f: (A, => B) => C): Parser[C] = ...
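To make the signatures above concrete, here is a minimal, self-contained sketch of how flatMap, map and map2 might be implemented for this chapter's Parser representation. Two assumptions for self-containment: XorErrors is simplified to Either[List[String], _] (standing in for Xor[NonEmptyList[String], _]), and map2's function parameter is strict rather than by-name.

```scala
object CombinatorSketch {
  case class Location(input: String, offset: Int)
  type Parser[+A] = Location => Either[List[String], (A, Int)]

  def string(detect: String): Parser[String] =
    (loc: Location) =>
      if (detect.regionMatches(0, loc.input, loc.offset, detect.length))
        Right((detect, detect.length))
      else
        Left(List(s"$detect not in ${loc.input} at offset ${loc.offset}"))

  // succeed consumes nothing and always produces `a`
  def succeed[A](a: A): Parser[A] = (loc: Location) => Right((a, 0))

  // Run parserA; on success, advance the offset by the characters
  // consumed, then run the parser produced by aParserB
  def flatMap[A, B](parserA: Parser[A])(aParserB: A => Parser[B]): Parser[B] =
    (loc: Location) =>
      parserA(loc) match {
        case Left(errors) => Left(errors)
        case Right((a, consumedA)) =>
          aParserB(a)(loc.copy(offset = loc.offset + consumedA)) match {
            case Left(errors)          => Left(errors)
            case Right((b, consumedB)) => Right((b, consumedA + consumedB))
          }
      }

  def map[A, B](parserA: Parser[A])(f: A => B): Parser[B] =
    flatMap(parserA)(a => succeed(f(a)))

  def map2[A, B, C](parserA: Parser[A], parserB: => Parser[B])(
      f: (A, B) => C): Parser[C] =
    flatMap(parserA)(a => map(parserB)(b => f(a, b)))
}
```

For example, map2(string("foo"), string("bar"))((a, b) => (a, b)) applied at Location("foobar", 0) yields Right((("foo", "bar"), 6)).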
Rand
Recall that unit for Rand always generated the same "random" value.
def unit[A](a: A): Rand[A] =
rng => (a, rng)
Analogously, unit for Parser always accepts.
def succeed[A](a: A): Parser[A] =
(loc: Location) => Right((a, 0))
def unit[A](a: A): Parser[A] = succeed(a)
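As a quick check, here is a tiny self-contained sketch (Either stands in for XorErrors, an assumption for self-containment) showing that succeed accepts any input while consuming zero characters:

```scala
object SucceedSketch {
  case class Location(input: String, offset: Int)
  type Parser[+A] = Location => Either[List[String], (A, Int)]

  // Always accepts: produces `a` and consumes 0 characters
  def succeed[A](a: A): Parser[A] = (loc: Location) => Right((a, 0))
  def unit[A](a: A): Parser[A] = succeed(a)

  def run[A](parserA: Parser[A])(input: String): Either[List[String], (A, Int)] =
    parserA(Location(input, 0))
}
```

SucceedSketch.run(SucceedSketch.unit(42))("anything") evaluates to Right((42, 0)).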
Rand
Recall that with the Rand combinators, RNG was passed implicitly: one fewer place to make an error by mishandling RNG.
type Rand[A] = RNG => (A, RNG)
In Parser, Location(input, offset) is passed implicitly.
type Parser[+A] =
Location => XorErrors[String, (A, Int)]
The Int on the right side of the Xor is used to advance the Location's offset -- consuming characters of the input string.
def advanceParserResult[A](
xor: XorErrors[String,(A,Int)], consumed: Int):
XorErrors[String, (A, Int)] = ...
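One plausible implementation of advanceParserResult, sketched under the simplified Either representation (an assumption; the book's version may differ): on success it adds the characters already consumed, on failure it passes the errors through unchanged.

```scala
object AdvanceSketch {
  type XorErrors[A] = Either[List[String], A]

  // Add `consumed` characters to a successful result's count;
  // leave errors untouched
  def advanceParserResult[A](
      xor: XorErrors[(A, Int)], consumed: Int): XorErrors[(A, Int)] =
    xor match {
      case Right((a, n)) => Right((a, n + consumed))
      case Left(errors)  => Left(errors)
    }
}
```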
We mentioned earlier: "Parser[A] consumes as much as it needs to, then sets the offset for Parser[B]."
def product[A,B](parserA: Parser[A],
parserB: => Parser[B]):
Parser[(A,B)] = {
def f(a: A, b: => B): (A,B) = (a,b)
map2(parserA, parserB)(f)
}
Do you see Location or offset anywhere in here? They are passed implicitly.
def or[A](p1: Parser[A], p2: => Parser[A]):
Parser[A] = ...
or is our first clue that we are building an LL parser: it finds the leftmost accepted derivation of the syntax tree. We give priority to the left input of or.
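A sketch of how or might be implemented under our representation (again assuming Either stands in for XorErrors): p1 is tried first, and only if it rejects do we try p2 at the same Location, accumulating the errors of both branches on double failure -- consistent with the two accumulated errors in the example output below.

```scala
object OrSketch {
  case class Location(input: String, offset: Int)
  type Parser[+A] = Location => Either[List[String], (A, Int)]

  // Try p1 first; only if it rejects, try p2 at the same Location.
  // On double failure, errors from both branches are accumulated.
  def or[A](p1: Parser[A], p2: => Parser[A]): Parser[A] =
    (loc: Location) =>
      p1(loc) match {
        case Right(success) => Right(success)
        case Left(errors1) =>
          p2(loc) match {
            case Right(success) => Right(success)
            case Left(errors2)  => Left(errors1 ++ errors2)
          }
      }
}
```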
sealed trait Alphabet
case class X(nested: Alphabet) extends Alphabet
case class Y(nested: Alphabet) extends Alphabet
case object Z extends Alphabet
"Start" -> X
X -> xY
Y -> yX | yZ
Z -> z
These strings parse:
"xyz" into X(Y(Z))
"xyxyxyz" into X(Y(X(Y(X(Y(Z))))))
"xyxyxyxyxyz" into X(Y(X(Y(X(Y(X(Y(X(Y(Z))))))))))
These strings are rejected:
"xxxyyyz"
"xyyyz"
(even though the X, Y and Z tokens could represent these strings)
document: xyxyxyxyz
Right: Accept.
token tree = X(Y(X(Y(X(Y(X(Y(Z))))))))
chars consumed = 9
------------
document: xyyyxyz
Left: Reject.
Parse errors:
x not in xyyyxyz at offset 2
z not in xyyyxyz at offset 2
sealed trait Alphabet
case class AC(nested: Alphabet) extends Alphabet {
override def toString = "A"+nested.toString+"C"
}
case object B extends Alphabet {
override def toString = "B"
}
"Start" -> aBc
B -> aBc | b
These strings parse:
"abc" into AC(B)
"aabcc" into AC(AC(B))
These strings are rejected:
"aaabcc"
"aabbcc"
// Sequences two parsers,
// ignoring the result of the first.
def skipL[B](p: Parser[Any], p2: => Parser[B]):
Parser[B] =
map2(p, p2)((_,b) => b)
// Sequences two parsers,
// ignoring the result of the second.
def skipR[A](p: Parser[A], p2: => Parser[Any]):
Parser[A] =
map2(p, p2)((a,_) => a)
def surround[A](left: Parser[Any],
right: Parser[Any])
(middle: => Parser[A]): Parser[A] =
skipL(left, skipR(middle, right))
// necessary for "aBc"
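A hypothetical, self-contained sketch of parsers for the aBc grammar, using surround as described (parserABC and parserB are illustrative names, not from the source; Either stands in for XorErrors, and map2's function parameter is strict):

```scala
object ABCSketch {
  case class Location(input: String, offset: Int)
  type Parser[+A] = Location => Either[List[String], (A, Int)]

  sealed trait Alphabet
  case class AC(nested: Alphabet) extends Alphabet
  case object B extends Alphabet

  def string(detect: String): Parser[String] = loc =>
    if (detect.regionMatches(0, loc.input, loc.offset, detect.length))
      Right((detect, detect.length))
    else Left(List(s"$detect not in ${loc.input} at offset ${loc.offset}"))

  def flatMap[A, B](pa: Parser[A])(f: A => Parser[B]): Parser[B] = loc =>
    pa(loc) match {
      case Left(errors) => Left(errors)
      case Right((a, n)) =>
        f(a)(loc.copy(offset = loc.offset + n)) match {
          case Left(errors)  => Left(errors)
          case Right((b, m)) => Right((b, n + m))
        }
    }

  def succeed[A](a: A): Parser[A] = _ => Right((a, 0))

  def map[A, B](pa: Parser[A])(f: A => B): Parser[B] =
    flatMap(pa)(a => succeed(f(a)))

  def map2[A, B, C](pa: Parser[A], pb: => Parser[B])(f: (A, B) => C): Parser[C] =
    flatMap(pa)(a => map(pb)(b => f(a, b)))

  def or[A](p1: Parser[A], p2: => Parser[A]): Parser[A] = loc =>
    p1(loc) match {
      case Right(ok) => Right(ok)
      case Left(e1) =>
        p2(loc) match {
          case Right(ok) => Right(ok)
          case Left(e2)  => Left(e1 ++ e2)
        }
    }

  def skipL[B](p: Parser[Any], p2: => Parser[B]): Parser[B] =
    map2(p, p2)((_, b) => b)

  def skipR[A](p: Parser[A], p2: => Parser[Any]): Parser[A] =
    map2(p, p2)((a, _) => a)

  def surround[A](left: Parser[Any], right: Parser[Any])(
      middle: => Parser[A]): Parser[A] =
    skipL(left, skipR(middle, right))

  // B -> aBc | b  (nested aBc is tried first: leftmost priority)
  def parserB: Parser[Alphabet] =
    or(parserABC, map(string("b"))(_ => B))
  // "Start" -> aBc
  def parserABC: Parser[Alphabet] =
    map(surround(string("a"), string("c"))(parserB))(AC(_))
}
```

Running parserABC on Location("aabcc", 0) yields Right((AC(AC(B)), 5)).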
Parsing without EOF at end
[info] Running slideCode.lecture9.ABC
document: abc
Right: Accept. token tree = ABC |
chars consumed = 3
---------------------------
document: abcccc
Right: Accept. token tree = ABC |
chars consumed = 3
Parsing with EOF at end
document: abcccc
Left: Reject. Parse errors: \z not in abcccc
at offset 3
---------------------------
document: aaabccc
Right: Accept. token tree = AAABCCC |
chars consumed = 7
We've attempted to emphasize how more complex grammars require more complex combinators.
Just as complex grammars build on simple grammars, complex combinators build on simple combinators.
The chapter is primarily about functional design.
It is a difficult first introduction to parsing and grammars.
Committing is an important feature we have omitted. It gives the user of the parser control over backtracking.
Our simplified implementation hard-wires a Parser
as:
type Parser[+A] =
Location => XorErrors[String, (A, Int)]
The book's implementation is considerably more complicated.
In the book, the type of Parser changes frequently throughout the chapter to demonstrate top-down reasoning about possible implementations, and algebraic design. Different Parser[+A] inputs and outputs are explored.