So far, your web browser sees web pages as a stream of open tags, close tags, and text. But HTML is actually a tree, and though the tree structure hasn’t been important yet, it will be once backgrounds, margins, and CSS enter the picture. So this chapter adds a proper HTML parser and converts the layout engine to use it.
The HTML treeThis is
the tree that is usually called the DOM tree, for Document
Object Model. We’ll keep calling it the HTML tree for
now. has one node for each open and close tag pair and for
each span of text.In
reality there are other types of nodes too, like comments, doctypes, and
CDATA
sections, and processing instructions. There are even
some deprecated types! So for our browser to be a tree,
tokens need to evolve into nodes. That means adding a list of children
and a parent pointer to each one. Here’s the new Text
class:The
children
field of a Text
node will always be
empty; I’m defining it here to make it easier to write code that handles
Text
and Element
nodes
simultaneously.
class Text:
def __init__(self, text, parent):
self.text = text
self.children = []
self.parent = parent
Since it takes two tags (the open and the close tag) to make a node,
let’s rename the Tag
class to Element
, and
make it look like this:
class Element:
def __init__(self, tag, parent):
self.tag = tag
self.children = []
self.parent = parent
I added a children
field to both Text
and
Element
, even though text nodes never have children. That’s
for consistency, to avoid isinstance
calls throughout the
code.
Constructing a tree of nodes from source code is called parsing. A parser builds a tree one element or text node at a time. But that means the parser needs to store an incomplete tree. For example, suppose the parser has so far read this bit of HTML:
<html><head></head><body><h1>This is my webpage
The parser has seen five tags (and one text node). The rest of the
HTML will contain more open tags, close tags, and text; but no matter
which tokens it sees, no new nodes will be added to the
<head>
tag, which has already been closed. So that
node is “finished”. But the other nodes are unfinished: more children
can be added to the <html>
,
<body>
, and <h1>
nodes, depending
on what HTML comes next.
Since the parser reads the HTML file from left to right, these unfinished tags are always in a certain part of the tree. The unfinished tags have always been opened but not yet closed; they are always to the right of the finished nodes; and they are always children of other unfinished tags. To leverage these facts, let’s represent an incomplete tree by storing a list of unfinished tags, ordered with parents before children. The first node in the list is the root of the HTML tree; the last node in the list is the most recent unfinished tag.In Python, and most other languages, it’s faster to add and remove from the end of a list, instead of the beginning.
Parsing is a little more complex than lex
, so we’re
going to want to break it into several functions, organized in a new
HTMLParser
class. That class can also store the source code
it’s analyzing and the incomplete tree:
class HTMLParser:
def __init__(self, body):
self.body = body
self.unfinished = []
Before the parser starts, it hasn’t seen any tags at all, so the
unfinished
list storing the tree starts empty. But as the
parser reads tokens, that list fills up. Let’s start that by renaming
the lex
function we have now, aspirationally, to
parse
:
class HTMLParser:
def parse(self):
# ...
We’ll need to do a bit of surgery on parse
. Right now
parse
creates Tag
and Text
objects and appends them to the out
array. We need it to
create Element
and Text
objects and add them
to the unfinished
tree. Since a tree is a bit more complex
than a list, I’ll move the adding-to-a-tree logic to two new methods
add_text
and add_tag
.
def parse(self):
= ""
text = False
in_tag for c in self.body:
if c == "<":
= True
in_tag if text: self.add_text(text)
= ""
text elif c == ">":
= False
in_tag self.add_tag(text)
= ""
text else:
+= c
text if not in_tag and text:
self.add_text(text)
return self.finish()
The out
variable is gone, and note that I’ve also moved
the return value to a new finish
method, which converts the
incomplete tree to the final, complete tree. So: how do we add things to
the tree?
HTML derives from a long line of document processing systems. Its
predecessor, SGML,
traces back to RUNOFF and is
a sibling to troff, now used for Linux
man pages. The committee that
standardized SGML now works on the .odf
,
.docx
, and .epub
formats.
Let’s talk about adding nodes to a tree. To add a text node we add it as a child of the last unfinished node:
class HTMLParser:
def add_text(self, text):
= self.unfinished[-1]
parent = Text(text, parent)
node parent.children.append(node)
On the other hand, tags are a little more complex since they might be an open or a close tag:
class HTMLParser:
def add_tag(self, tag):
if tag.startswith("/"):
# ...
else:
# ...
A close tag removes an unfinished node, by finishing it, and add it to the next unfinished node in the list:
def add_tag(self, tag):
if tag.startswith("/"):
= self.unfinished.pop()
node = self.unfinished[-1]
parent
parent.children.append(node)# ...
An open tag instead adds an unfinished node to the end of the list:
def add_tag(self, tag):
# ...
else:
= self.unfinished[-1]
parent = Element(tag, parent)
node self.unfinished.append(node)
Once the parser is done, it turns our incomplete tree into a complete tree by just finishing any unfinished nodes:
class HTMLParser:
def finish(self):
if len(self.unfinished) == 0:
self.add_tag("html")
while len(self.unfinished) > 1:
= self.unfinished.pop()
node = self.unfinished[-1]
parent
parent.children.append(node)return self.unfinished.pop()
This is almost a complete parser, but it doesn’t quite work at the beginning and end of the document. The very first open tag is an edge case without a parent:
def add_tag(self, tag):
# ...
else:
= self.unfinished[-1] if self.unfinished else None
parent # ...
The very last tag is also an edge case, because there’s no unfinished node to add it to:
def add_tag(self, tag):
if tag.startswith("/"):
if len(self.unfinished) == 1: return
# ...
Ok, that’s all done. Let’s test our parser out and see how well it works!
The ill-considered Javascript document.write
method
allows Javascript to modify the HTML source code while it’s being
parsed! Modern browsers use speculative
parsing to make this fast and avoid evaluating Javascript while
parsing.
How do we know our parser does the right thing—that it builds the right tree? Well the place to start is seeing the tree it produces. We can do that with a quick, recursive pretty-printer:
def print_tree(node, indent=0):
print(" " * indent, node)
for child in node.children:
+ 2) print_tree(child, indent
Here we’re printing each node in the tree, and using indentation to
show the tree structure. Since we need to print each node, it’s worth
taking the time to give them a nice printed form, which in Python means
defining the __repr__
function:
class Text:
def __repr__(self):
return repr(self.text)
class Element:
def __repr__(self):
return "<" + self.tag + ">"
Try this out on this web page, parsing the HTML source code and then
calling print_tree
to visualize it:
= request(sys.argv[1])
headers, body = HTMLParser(body).parse()
nodes print_tree(nodes)
Run it on this web page, and you’ll see something like this:
<!doctype html>
'\n'
<html lang="en-US" xml:lang="en-US">
'\n'
<head>
'\n '
<meta charset="utf-8" />
'\n '
<meta name="generator" content="pandoc" />
'\n '
Immediately a couple of things stand out. Let’s start at the top,
with the <!doctype html>
tag.
This special tag, called a doctype, is always the very first thing in an HTML document. But it’s not really an element at all, nor is it supposed to have a close tag. Our toy browser won’t be using the doctype for anything, so it’s best to throw it away:Real browsers use doctypes to switch between standards-compliant and legacy parsing and layout modes.
def add_tag(self, tag):
if tag.startswith("!"): return
# ...
This ignores all tags that start with an exclamation mark, which not
only throws out doctype declarations but also most comments, which in
HTML are written <!-- comment text -->
.
Just throwing out doctypes isn’t quite enough though—if you run your
parser now, it will crash. That’s because after the doctype comes a
newline, which our parser treats as text and tries to insert into the
tree. Except there isn’t a tree, since the parser hasn’t seen any open
tags. For simplicity, let’s just have our browser skip whitespace-only
text nodes to side-step the problem:Real browsers retain whitespace to correctly render
make<span></span>up
as one word and
make<span> </span>up
as two. Our browser won’t.
Plus, ignoring whitespace simplifies later
chapters by avoiding a special-case for whitespace-only text
tags.
def add_text(self, text):
if text.isspace(): return
# ...
The parsed HTML tree now looks like this:
<html lang="en-US" xml:lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes" />
<meta name="author" content="Pavel Panchekha & Chris Harrelson" />
<link rel="stylesheet" href="book.css" />
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Vollkorn%7CLora&display=swap" />
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Vollkorn:400i%7CLora:400i&display=swap" />
<title>
Why’s everything so deeply indented? Why aren’t these open elements ever closed?
In SGML, document type declarations had a URL to define the valid
tags. Browsers use the absence of a document type declaration to identify
very old, pre-SGML versions of HTML,There’s also this crazy thing called “almost standards” or “limited
quirks” mode, due to a backwards-incompatible change in table cell
vertical layout. Yes. I don’t need to make these up! but
don’t use the URL, so <!doctype html>
is the best
document type declaration for HTML.
Elements like <meta>
and <link>
are what are called self-closing: these tags don’t surround content, so
you don’t ever write </meta>
or
</link>
. Our parser needs special support for them.
In HTML, there’s a specific
list of these self-closing tags (the spec calls them “void”
tags):A lot of these
tags are obscure. Browsers also support some additional, obsolete
self-closing tags not listed here, like
keygen
.
= [
SELF_CLOSING_TAGS "area", "base", "br", "col", "embed", "hr", "img", "input",
"link", "meta", "param", "source", "track", "wbr",
]
Our parser needs to auto-close tags from this list:
def add_tag(self, tag):
# ...
elif tag in self.SELF_CLOSING_TAGS:
= self.unfinished[-1]
parent = Element(tag, parent)
node parent.children.append(node)
This code is right, but if you test it out it won’t seem to help. Why
not? Our parser is looking for a tag named meta
, but it’s
finding a tag named “meta name=...
”. The self-closing code
isn’t triggered because the <meta>
tag has
attributes.
HTML attributes add information about an element; open tags can have any number of attributes. Attribute values can be quoted, unquoted, or omitted entirely. Let’s focus on basic attribute support, ignoring values that contain whitespace, which are a little complicated.
Since we’re not handling whitespace in values, we can split on whitespace to get the tag name and the attribute-value pairs:
class HTMLParser:
def get_attributes(self, text):
= text.split()
parts = parts[0].lower()
tag = {}
attributes for attrpair in parts[1:]:
# ...
return tag, attributes
HTML tag names are case-insensitive,This is not the right way to do case insensitive comparisons; the Unicode case folding algorithm should be used if you want to handle languages other than English. But in HTML specifically, tag names only use the ASCII characters so lower-casing them is sufficient. as by the way are attribute values, so I convert them to lower case. Then, inside the loop, I split each attribute-value pair into a name and a value. The easiest case is an unquoted attribute, where an equal sign separates the two:
def get_attributes(self, text):
# ...
for attrpair in parts[1:]:
if "=" in attrpair:
= attrpair.split("=", 1)
key, value = value
attributes[key.lower()] # ...
The value can also be omitted, like in
<input disabled>
, in which case the attribute value
defaults to the empty string:
for attrpair in parts[1:]:
# ...
else:
= "" attributes[attrpair.lower()]
Finally, the value can be quoted, in which case the quotes have to be stripped out:Quoted attributes allow whitespace between the quotes. That requires something like a finite state machine instead of just splitting on whitespace.
if "=" in attrpair:
# ...
if len(value) > 2 and value[0] in ["'", "\""]:
= value[1:-1]
value # ...
We’ll store these attributes inside Element
s:
class Element:
def __init__(self, tag, attributes, parent):
self.tag = tag
self.attributes = attributes
# ...
That means we’ll need to call get_attributes
at the top
of add_tag
, to get the attributes
we need to
construct an Element
.
def add_tag(self, tag):
= self.get_attributes(tag) tag, attributes
Remember to use tag
and attribute
instead
of text
in add_tag
, and try your parser
again:
<html>
<head>
<meta>
<meta>
<meta>
<meta>
<link>
<link>
<link>
<title>
It’s close! Yes, if you print the attributes, you’ll see that
attributes with whitespace (like author
on the fourth
meta
tag) are mis-parsed as multiple attributes, and the
final slash on the self-closing tags is incorrectly treated as an extra
attribute. A better parser would fix these issues. But let’s instead
leave our parser as is—these issues aren’t going to be a problem for the
toy browser we’re building—and move on to integrating it with our
browser.
Putting a slash at the end of self-closing tags, like
<br/>
, became fashionable when XHTML looked like it might
replace HTML, and old-timers like me never broke the habit. But unlike
in XML, in HTML
self-closing tags are identified by name, not by some special syntax, so
the slash is optional.
Right now, the Layout
class works token-by-token; we now
want it to go node-by-node instead. So let’s separate the old
token
method into three parts: all the cases for open tags
will go into a new open_tag
method; all the cases for close
tags will go into a new close_tag
method; and instead of
having a case for text tokens our browser can just call the existing
text
method directly:
class Layout:
def open_tag(self, tag):
if tag == "i":
self.style = "italic"
# ...
def close_tag(self, tag):
if tag == "i":
self.style = "roman"
# ...
Now we need the Layout
object to walk the node tree,
calling open_tag
, close_tag
, and
text
in the right order:
def recurse(self, tree):
if isinstance(tree, Text):
self.text(tree)
else:
self.open_tag(tree.tag)
for child in tree.children:
self.recurse(child)
self.close_tag(tree.tag)
The Layout
constructor can now call recurse
instead of looping through the list of tokens. We’ll also need the
browser to construct the node tree, like this:
class Browser:
def load(self, url):
= request(url)
headers, body self.nodes = HTMLParser(body).parse()
self.display_list = Layout(self.nodes).display_list
self.draw()
Run it—the browser should now work off of the parsed HTML tree.
Prior to the invention of CSS, some browsers supported web page
styling using attributes like bgcolor
and
vlink
(the color of visited links) and tags like
font
. These are
obsolete, but browsers still support some of them.
The parser now handles HTML pages correctly—at least when the HTML is
written by the sorts of goody-two-shoes programmers who remember the
<head>
tag, close every open tag, and make their bed
in the morning. Mere mortals lack such discipline and so browsers also
have to handle broken, confusing, headless HTML. In fact, modern HTML
parsers are capable of transforming any string of characters
into an HTML tree, no matter how confusing the markup.Yes, it’s crazy, and for a few
years in the early ’00s the W3C tried to do away with it. They
failed.
The full algorithm is, as you might expect, complicated beyond belief, with dozens of ever-more-special cases forming a taxonomy of human error, but one of its nicer features is implicit tags. Normally, an HTML document starts with a familiar boilerplate:
<!doctype html>
<html>
<head>
</head>
<body>
</body>
</html>
In reality, all six of these tags, except the doctype, are
optional: browsers insert them automatically. Let’s add support for
implicit tags to our browser via a new implicit_tags
function that adds implicit tags when the web page omits them. We’ll
want to call it in both add_text
and
add_tag
:
class HTMLParser:
def add_text(self, text):
if text.isspace(): return
self.implicit_tags(None)
# ...
def add_tag(self, tag):
= self.get_attributes(tag)
tag, attributes if tag.startswith("!"): return
self.implicit_tags(tag)
# ...
Note that implicit_tags
isn’t called for the ignored
whitespace and doctypes. The argument to implicit_tags
is
the tag name (or None
for text nodes), which we’ll compare
to the list of unfinished tags to determine what’s been omitted:
class HTMLParser:
def implicit_tags(self, tag):
while True:
= [node.tag for node in self.unfinished]
open_tags # ...
implicit_tags
has a loop because more than one tag could
have been omitted in a row; every iteration around the loop will add
just one. To determine which implicit tag to add, if any, requires
examining the open tags and the tag being inserted.
Let’s start with the easiest case, the implicit
<html>
tag. An implicit <html>
tag
is necessary if the first tag in the document is something other than
<html>
:
while True:
# ...
if open_tags == [] and tag != "html":
self.add_tag("html")
Both <head>
and <body>
can also
be omitted, but to figure out which it is we need to look at which tag
is being added:
while True:
# ...
elif open_tags == ["html"] \
and tag not in ["head", "body", "/html"]:
if tag in self.HEAD_TAGS:
self.add_tag("head")
else:
self.add_tag("body")
Here, HEAD_TAGS
lists the tags that you’re supposed to
put into the <head>
element:The
<script>
tag can go in either the head or the body
section, but it goes into the head by default.
class HTMLParser:
= [
HEAD_TAGS "base", "basefont", "bgsound", "noscript",
"link", "meta", "title", "style", "script",
]
Note that if both the <html>
and
<head>
tags are omitted, implicit_tags
is going to insert both of them by going around the loop twice. In the
first iteration open_tags
is []
, so the code
adds an <html>
tag; then, in the second iteration,
open_tags
is ["html"]
so it adds a
<head>
tag.These add_tag
methods themselves call
implicit_tags
, which means you can get into an infinite
loop if you forget a case. Remember that every time you add a tag in
implicit_tags
, that tag itself shouldn’t trigger more
implicit tags.
Finally, the </head>
tag can also be implicit, if
the parser is inside the <head>
and sees an element
that’s supposed to go in the <body>
:
while True:
# ...
elif open_tags == ["html", "head"] and \
not in ["/head"] + self.HEAD_TAGS:
tag self.add_tag("/head")
Technically, the </body>
and
</html>
tags can also be implicit. But since our
finish
function already closes any unfinished tags, that
doesn’t need any extra code. So all that’s left for
implicit_tags
tags is to exit out of the loop:
while True:
# ...
else:
break
Of course, there are more rules for handling malformed HTML: formatting tags, nested paragraphs, embedded SVG and MathML, and all sorts of other complexity. Each has complicated rules abounding with edge cases. But let’s end our discussion of handling author errors here.
The rules for malformed HTML may seem arbitrary, and they are: they evolved over years of trying to guess what people “meant” when they wrote that HTML, and are now codified in the HTML parsing standard. Of course, sometimes these rules “guess” wrong—but as so often happens on the web, it’s often more important that every browser does the same thing, rather than each trying to guess what the right thing is.
Thanks to implicit tags, you can mostly skip the
<html>
, <body>
, and
<head>
elements, and they’ll be implicitly added back
for you. Nor does writing them explicitly let you do anything weird; the
HTML parser’s many
states guarantee that there’s only one <head>
and
one <body>
.At least, per document. An HTML file that uses frames or
templates can have more than one <head>
and
<body>
, but they correspond to different
documents.
This chapter taught our browser that HTML is a tree, not just a flat list of tokens. We added:
The tree structure of HTML is essential to display visually complex web pages, as we will see in the next chapter.
The complete set of functions, classes, and methods in our browser should look something like this:
def request(url)
WIDTH
HEIGHT
HSTEP
VSTEP
SCROLL_STEP
FONTS
def get_font(size, weight, slant)
class Text:
def __init__(text, parent)
def __repr__()
class Element:
def __init__(tag, attributes, parent)
def __repr__()
def print_tree(node, indent)
class HTMLParser:
def __init__(body)
def parse()
def get_attributes(text)
def add_text(text)
SELF_CLOSING_TAGS
def add_tag(tag)
HEAD_TAGS
def implicit_tags(tag)
def finish()
class Layout:
def __init__(tree)
def recurse(tree)
def open_tag(tag)
def close_tag(tag)
def text(node)
def flush()
class Browser:
def __init__()
def load(url)
def draw()
def scrolldown(e)
if __name__ == "__main__"
Comments: Update the HTML lexer to support comments.
Comments in HTML begin with <!--
and end with
-->
. However, comments aren’t the same as tags: they can
contain any text, including left and right angle brackets. The lexer
should skip comments, not generating any token at all. Check: is
<!-->
a comment, or does it just start one?
Paragraphs: It’s not clear what it would mean for one
paragraph to contain another. Change the parser so that a document like
<p>hello<p>world</p>
results in two
sibling paragraphs instead of one paragraph inside another; real
browsers do this too.
Scripts: JavaScript code embedded in a
<script>
tag uses the left angle bracket to mean
less-than. Modify your lexer so that the contents of
<script>
tags are treated specially: no tags are
allowed inside <script>
, except the
</script>
close tag.Technically it’s just
</script
followed by a space,
tab, \v
, \r
, slash, or greater than sign.
If you need to talk about </script>
tags inside
JavaScript code, you have to split it into multiple
strings.
Quoted attributes: Quoted attributes can contain spaces and
right angle brackets. Fix the lexer so that this is supported properly.
Hint: the current lexer is a finite state machine, with two states
(determined by in_tag
). You’ll need more states.
Syntax Highlighting: Implement the view-source:
protocol as in Chapter 1, but make it
syntax-highlight the source code of HTML pages. Keep source code for
HTML tags in a normal font, but make text contents bold. If you’ve
implemented it, wrap text in <pre>
tags as well to
preserve line breaks. Hint: subclass the HTML parser and use it to
implement your syntax highlighter.
Did you find this chapter useful?