Formatting Text


In the last chapter, your web browser created a graphical window and drew a grid of characters to it. That’s OK for Chinese, but English text features characters of different widths and words that you can’t break across lines. (There are lots of languages in the world, and lots of typographic conventions. A real web browser supports every language from Arabic to Zulu, but this book focuses on English. Text is near-infinitely complex, but this book cannot be infinitely long!) In this chapter, we’ll add those capabilities. You’ll be able to read this page in your browser!

What is a font?

So far, we’ve called create_text with a character and two coordinates to write text to the screen. But we never specified the font, the size, or the color. To talk about those things, we need to create and use font objects.

What is a font, exactly? Well, in the olden days, printers arranged little metal slugs on rails, covered them with ink, and pressed them to a sheet of paper, creating a printed page. The metal shapes came in boxes, one per letter, so you’d have a (large) box of e’s, a (small) box of x’s, and so on. The boxes came in cases (one for uppercase and one for lowercase letters). The set of cases was called a font. (The word is related to foundry, which would create the little metal shapes.) Naturally, if you wanted to print larger text, you needed different (bigger) shapes, so those were a different font; a collection of fonts was called a type, which is why we call it typing. Variations, like bold or italic letters, were called that type’s “faces”.

This nomenclature reflects the world of the printing press: metal shapes in boxes in cases of different types. Our modern world instead has dropdown menus, and the old words no longer match it. “Font” can now mean font, typeface, or type, and we say a font contains several different weights (like “bold” and “normal”), several different styles (like “italic” and “roman”, which is what not-italic is called), and arbitrary sizes. Welcome to the world of magic ink.

Each of those terms comes with caveats. “Font family” is vaguer still, and can refer to larger or smaller collections of types. Besides “bold” and “normal” there are sometimes other weights as well, like “light”, “semibold”, “black”, and “condensed”; good fonts tend to come in many weights. Sometimes there are other options as well, like maybe there’s a small-caps version; these are sometimes called options as well. And don’t get me started on automatic versus manual italics. As for sizes, fonts look especially good at certain sizes, where hints tell the computer how best to align them to the pixel grid.

Yet Tk’s font objects correspond to the older meaning of font: a type at a fixed size, style, and weight. (You can only create Font objects, or any other kinds of Tk objects, after calling tkinter.Tk(), which is why I’m putting this code in the Browser constructor.) For example:

import tkinter.font

class Browser:
    def __init__(self):
        # ...
        bi_times = tkinter.font.Font(
            family="Times",
            size=16,
            weight="bold",
            slant="italic",
        )

Your computer might not have “Times” installed; you can list the available fonts with tkinter.font.families() and pick something else.

Font objects can be passed to create_text’s font argument:

canvas.create_text(200, 100, text="Hi!", font=bi_times)

In the olden times, American type setters kept their boxes of metal shapes arranged in a California job case, which combined lower- and upper-case letters side by side in one case, making type setting easier. The upper-/lower-case nomenclature dates from centuries earlier.

Measuring text

Text takes up space vertically and horizontally, and the font object’s metrics and measure methods measure that space, as in the session below. On your computer, you might get different numbers. That’s right: text rendering is OS-dependent, because it is complex enough that everyone uses one of a few libraries to do it, usually libraries that ship with the OS. That’s why macOS fonts tend to be “blurrier” than the same font on Windows: different libraries make different trade-offs.

>>> bi_times.metrics()
{'ascent': 15, 'descent': 7, 'linespace': 22, 'fixed': 0}
>>> bi_times.measure("Hi!")
31

The metrics call yields information about the vertical dimensions of the text: the linespace is how tall the text is, which includes an ascent which goes “above the line” and a descent that goes “below the line”. (The fixed parameter is actually a boolean and tells you whether all letters are the same width, so it doesn’t really fit here.) The ascent and descent matter when words in different sizes sit on the same line: they ought to line up “along the line”, not along their tops or bottoms.

Let’s dig deeper. Remember that bi_times is size-16 Times: why does font.metrics report that it is actually 22 pixels tall? Well, first of all, size-16 meant sixteen points, which are defined as 72nds of an inch, not sixteen pixels, which your monitor probably has around 100 of per inch. (Tk doesn’t use points anywhere else in its API. It’s supposed to use pixels if you pass it a negative number, but that doesn’t appear to work.) Those sixteen points measure not the individual letters but the metal blocks the letters were once carved from, which by necessity were larger than the letters themselves. In fact, different size-16 fonts have letters of varying heights:

>>> tkinter.font.Font(family="Courier", size=16).metrics()
{'fixed': 1, 'ascent': 13, 'descent': 4, 'linespace': 17}
>>> tkinter.font.Font(family="Times", size=16).metrics()
{'fixed': 0, 'ascent': 14, 'descent': 4, 'linespace': 18}
>>> tkinter.font.Font(family="Helvetica", size=16).metrics()
{'fixed': 0, 'ascent': 15, 'descent': 4, 'linespace': 19}
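To make the point-versus-pixel arithmetic concrete, here is a small sketch. The function name points_to_pixels and the figure of 100 pixels per inch are assumptions for illustration only; the real density depends on your screen and on how Tk is configured.

```python
def points_to_pixels(points, pixels_per_inch):
    # A point is 1/72 of an inch, so scale by the screen's pixel density.
    return points * pixels_per_inch / 72

# Assuming a screen with roughly 100 pixels per inch:
height = points_to_pixels(16, 100)
print(round(height, 1))  # 22.2
```

That happens to land near the 22 pixels reported for bi_times, but as the metrics above show, the measured line height also depends on the font itself.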

The measure() method is more direct: it tells you how much horizontal space text takes up, in pixels. This depends on the text, of course, since different letters have different widths, as the session below shows. (The sum at the end of this snippet may not work on your machine: the width of a word is not always the sum of the widths of its letters. One reason is that Tk uses fractional pixels internally, but rounds up to return whole pixels. Another is that some fonts use something called kerning to shift letters a little bit when particular pairs of letters are next to one another.)

>>> bi_times.measure("Hi!")
31
>>> bi_times.measure("H")
17
>>> bi_times.measure("i")
6
>>> bi_times.measure("!")
8
>>> 17 + 8 + 6
31

You can use this information to lay text out on the page. For example, suppose you want to draw the text “Hello, world!” in two pieces, so that “world!” is italic. Let’s use two fonts:

font1 = tkinter.font.Font(family="Times", size=16)
font2 = tkinter.font.Font(family="Times", size=16, slant='italic')

We can now lay out the text, starting at (200, 200):

x, y = 200, 200
canvas.create_text(x, y, text="Hello, ", font=font1)
x += font1.measure("Hello, ")
canvas.create_text(x, y, text="world!", font=font2)

You should see “Hello,” and “world!”, correctly aligned and with the second word italicized.

Unfortunately, this code has a bug, though one masked by the choice of example text: replace “world!” with “overlapping!” and the two words will overlap. That’s because the coordinates x and y that you pass to create_text tell Tk where to put the center of the text. It only worked for “Hello, world!” because “Hello,” and “world!” are the same length!

Luckily, the meaning of the coordinate you pass in is configurable. We can instruct Tk to treat the coordinate we gave as the top-left corner of the text by setting the anchor argument to "nw", meaning the “northwest” corner of the text:

x, y = 200, 225
canvas.create_text(x, y, text="Hello, ", font=font1, anchor='nw')
x += font1.measure("Hello, ")
canvas.create_text(x, y, text="overlapping!", font=font2, anchor='nw')

Modify the draw function to set anchor to "nw"; we didn’t need to do that in the previous chapter because all Chinese characters are the same width.
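The difference between the two anchors can be checked with plain arithmetic. The sketch below does not call Tk at all: horizontal_extent is a hypothetical helper, and the widths are made-up stand-ins for what font1.measure and font2.measure might return.

```python
def horizontal_extent(x, width, anchor):
    # Return the (left, right) edges of text drawn at x with this anchor.
    if anchor == "center":
        return (x - width / 2, x + width / 2)
    elif anchor == "nw":
        return (x, x + width)
    raise ValueError("unsupported anchor: " + anchor)

# Made-up widths, standing in for font1.measure("Hello, ") and
# font2.measure("overlapping!").
hello_w, over_w = 50, 90

for anchor in ["center", "nw"]:
    _, hello_right = horizontal_extent(200, hello_w, anchor)
    over_left, _ = horizontal_extent(200 + hello_w, over_w, anchor)
    print(anchor, "overlap:", max(0, hello_right - over_left))
```

With centered anchors the second word starts 20 pixels before the first one ends; with "nw" the overlap is zero. If both widths were equal, the centered version would also show no overlap, which is why the original example looked correct.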

If you find font metrics confusing, you’re not the only one! In 2012, the Michigan Supreme Court heard Stand Up for Democracy v. Secretary of State, a case that centered on the definition of font size. The court decided (correctly) that font size is the size of the metal blocks that letters were carved from and not the size of the letters themselves.

Word by word

In the last chapter, the layout function looped over the text character-by-character and moved to the next line whenever we ran out of space. That’s appropriate in Chinese, where each character more or less is a word. But in English you can’t move to the next line in the middle of a word. Instead, we need to lay out the text one word at a time. (This code splits words on whitespace. It’ll thus break on Chinese, since there won’t be whitespace between words. Real browsers use language-dependent rules for laying out text, including for identifying word boundaries.)

w = font.measure(word)
if cursor_x + w > WIDTH - HSTEP:
    cursor_y += font.metrics("linespace") * 1.25
    cursor_x = HSTEP
self.display_list.append((cursor_x, cursor_y, word))
cursor_x += w + font.measure(" ")

There are a lot of moving parts to this code. First, we measure the width of the word, and store it in w. We’d normally draw the word at cursor_x, so its right end would be at cursor_x + w; we check whether that’s past the edge of the page, and if so, move down to the start of the next line. Now we have the location to start drawing the word, so we add it to the display list; and finally we update cursor_x to point to the end of the word.

There are a few surprises in this code. One is that I call metrics with an argument; that just returns the named metric directly. Also, I increment cursor_x by w + font.measure(" ") instead of w. That’s because I want to have spaces between the words: the call to split() removed all of the whitespace, and this adds it back. I don’t add the space to w on the second line, though, because you don’t need a space after the last word on a line.
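Here is a minimal, self-contained sketch of the same wrapping logic that runs without a Tk window. FakeFont is a hypothetical stand-in for tkinter.font.Font (every character 10 pixels wide, every line 20 pixels tall), and the WIDTH, HSTEP, and VSTEP values are chosen only to force a line break.

```python
WIDTH, HSTEP, VSTEP = 200, 13, 18

class FakeFont:
    # Hypothetical stand-in for tkinter.font.Font: every character is
    # 10 pixels wide and every line is 20 pixels tall.
    def measure(self, text):
        return 10 * len(text)

    def metrics(self, name):
        return {"linespace": 20}[name]

def layout(text, font):
    display_list = []
    cursor_x, cursor_y = HSTEP, VSTEP
    for word in text.split():
        w = font.measure(word)
        # Wrap before placing a word that would cross the right margin.
        if cursor_x + w > WIDTH - HSTEP:
            cursor_y += font.metrics("linespace") * 1.25
            cursor_x = HSTEP
        display_list.append((cursor_x, cursor_y, word))
        # Advance past the word plus one space.
        cursor_x += w + font.measure(" ")
    return display_list

result = layout("the quick brown fox jumps", FakeFont())
for entry in result:
    print(entry)
```

With these numbers, “fox” would end at x = 203, past the limit of 187, so it moves to the start of the second line at y = 43.0.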

Finally, note that I multiply the linespace by 1.25 when incrementing cursor_y. Try removing the multiplier: you’ll see that the text is harder to read because the lines are too close together. (Designers say the text is too “tight”.) Instead, it is common to add “line spacing” or “leading” between lines. (It is so named because in metal type days, thin pieces of lead were placed between the lines to space them out. Lead is a softer metal than what the actual letter pieces were made of, so it could compress a little to keep pressure on the other pieces. Pronounce it “led-ing” not “leed-ing”.) The 25% line spacing is a normal amount.

Breaking lines in the middle of a word is called hyphenation, and can be turned on via the hyphens CSS property. To implement it, browsers use the Knuth-Liang hyphenation algorithm, which uses a dictionary of word fragments to prioritize possible hyphenation points.

Styling text

Right now, all of the text on the page is drawn with one font. But web pages sometimes bold or italicize text using the <b> and <i> tags. It’d be nice to support that, but right now, the code resists the change: the layout function only receives the text of the page as input, and so has no idea where the bold and italics tags are.

Let’s change lex to return a list of tokens, where a token is either a Text object (for a run of characters outside a tag) or a Tag object (for the contents of a tag). (If you’re familiar with Python, you might want to use the dataclasses module, which makes it easier to define these sorts of utility classes.) You’ll need to write the Text and Tag classes:

class Text:
    def __init__(self, text):
        self.text = text

class Tag:
    def __init__(self, tag):
        self.tag = tag
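For readers who prefer the standard library’s dataclasses module, the same two classes might be sketched as follows. This is an alternative to the definitions above, not what the rest of the chapter assumes; the field names simply mirror the ones used there.

```python
from dataclasses import dataclass

@dataclass
class Text:
    text: str

@dataclass
class Tag:
    tag: str

print(Text("Hi!"))            # Text(text='Hi!')
print(Tag("b") == Tag("b"))   # True
```

The decorator generates __init__, __repr__, and __eq__, which makes tokens easier to print and compare while debugging.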

lex must now gather text into Text and Tag objects. (If you’ve done exercises in prior chapters, your code will look different. Code snippets in the book always assume you haven’t done the exercises, so you’ll need to port your modifications.)

def lex(body):
    out = []
    text = ""
    in_tag = False
    for c in body:
        if c == "<":
            in_tag = True
            if text: out.append(Text(text))
            text = ""
        elif c == ">":
            in_tag = False
            out.append(Tag(text))
            text = ""
        else:
            text += c
    if not in_tag and text:
        out.append(Text(text))
    return out

At the end of the loop, lex dumps any accumulated text as a Text object. Otherwise, if you never saw an angle bracket, you’d return an empty list of tokens. But unfinished tags, like in Hi!<hr, are thrown out. (This may strike you as an odd decision: why not raise an error, or finish up the tag for the author? Good questions, but dropping the tag is what browsers do.)

Note that Text and Tag are asymmetric: lex avoids empty Text objects, but not empty Tag objects. That’s because an empty Tag object represents the HTML code <>, while an empty Text object with empty text represents no content at all.
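To see these rules in action, the sketch below repeats the Text and Tag classes and the lex function from above and runs them on a few inputs. The describe helper is hypothetical, added only to make tokens printable.

```python
class Text:
    def __init__(self, text):
        self.text = text

class Tag:
    def __init__(self, tag):
        self.tag = tag

def lex(body):
    out = []
    text = ""
    in_tag = False
    for c in body:
        if c == "<":
            in_tag = True
            if text: out.append(Text(text))
            text = ""
        elif c == ">":
            in_tag = False
            out.append(Tag(text))
            text = ""
        else:
            text += c
    if not in_tag and text:
        out.append(Text(text))
    return out

def describe(tokens):
    # Hypothetical helper: turn tokens into printable (kind, contents) pairs.
    return [
        ("Text", tok.text) if isinstance(tok, Text) else ("Tag", tok.tag)
        for tok in tokens
    ]

print(describe(lex("Hi <b>there</b>!")))
print(describe(lex("Hi!<hr")))   # the unfinished tag is dropped
print(describe(lex("<>")))       # an empty Tag, but never an empty Text
```

The first input yields five tokens (three Text, two Tag), the second yields only the Text token for “Hi!”, and the third yields a single Tag whose contents are the empty string.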

Since we’ve modified lex we are now passing layout not just the text of the page, but also the tags in it. So layout must loop over tokens, not text:

def layout(tokens):
    # ...
    for tok in tokens:
        if isinstance(tok, Text):
            for word in tok.text.split():
                # ...
    # ...

layout can also examine tag tokens to change font when directed by the page. Let’s start with support for weights and styles, with two corresponding variables:

weight = "normal"
style = "roman"

Those variables must change when the bold and italics open and close tags are seen:

if isinstance(tok, Text):
    # ...
elif tok.tag == "i":
    style = "italic"
elif tok.tag == "/i":
    style = "roman"
elif tok.tag == "b":
    weight = "bold"
elif tok.tag == "/b":
    weight = "normal"

Note that this code correctly handles not only <b>bold</b> and <i>italic</i> text, but also <b><i>bold italic</i></b> text. (It even handles mis-nested tags like <b>b<i>bi</b>i</i>, but it does not handle <b><b>twice</b>bolded</b> text. We’ll return to both in the next chapter.)
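The behavior on mis-nested and doubly nested tags can be traced by replaying the branches above on a list of tag names. The styles function below is a hypothetical helper written for this illustration; it records the weight and style in effect after each tag.

```python
def styles(tags):
    # Replay the tag-handling branches and record the state after each tag.
    weight, style = "normal", "roman"
    history = []
    for tag in tags:
        if tag == "i":
            style = "italic"
        elif tag == "/i":
            style = "roman"
        elif tag == "b":
            weight = "bold"
        elif tag == "/b":
            weight = "normal"
        history.append((tag, weight, style))
    return history

# Mis-nested: <b>b<i>bi</b>i</i>
print(styles(["b", "i", "/b", "/i"]))
# Doubly nested: <b><b>twice</b>bolded</b>
print(styles(["b", "b", "/b", "/b"]))
```

In the second trace the weight is already back to "normal" after the first closing tag, so the word “bolded” would be drawn in the normal weight even though it still sits inside the outer bold tag.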

The weight and style variables are used to select the font. Since the font is computed in layout but used in draw, we’ll need to add the font used to each entry in the display list.

if isinstance(tok, Text):
    for word in tok.text.split():
        font = tkinter.font.Font(
            size=16,
            weight=weight,
            slant=style,
        )
        # ...
        display_list.append((cursor_x, cursor_y, word, font))

Make sure to update draw to expect and use this extra font field in display list entries.

Italic fonts were developed in Italy (hence the name) to mimic a cursive handwriting style called “chancery hand”. Non-italic fonts are called roman because they mimic text on Roman monuments. There is an obscure third option: oblique fonts, which look like roman fonts but are slanted.

A layout object

With all of these tags, layout has become quite large, with lots of local variables and some complicated control flow. That is one sign that something deserves to be a class, not a function:

class Layout:
    def __init__(self, tokens):
        self.display_list = []

Every local variable in layout then becomes a field of Layout:

self.cursor_x = HSTEP
self.cursor_y = VSTEP
self.weight = "normal"
self.style = "roman"
self.size = 16

The core of the old layout is a loop over tokens, and we can move the body of that loop to a method on Layout:

def __init__(self, tokens):
    # ...
    for tok in tokens:
        self.token(tok)

def token(self, tok):
    if isinstance(tok, Text):
        for word in tok.text.split():
            # ...
    elif tok.tag == "i":
        self.style = "italic"
    # ...

In fact, the body of the isinstance(tok, Text) branch can be moved to its own method:

def word(self, word):
    font = tkinter.font.Font(
        size=16,
        weight=self.weight,
        slant=self.style,
    )
    w = font.measure(word)
    # ...

Now that everything has moved out of Browser’s old layout function, it can be replaced with calls into Layout:

class Browser:
    def load(self, url):
        headers, body = url.request()
        tokens = lex(body)
        self.display_list = Layout(tokens).display_list
        self.draw()

When you do big refactors like this, it’s important to work incrementally. It might seem more efficient to change everything at once, but that efficiency brings with it a risk of failure: trying to do so much that you get confused and have to abandon the whole refactor.

Anyway, this refactor isolated all of the text-handling code into its own method, with the main token function just branching on the tag name. Let’s take advantage of the new, cleaner organization to add more tags. With font weights and styles working, size is the next frontier in typographic sophistication. One simple way to change font size is the <small> tag and its deprecated sister tag <big>. (In your web design projects, use the CSS font-size property to change text size instead of <big> and <small>. But since we haven’t implemented CSS for our browser, we’re stuck using them here.)

Our experience with font styles and weights suggests a simple approach. First, a field in Layout to track font size:

self.size = 16

That variable is used to create the font object:

font = tkinter.font.Font(
    size=self.size,
    weight=self.weight,
    slant=self.style,
)

Finally, the <big> and <small> tags change the value of size:

def token(self, tok):
    # ...
    elif tok.tag == "small":
        self.size -= 2
    elif tok.tag == "/small":
        self.size += 2
    elif tok.tag == "big":
        self.size += 4
    elif tok.tag == "/big":
        self.size -= 4

Try wrapping a whole paragraph in <small>, like you would a bit of fine print, and enjoy your newfound typographical freedom.
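Because these branches adjust the size relative to its current value, nested size tags compose and cancel out cleanly. The sizes function below is a hypothetical helper that replays the branches on a list of tag names and records the size after each one.

```python
def sizes(tags, size=16):
    # Replay the size-changing branches and record the size after each tag.
    history = []
    for tag in tags:
        if tag == "small":
            size -= 2
        elif tag == "/small":
            size += 2
        elif tag == "big":
            size += 4
        elif tag == "/big":
            size -= 4
        history.append(size)
    return history

print(sizes(["small", "small", "/small", "/small"]))  # [14, 12, 14, 16]
print(sizes(["big", "small", "/small", "/big"]))      # [20, 18, 20, 16]
```

Unlike the weight and style variables, which are overwritten with fixed values, the size returns to 16 once every tag has been closed, however deeply the tags were nested.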

All of <b>, <i>, <big>, and <small> date from an earlier, pre-CSS era of the web. Since CSS can now change how those tags appear, <b>, <i>, and <small> have hair-splitting appearance-independent meanings.

Text of different sizes

Start mixing font sizes, like <small>a</small><big>A</big>, and you’ll quickly notice a problem with the font size code: the text is aligned along its top, not “along the line”, as if it’s hanging from a clothes line.

Let’s think through how to fix this. If the big text is moved up, it would overlap with the previous line, so the smaller text has to be moved down. That means its vertical position has to be computed later, after the big text passes through token. But since the small text comes through the loop first, we need a two-pass algorithm for lines of text: the first pass identifies what words go in the line and computes their x positions, while the second pass vertically aligns the words and computes their y positions.

Let’s start with phase one. Since one line contains text from many tags, we need a field on Layout to store the line-to-be. That field, line, will be a list, and the word method will add words to it instead of the display list. Entries in line will have x but not y positions, since y positions aren’t computed in the first phase:

class Layout:
    def __init__(self, tokens):
        # ...
        self.line = []
        # ...
    
    def word(self, word):
        # ...
        self.line.append((self.cursor_x, word, font))

The new line field is essentially a buffer, where words are held temporarily before they can be placed. The second phase is that buffer being flushed when we’re finished with a line:

class Layout:
    def word(self, word):
        # ... (font and w are computed as before)
        if self.cursor_x + w > WIDTH - HSTEP:
            self.flush()

As usual with buffers, we also need to make sure the buffer is flushed once all tokens are processed:

class Layout:
    def __init__(self, tokens):
        # ...
        self.flush()

This new flush function has three responsibilities:

  1. It must align the words along the line;
  2. It must add all those words to the display list; and
  3. It must update the cursor_x and cursor_y fields.

Here’s what it looks like, step by step:

Since we want words to line up “on the line”, let’s start by computing where that line should be. That depends on the metrics for all the fonts involved:

def flush(self):
    if not self.line: return
    metrics = [font.metrics() for x, word, font in self.line]

We need to locate the tallest word:

max_ascent = max([metric["ascent"] for metric in metrics])

The line is then max_ascent below self.cursor_y, or actually a little more to account for the leading. (Actually, 25% leading doesn’t add 25% of the ascender above the ascender and 25% of the descender below the descender. Instead, it adds 12.5% of the line height in both places, which is subtly different when fonts are mixed. But let’s skip that subtlety here.)

baseline = self.cursor_y + 1.25 * max_ascent

Now that we know where the line is, we can place each word relative to that line and add it to the display list:

for x, word, font in self.line:
    y = baseline - font.metrics("ascent")
    self.display_list.append((x, y, word, font))

Note how y starts at the baseline, and moves up by just enough to accommodate that word’s ascender.

Finally, flush must update the Layout’s cursor_x, cursor_y, and line fields. cursor_x and line are easy:

self.cursor_x = HSTEP
self.line = []

Meanwhile, cursor_y must be far enough below baseline to account for the deepest descender:

max_descent = max([metric["descent"] for metric in metrics])
self.cursor_y = baseline + 1.25 * max_descent
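The arithmetic in flush can be checked on its own, without Tk. The sketch below is a simplified stand-in, not the Layout method itself: place_line is a hypothetical function, each line entry carries a metrics dictionary instead of a Font object, and the ascent and descent values are made up.

```python
def place_line(cursor_y, line):
    # line holds (x, word, metrics) entries; metrics is a dict shaped like
    # the one returned by Font.metrics().
    max_ascent = max(m["ascent"] for x, word, m in line)
    max_descent = max(m["descent"] for x, word, m in line)
    baseline = cursor_y + 1.25 * max_ascent
    # Each word's top sits its own ascent above the shared baseline.
    placed = [(x, baseline - m["ascent"], word) for x, word, m in line]
    return placed, baseline + 1.25 * max_descent

small = {"ascent": 12, "descent": 3}   # made-up metrics
big = {"ascent": 16, "descent": 4}     # made-up metrics
placed, next_y = place_line(18, [(13, "a", small), (25, "A", big)])
print(placed)   # [(13, 26.0, 'a'), (25, 22.0, 'A')]
print(next_y)   # 43.0
```

The baseline here is 18 + 1.25 × 16 = 38, so the small word starts lower (y = 26) than the big word (y = 22), and both end up sitting on the same line.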

Now all the text is aligned along the line, even when text sizes are mixed. Plus, this new flush function is convenient for other line breaking jobs. For example, in HTML the <br> tag ends the current line and starts a new one. (It is a self-closing tag, so there’s no </br>. Many tags that are content, instead of annotating it, are like this. Some people like adding a final slash to self-closing tags, like <br/>, but this is not required in HTML.)

def token(self, tok):
    # ...
    elif tok.tag == "br":
        self.flush()

Likewise, paragraphs are defined by the <p> and </p> tags, so </p> also ends the current line:

def token(self, tok):
    # ...
    elif tok.tag == "/p":
        self.flush()
        self.cursor_y += VSTEP

I add a bit extra to cursor_y here to create a little gap between paragraphs.

Actually, browsers support not only horizontal but also vertical writing systems, like some traditional East Asian writing styles. A particular challenge is Mongolian script.

Faster text layout

Now that you’ve implemented styled text, you’ve probably noticed (unless you’re on macOS) that on a large web page like this chapter your browser has slowed significantly from the last chapter. That’s because text layout, and specifically the part where you measure each word, is quite slow. (You can profile Python programs by replacing your python3 command with python3 -m cProfile. Look for the lines corresponding to the measure and metrics calls to see how much time is spent measuring text.) As for macOS: while we can’t confirm this in the documentation, it seems that the macOS “Core Text” APIs cache fonts more aggressively than Linux and Windows. The optimization described in this section won’t hurt any on macOS, but also won’t improve speed as much as on Windows and Linux.

Unfortunately, it’s hard to make text measurement much faster. With proportional fonts and complex font features like hinting and kerning, measuring text can require pretty complex computations. But on a large web page, some words likely appear a lot—for example, this page includes the word “the” over two hundred times. Instead of measuring these words over and over again, we could measure them once, and then cache the results. On normal English text, this usually results in a substantial speedup.

Caching is such a good idea that most text libraries already implement it. But because our word method creates a new Font object for each word, our browser isn’t taking advantage of that caching. If we only made a new Font object when we had to, the built-in caches would work better and our browser would be faster. So we’ll need our own cache, so that we can reuse Font objects and have our text measurements cached.

We’ll store our cache in a global FONTS dictionary:

FONTS = {}

The keys to this dictionary will be size/weight/style triples, and the values will be Font objects. We can put the caching logic itself in a new get_font function:

def get_font(size, weight, slant):
    key = (size, weight, slant)
    if key not in FONTS:
        font = tkinter.font.Font(size=size, weight=weight, slant=slant)
        FONTS[key] = font
    return FONTS[key]

Now, inside the word method we can call get_font instead of creating a Font object directly:

class Layout:
    def word(self, word):
        font = get_font(self.size, self.weight, self.style)
        # ...
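To confirm that the cache really avoids repeated construction, here is a sketch that swaps in a hypothetical FakeFont class for tkinter.font.Font so it can run without a Tk window. The created list is an assumption of this example, used only to count constructor calls.

```python
FONTS = {}
created = []

class FakeFont:
    # Hypothetical stand-in for tkinter.font.Font; it records each
    # construction so we can count them.
    def __init__(self, size, weight, slant):
        created.append((size, weight, slant))

def get_font(size, weight, slant):
    key = (size, weight, slant)
    if key not in FONTS:
        FONTS[key] = FakeFont(size=size, weight=weight, slant=slant)
    return FONTS[key]

# Eight words in the same style share a single font object.
for word in "the cat and the dog and the bird".split():
    get_font(16, "normal", "roman")
# A different weight is a different key, so it needs a second object.
get_font(16, "bold", "roman")

print(len(created))  # 2
```

Nine lookups produce only two font objects, one per distinct size/weight/style triple.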

Fonts for scripts like Chinese can be megabytes in size, so they are generally stored on disk and only loaded into memory on-demand. That makes font loading slow. Browsers also have extensive caches for measuring, shaping, and rendering text. Because web pages have a lot of text, these caches turn out to be one of the most important parts of speeding up rendering.

Summary

The last chapter introduced a browser that laid out Chinese text. Now it does English, too.

You can now use your browser to read an essay, a blog post, or a book!


Outline

The complete set of functions, classes, and methods in our browser should look something like this:

class URL:
    def __init__(url)
    def request()
WIDTH
HEIGHT
HSTEP
VSTEP
SCROLL_STEP
class Browser:
    def __init__()
    def load(url)
    def draw()
    def scrolldown(e)
class Text:
    def __init__(text)
    def __repr__()
class Tag:
    def __init__(tag)
    def __repr__()
def lex(body)
FONTS
def get_font(size, weight, slant)
class Layout:
    def __init__(tokens)
    def token(tok)
    def word(word)
    def flush()
if __name__ == "__main__"

Exercises

Centered Text: This book’s page titles are centered: find them between <h1 class="title"> and </h1>. Make your browser center the text in these titles. Each line has to be centered individually, because different lines will have different lengths.

Superscripts: Add support for the <sup> tag: text in this tag should be smaller (perhaps half the normal text size) and be placed so that the top of a superscript lines up with the top of a normal letter.

Soft hyphens: The soft hyphen character, written \N{soft hyphen} in Python, represents a place where the text renderer can, but doesn’t have to, insert a hyphen and break the word across lines. Add support for it. If a word doesn’t fit at the end of a line, check if it has soft hyphens, and if so break the word across lines. Remember that a word can have multiple soft hyphens in it, and make sure to draw a hyphen when you break a word. The word “super­cala­fraga­listic­expi­ala­do­shus” is a good test case. (If you’ve done a previous exercise on HTML entities, you might also want to add support for the &shy; entity, which expands to a soft hyphen.)

Small caps: Make the <abbr> element render text in small caps, like this. Inside an <abbr> tag, lower-case letters should be small, capitalized, and bold, while all other characters (upper case, numbers, etc) should be drawn in the normal font.

Preformatted text: Add support for the <pre> tag. Unlike normal paragraphs, text inside <pre> tags doesn’t automatically break lines, and whitespace like spaces and newlines are preserved. Use a fixed-width font like Courier New or SFMono as well. Make sure tags work normally inside <pre> tags: it should be possible to bold some text inside a <pre>.
