Home Updates Messages SandBox

Wiki Markup Parser in Python

One of the more interesting parts of a wiki engine is the parser: the code that takes the raw text of a page, as entered by the users into the text area of their browsers, and turns it into actual HTML of the web page, as displayed in the "normal" view of the page. Personally I think that parsing is one of the more intriguing topics in computer science, but what makes parsing wiki text especially worth exploring is the nature of the markup: wiki markup is usually not described in a formal way. It is supposed to be simple for humans to use, not for the machines to parse. In the end it appears that most wiki markups in the wild are in fact regular. There might be a few exceptions that are context-free, but that's just because the designer didn't bother to limit the nesting. But the infinite nesting of list elements or quotations is never actually used: in fact its usefulness is constrained by the width of the browser window (you can only do so much indenting). So, in practice, most wiki markups in the wild are regular languages.

This is no mistake. The very first wiki engines were written in languages like Perl, and the parsing was indeed performed using regular expressions. Usually multiple regular expressions, and multiple passes, because it was simpler to write this way. Later, as there were more and more attempts at writing wiki engines in various languages, programmers sometimes attempted to build their own ad-hoc state machines – sometimes more, sometimes less formal. Some of them even have a stack to keep track of generated HTML elements – but that's because HTML is context-free, not because the language parsed was. I think that to this day the prevailing technique is to use regular expressions in two or more passes.

The first wiki engine I ever explored in depth was PikiPiki. Written in python, using simple files for its database, and using regular expressions to parse the text:

class PageFormatter:
    """Object that turns Wiki markup into HTML.

    All formatting commands can be parsed one line at a time, though
    some state is carried over between lines.
    def __init__(self, raw):
        self.raw = raw
        self.is_em = self.is_b = 0
        self.list_indents = []
        self.in_pre = 0

    def _emph_repl(self, word):
        if len(word) == 3:
            self.is_b = not self.is_b
            return ['</b>', '<b>'][self.is_b]
            self.is_em = not self.is_em
            return ['</em>', '<em>'][self.is_em]

    def _rule_repl(self, word):
        s = self._undent()
        if len(word) <= 4:
            s = s + "\n<hr>\n"
            s = s + "\n<hr size=%d>\n" % (len(word) - 2 )
        return s

    def _word_repl(self, word):
        return Page(word).link_to()

    def _url_repl(self, word):
        return '<a href ="%s">%s</a>' % (word, word)

    def _email_repl(self, word):
        return '<a href ="mailto:%s">%s</a>' % (word, word)

    def _ent_repl(self, s):
        return {'&': '&amp;',
                '<': '&lt;',
                '>': '&gt;'}[s]

    def _li_repl(self, match):
        return '<li>'

    def _pre_repl(self, word):
        if word == '{{{' and not self.in_pre:
            self.in_pre = 1
            return '<pre>'
        elif self.in_pre:
            self.in_pre = 0
            return '</pre>'
            return ''

    def _macro_repl(self, word):
        macro_name = word[2:-2]
        # TODO: Somehow get the default value into the search field
        return apply(globals()['_macro_' + macro_name], ())

    def _indent_level(self):
        return len(self.list_indents) and self.list_indents[-1]

    def _indent_to(self, new_level):
        s = ''
        while self._indent_level() > new_level:
            s = s + '</ul>\n'
        while self._indent_level() < new_level:
            s = s + '<ul>\n'
        return s

    def _undent(self):
        res = '</ul>' * len(self.list_indents)
        self.list_indents = []
        return res

    def replace(self, match):
        for type, hit in match.groupdict().items():
            if hit:
                return apply(getattr(self, '_' + type + '_repl'), (hit,))
            raise "Can't handle match " + `match`

    def print_html(self):
        # For each line, we scan through looking for magic
        # strings, outputting verbatim any intervening text
        scan_re = re.compile(
            + r"|(?P<ent>[<>&])"
            + r"|(?P<word>\b(?:[A-Z][a-z]+){2,}\b)"
            + r"|(?P<rule>-{4,})"
            + r"|(?P<url>(http|ftp|nntp|news|mailto)\:[^\s'\"]+\S)"
            + r"|(?P<email>[-\w._+]+\@[\w.-]+)"
            + r"|(?P<li>^\s+\*)"
            + r"|(?P<pre>(\{\{\{|\}\}\}))"
            + r"|(?P<macro>\[\[(TitleSearch|FullSearch|WordIndex"
                            + r"|TitleIndex|RecentChanges|GoTo)\]\])"
            + r")")
        blank_re = re.compile("^\s*$")
        bullet_re = re.compile("^\s+\*")
        indent_re = re.compile("^\s*")
        eol_re = re.compile(r'\r?\n')
        raw = string.expandtabs(self.raw)
        for line in eol_re.split(raw):
            if not self.in_pre:
                # XXX: Should we check these conditions in this order?
                if blank_re.match(line):
                    print '<p>'
                indent = indent_re.match(line)
                print self._indent_to(len(indent.group(0)))
            print re.sub(scan_re, self.replace, line)
        if self.in_pre: print '</pre>'
        print self._undent()

The parser is line-based. It has some external state, like the level of indentation and the flags in_pre, in_em and in_b telling it whether it's inside these kinds of blocks – they are needed to carry the state between the lines – but most of the work is done using the scan_re regular expression. So, we have a kind of mixed approach here. On the line level (we will call it "block-level"), we have an ad-hoc state machine, but on the character level (or "text-level"), there is a regular expression.

The use of re.sub() in this parser is particularly interesting. Normally, that method will replace all matches of th pattern provided with its second argument. But in this case the second argument is a function – so that function will be called for every match, and the match will be replaced with whatever the function returns. The function gets the match object as a parameter, so it can see what was actually matched. The match object contains, among other things, all the groups from the regular expression: those are the bits enclosed in (?P<name> ...), where "name" is the group's name. The replace function will check those groups and call appropriate _*_repl() functions, that are responsible for generating actual HTML code for various elements of the wiki language. Simple, eh?

This parser doesn't always generate valid HTML – mostly due to sloppiness of the block-level code: it won't close <p> tags or close any inline tags before starting a new block tag. It also will not escape the "&", ">" and "<" characters in URLs – probably because browsers are forgiving to this kind of error.

What I would like to do is removing the "ad-hoc parser" from this design. I can see two ways of doing it: replacing it with another regular expression, using roughly the same technique that was used for text-level parsing; and removing the block-level/text-level separation entirely, modifying the text-level regular expression to also handle separation into lines and blocks. The latter approach has one large advantage: the text is only ever scanned once, whereas with the two other approaches (the above one and the "two regular expressions" one), the text is scanned at least twice: once to split it into lines or blocks, and second time to apply the text-level regular expression. This approach has also one huge disadvantage: the regular expression becomes ridiculously complicated, and the whole thing becomes pretty hard to understand and debug.

Two-level parsing with regular expressions seem to be much easier and straightforward. That's what I used in my Oink:WikiCreole parser in python for MoinMoin, only I was forced to introduce an additional third level in two places: inside lists and tables. Not that it couldn't be avoided – I just decided that keeping it on two levels wasn't worth complicating the regular expressions. It's still O(n), after all. Oh, and I use the generated DOM tree to keep some of the parser's state. This also could be avoided at the expense of complicating the regular expression. Oh well.

I found another way of writing a wiki parser. You might remember the technique I described previously, that used a regular expression to divide the text into blocks, then several other regular expressions to parse these blocks separately. It has the advantage of being relatively fast, as it uses built-in features, but it also has several disadvantages: you need to put whole text into memory as a string, as python's regular expressions can't work on any iterables, you cannot tell which line of generated html corresponds to given line of input text, debugging might require a lot of experience about regular expressions.

Today I attempted to write a similar parser, but make it line-based, so that you can give it any iterable of lines as input, and it will produce as much output as possible right away.

First, we need a function that, given a line of text, would determine what kind of block that line belongs to. There are several possibilities: normal paragraph, empty line that separates paragraphs, bullet list, indented text, heading, block of code, etc. Except for the code block (and later block-level macros), all lines can be recognized on their own, without any need to keep state. We can use something like this:

heading_re = re.compile(ur"^\s*=+.*=+$", re.U)
bullets_re = re.compile(ur"^\s*\*\s+", re.U)
empty_re = re.compile(ur"^\s*$", re.U)
code_re = re.compile(ur"^\{\{\{+\s*$", re.U)
indent_re = re.compile(ur"^\s+", re.U)

def get_line_kind(self, line):
    if self.heading_re.match(line):
        return "heading"
    elif self.bullets_re.match(line):
        return "bullets"
    elif self.empty_re.match(line):
        return "empty"
    elif self.code_re.match(line):
        return "code"
    elif self.indent_re.match(line):
        return "indent"
        return "paragraph"

where the *_re attributes are regular expressions for recognizing various kinds of lines. This code is pretty clumsy, we can do better than that:

block = {
    "bullets": ur"^\s*[*]\s+",
    "code": ur"^[{][{][{]+\s*$",
    "empty": ur"^\s*$",
    "heading": ur"^\s*=+.*=+$",
    "indent": ur"^[ \t]+",
} # note that the priority is alphabetical
block_re = re.compile(ur"|".join("(?P<%s>%s)" % kv
                      for kv in sorted(block.iteritems())))
def get_line_kind(self, line):
    match = self.block_re.match(line)
    if match:
        return match.lastgroup
        return "paragraph"

this code does roughly the same thing, only more efficiently – it only scans the line at most once. I guess you can tell by now that I have a certain weakness for making huge regular expressions from dicts :) Anyways, the function is still a bit clumsy, we can make it a one-liner, using the fact that None dosn't have a lastgroup attribute:

def get_line_kind(self, line):
    return getattr(self.block_re.match(line), "lastgroup", "paragraph")

Note that the "code" block is not really complete – we only recognize the first line of it. That's because we will be eating up the rest of lines for it independently. Once we have the lines categorized, we can try the first crude parser:

def block_paragraph(self, line):
    return "<p>%s</p>" % line


def parse(self):
    for line in self.lines:
        kind = self.get_line_kind(line)
        func = getattr(self, "block_%s" % kind)
        yield func(line)

This will call various self.block_* methods depending on the kind of the line, and yield their results. Looks good, but there is a problem. Let's test it with simple text:

Lorem ipsum dolor sit amet, consectetuer
adipiscing elit, sed diam nonummy nibh
euismod tincidunt ut laoreet dolore magna
aliquam erat volutpat.

Ut wisi enim ad minim veniam, quis nostrud
exerci tation ullamcorper suscipit lobortis
nisl ut aliquip ex ea commodo consequat.

the result is something like this:

<p>Lorem ipsum dolor sit amet, consectetuer
</p><p>adipiscing elit, sed diam nonummy nibh
</p><p>euismod tincidunt ut laoreet dolore magna
</p><p>aliquam erat volutpat.

<p>Ut wisi enim ad minim veniam, quis nostrud
</p><p>exerci tation ullamcorper suscipit lobortis
</p><p>nisl ut aliquip ex ea commodo consequat.

What happened? Every line is a separate paragraph, and we would like to have them grouped: as long as the consecutive lines are of the same kind, they should be in the same block. We can achieve it with the itertools.groupby iterator:

def block_paragraph(self, block):
    yield u'<p>'
    for line in block:
        yield line
    yield u'</p>'


def parse(self):
    for kind, block in itertools.groupby(self.lines, self.get_line_kind):
        func = getattr(self, "block_%s" % kind)
        for part in func(block):
            yield part

This solves the problem in a pretty elegant way, except for the code block. But we can do a hack for it, and read lines for the code block from the lines iterator directly, just after encountering the code opening line:

    def block_code(self, block):
        for part in block:
            yield u'<pre>'
            line = self.lines.next()
            while not line.startswith("}}}"):
                yield line
                line = self.lines.next()
            yield u'</pre>'

The outer loop takes care of a case when there are several consecutive code blocks. You probably still need to cache the lines and strip them and add your own newlines, so that you don;t get an empty lines at the end – but these are details. You also still need to parse the contents of paragraphs – for inline markup like bold text and links – I think that it's still best done with one huge regexp, just like I described previously.

I hope this helps any wiki developers out there.