Lexical Analysis: The Role of the Lexical Analyzer Section 3.1

This chapter was mostly just a recap of things that were already covered about lexers.

3.1.1: Divide the following C++ program into appropriate lexemes. Which lexemes should get associated lexical values? What should those values be?

float limitedSquare(x) float x; {
  /* returns x-squared, but never more than 10 */
  return (x<=-10.0||x>=10.0)?100:x*x;
}

I will answer this all at once in a list. The format will be <lexeme[, value]>.

<float>
<id, "limitedSquare">
<openParen>
<id, "x">
<closeParen>
<float>
<id, "x">
<semiColon>
<openBrace>
Comments should not be returned by the lexer at all
<return>
<openParen>
<id, "x">
<lte>
<minus>
<literal, "10.0"> or if we split into numeric and string literals, 
<numericLiteral, 10.0> Presumably some lexers split further into float/int literals, too.
<logicalOr>
<id, "x">
<gte>
<literal, "10.0">
<closeParen>
<questionMark>
<literal, "100">
<colon>
<id, "x">
<operatorMul>
<id, "x">
<semicolon>
<closeBrace>

3.1.2 (difficult problem): Tagged languages like HTML or XML are different from conventional programming languages in that the punctuation (tags) are either very numerous (as in HTML) or a user-definable set (as in XML). Further, tags can often have parameters. Suggest how to divide the following HTML document:

Here is a photo of <b>my house</b>:
<img alt="" src="house.gif" /><BR>
See <a>More Pictures</a> if you liked that one.

So let’s think about this a little. First I thought, “well, HTML has well-defined tags. Why don’t I just have tokens like <openTag> <closeTag> for each tag? But of course, that ignores self-closing tags like <br /> and <img … />. Further, it ignores attributes.

So my next idea is to have <, </, >, and /> all be separate lexical tokens. I think that’s what I’m going to go with for this example.

To handle attributes, we’re going to pretend that HTML is actually generally valid and that all attributes have specified, double-quoted values, and that code like <input type=”checkbox” checked /> doesn’t have to be handled. So I’m going to go with the following:

<literal, "Here is a photo of ">
<openTag>
<tagName, "b">
<closeTag>
<literal, "my house">
<openCloseTag>
<tagName, "b">
<literal, ":">
<openTag>
<tagName, "img">
<attributeName, "alt">
<equals>
<attributeValue, "">
<attributeName, "src">
<attributeValue, "house.gif">
<closeCloseTag> -- this is a terrible name, I know
<openTag>
<tagName, "br">
<closeTag> -- should be a self-closer, tsk tsk
<literal, "See ">
...

The rest is clear. Some things to think about:

  1. Is assuming the lexer knows, based on context, that things are openTags and attributeNames and so on, unreasonable? I think it’s likely that it in fact is. However, if it’s not useful to separate it like this, then what are they asking me? You can use exactly the same lexer as you do for other source code, except specify that </ and /> are explicit symbols, and there’s nothing to be learned because this example is in HTML
  2. Maybe the better way to do this was to go a little lower. Distinguish between “unquoted strings in between braces” and “unquoted not in between braces,” but not further. So something like
    <text, "Here is a photo of">
    <openTag>
    <id, "b">
    <closeTag>
    <text, "my house">
  3. What’s the appropriate handling of attributes vs attribute values in option 2? Do we present attribute values to the parser as string literals or as some other form? Should it be an ‘id’, or just a regular string literal?
  4. For HTML, each legal tag could be given its own token. That’s probably a good idea. For XML that won’t work, unless we take a first pass at an XML document and use it to generate a lexer with a set of tokens. That would be interesting but I doubt it’s a good idea

 

This entry was posted in computer science, Dragon book. Bookmark the permalink.

Leave a comment