HTML Tag Filter

A filter function is needed to filter out all the HTML tags as they are encountered within the text stream of an HTML file. Then a set of functions is required to interpret and expedite each tag type. First the filter.
int TagFilter(             //FILTER HTML TAGS FROM TEXT STREAM
  int c                    //next character from HTML file
) {
  static char tag[256],    //array in which to capture a tag
  *t;                      //pointer to captured tag string
  static int tc,           //count of number of chars in tag
  o,                       //offset of params from start of tag
  e;                       //0 = start tag, 1 = end tag
  if(c == '<') {           //start delimiter for an HTML tag
    t = tag;               //set t to start of tag array
    tc = 1;                //set that we are in an HTML tag
  }
  else if(tc > 0) {        //if we are currently in a tag
    if(c == '>') {         //end-delimiter for an HTML tag
      *t = '\0';           //put null at end of tag string
      tc = 0;              //set that we are in normal text
      TagInt(tag, o, e);   //appropriate tag interpreter
    }
    else if(tc < 256) {    //tag still within length limit
      if(tc == 1)          //if this is first char after <
        if(c == '/')       //if first char is a slash, it
          e = 1;           //indicates that it is an end-tag
        else               //first char is not a slash, this
          e = 0;           //indicates it is a start tag
      else if(c == ' ') {  //if this character is a space
        c = '\0';          //make it a null
        o = tc;            //set offset to this char number
      }
      else if(c > 64 && c < 91)  //if this character is a capital
        c += 32;                 //convert it to lower case
      *t++ = c;            //add char to tag buffer
      tc++;
    }
    else                   //if the tag string is too long
      tc = 0;              //so ignore the whole tag
    c = 0;                 //return a null char if in a tag
  }
  return(c);               //return given character or a null
}
As a character is received from the HTML file, it is passed to the above function. If it is part of the normal text, the above function simply returns it unchanged. However, if it is part of an HTML tag, it captures the character and returns a null.

Match Check

First we need a function to compare a tag captured from the HTML file with a tag in the above array to see if they are the same, higher or lower in alphabetical order.
int MatchCheck(
  char *p,          //ptr to tag name captured from HTML file
  char *q           //ptr to tag name in array of valid tag names
) {
  int c,            //character within captured tag name
      d,            //character within tag name in array
      flag = 0;     //indicates whether same, higher or lower
  do {              //for each char of the shorter tag
    if(( c = *p++) != (d = *q++)) {   //if chars are not the same 
      if(c > d)     //if char in captured tag > char in array tag
        flag = +1;  //set captured tag is higher than array tag
      else          //otherwise
        flag = -1;  //set captured tag is lower than array tag
      break;        //go and return the flag value
    }
  } while(c != 0 && d != 0);     //end loop if either char = null
  return(flag);
}
The do-loop tests at the end rather than at the beginning. This allows the null at the end of the shorter of the two tags to be compared with the corresponding character in the other tag before the loop termination test is done. Thus, if the tags are the same, eg
  base0        captured tag     0 = terminating null character
  base0        tag in array
then the last things compared are the two terminating nulls which are also the same. If the tags are of different lengths but the short tag is the same as the first part of the long one, eg
  base0        captured tag     0 = terminating null character
  basefont0    tag in array
then the terminating null of the shorter one is compared with the corresponding non-null character of the longer one. The inequality caught by the first if-statement causes the loop to be broken with the flag set to -1 or +1 as appropriate. If the tags are altogether different, eg
  base0        captured tag     0 = terminating null character
  body0        tag in array
then the first if-statement captures the first pair of unequal characters (in this case the a and the o) and the loop is again broken with the flag set to -1 or +1 as appropriate.

Tag Search

The next task I have to do is search for the given tag within the list of valid HTML tags held in the tags[] array. As HTML is extended, this list will grow. It is therefore expedient to design this search to be as fast as possible. The search technique I shall adopt is therefore the binary slice search as illustrated below.

The binary slice search starts half way up the alphabetically sorted tag names list. It compares the tag from the HTML file (base) with the middle tag in the array. It does not match. The array element is too high. That is, it is too far along the alphabet (ie too far up the array because 'a' is at the bottom and 'z' at the top). It therefore tries the element half way down the lower half of the array. It does not match this one either. It therefore divides the distance in half again. This time it hits the element containing the basefont tag name. This is still too high. It divides the skip distance in two again and tries that element. Eureka! It hits the element containing the tag name base. It notes its number within the tags array as the means of validating and identifying the HTML tag.

Of course, the direction of the skip to the next element (ie whether it is up or down the array) is determined by whether the tag name from the HTML file is too low or too high compared with the current array element. Furthermore, when the skip distance j is halved, the result is truncated. This is best done simply by right-shifting the amount by one bit in its register. This ensures that the result will never stray out of the range of the array. When the skip distance gets down to 1, it is kept there. It is not divided and truncated to 0 otherwise the index would never move to the next element.

I must now build this binary search into a function which can be called by the filter function. It must, on presentation of a pointer to the HTML tag name captured from the HTML file, return the index number of the given tag name. If it cannot find the tag in its array of valid HTML tags, it must indicate this by returning -1. This allow the array of HTML tag name pointers, tags[] to be private to this function.

My rendering of the tag search function is as follows. The pointer p points to a null terminated string which contains the captured tag's name only: not its arguments if any. There are 57 tag names. Arrays in C begin at element zero. The elements of tags[] are therefore 0 to 56. The mid-point element is number 28. This is the value to which k is initialised. The initial skip distance j is therefore half of this which is 14. These initial values must be updated if tag names are added to the array as HTML is extended.

int TagSearch(char *t, int e) {   //t points to captured tag name
  static char *tags[] = {         //all valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont", 
    "bgsound", "big", "blink", "blockquote", "body", "br", 
    "caption", "center", "cite", "code", "comment", "dd", "dfn", 
    "dir", "div", "dl", "dt", "em", "font", "form", "frame", 
    "frameset", "head", "hn", "hr", "html", "i", "img", "input", 
    "isindex", "kbd", "li", "link", "listing", "map", "marquee", 
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p", 
    "param", "plaintext", "pre", "samp", "select", "small", 
    "strike", "strong", "sub", "sup", "table", "td", "textarea", 
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  int k = 36,               //start point of search within array
      j = 18,               //first index shift amount
      x = -1;               //returned index number of found tag
  while(k >= 0 && k < 71) { //while index within array range
    if((int y = MatchCheck(t, tags + k) == 0) {
      x = k;                //set index number of the found tag
      break;                //break out of the while loop
    } else {                //if this array element did not match
      if(y > 0)             //if given tag was higher in alphabet
        k -= j;             //shift down the array
      else
        k += j;             //otherwise shift up the array
      if(j > 1)             //if the shift increment > 1 then
        j >>= 1;            //halve it, otherwise leave it = 1
    }
  }
  return(x);                //return index number of given tag
}                           //this is -1 if tag was not found
The while loop terminates the search if the given tag is not found. If the given tag matches that pointed to by the kth element of tags[], MatchCheck() returns a zero. In this case, the while loop is broken to return the element number of the validated tag. If MatchCheck() returns a positive value, it means the given tag name was further up the alphabet than the kth element of tags[]. I therefore shift k down the array by an amount j. If MatchCheck() returns a negative value, I do the opposite. I then halve j by right-shifting it by one bit position ready for the next pass of the while loop.

Combined Function

MatchCheck() is called repeatedly within the while loop of TargSearch(). For the sake of speed, it is therefore expedient to see whether or not MatchCheck() can be incorporated into TagSearch() and thereby reduce the overall amount of code needed. And as can be seen below, indeed it can. In fact, in the event of a mis-match, the index k is now shifted inside the central do loop just before it is broken.
int TagSearch(char *t, int e) {  //points to name of captured tag
  static char *tags[] = {        //all valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont", 
    "bgsound", "big", "blink", "blockquote", "body", "br", 
    "caption", "center", "cite", "code", "comment", "dd", "dfn", 
    "dir", "div", "dl", "dt", "em", "font", "form", "frame", 
    "frameset", "head", "hn", "hr", "html", "i", "img", "input", 
    "isindex", "kbd", "li", "link", "listing", "map", "marquee", 
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p", 
    "param", "plaintext", "pre", "samp", "select", "small", 
    "strike", "strong", "sub", "sup", "table", "td", "textarea", 
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  int k = 36,                //start point of search within array
      j = 18,                //first index shift amount
      x = -1;                //returned index number of found tag
  while(k >= 0 && k < 71) {  //while index within array range
    int c,                   //character within captured tag name
        d,                   //character within tag name in array
        f = 1;               //indicates whether a match or not
    char *p = t,             //ptr to a char in the captured tag
         *q = tags + k;      //ptr to a char in the array tag
    do {                     //for each char of the shorter tag
      if((c = *p++) != (d = *q++)) {    //if chars not same 
        if(c > d)            //if char in tag > char in array tag
          k -= j;            //abort test, shift down the array
        else                 //otherwise
          k += j;            //otherwise shift up the array
        if(j > 1)            //if the shift increment > 1 then
          j >>= 1;           //halve it, otherwise leave it = 1
        f = 0;               //set flag to indicate a mis-match
        break;               //go and return the flag value
      }
    } while(c != 0 && d != 0);   //end loop if either char = null
    if(flag) {               //if whole tag matched
      x = k;                 //set index number of the found tag
      break;                 //break out of the while loop
    }
  }
  return(x);                 //return index number of given tag
}                            //x = -1 if the tag was not found

Tag Call

We must now build a function TagCall( ) to call a special function to deal with the particular captured tag once it has found.

For this we need an array *funs[] which is an array of pointers-to-functions. Each of the functions thus pointed to takes a char pointer and two integers as input parameters and provides no return value. The string values with which the array is initialised are the names of the functions which perform the appropriate processing in response to each respective HTML tag name.

void TagCall(
  char *p,                    //points to the captured tag string
  int k,                      //index number of the captured tag
  int e                       //0 = it's a start tag, 1 = end tag 
) {
  static void (*funs[])(char * , int, int) = {
    Fa, Faddress, Fapplet, Farea, Fb, Fbase, Fbasefont, 
    Fbgsound, Fbig, Fblink, Fblockquote, Fbody, Fbr, 
    Fcaption, Fcenter, Fcite, Fcode, Fcomment, Fdd, Fdfn, 
    Fdir, Fdiv, Fdl, Fdt, Fem, Ffont, Fform, Fframe, 
    Fframeset, Fhead, Fhn, Fhr, Fhtml, Fi, Fimg, Finput, 
    Fisindex, Fkbd, Fli, Flink, Flisting, Fmap, Fmarquee, 
    Fmenu, Fmeta, Fnextid, Fnobr, Fnoframes, Fol, Fp, 
    Fparam, Fplaintext, Fpre, Fsamp, Fselect, Fsmall, 
    Fstrike, Fstrong, 	Fsub, Fsup, Ftable, Ftd, Ftextarea, 
    Fth, Ftitle, Ftr, 	Ftt, Ful, Fvar, Fwbr, Fxmp
  };
  (*(funs + k))(p, o, e);     //call appropriate tag function
}
The rather outlandish function call following the array calls the function whose address is the content of the kth element of the array funs[]. TagCall() would have to be called from TagFilter immediately after TagSearch had been called.

To speed things up, let us see if we can incorporate TagCall() into TagSearch() and thus simplify the whole process. All we need to do is to add the initialised array *funs[] to TagSearchl(), then replace the statement x = k; with the fancy function call above. Note that we no longer need the variable x.

The augmented version of TagSearch() which has been renamed more appropriately TagInt (Tag Interpreter) is shown below.

void TagInt(   //THE TAG INTERPRETER FUNCTION
  char *t,     //pointer to start of captured tag string
  int o,       //offset of start of parameters from start of tag
  int e        //0 indicates a start tag, 1 indicates and end tag
) {											
  static char *tags[] = {  //array of all 71 valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont", 
    "bgsound", "big", "blink", "blockquote", "body", "br", 
    "caption", "center", "cite", "code", "comment", "dd", "dfn", 
    "dir", "div", "dl", "dt", "em", "font", "form", "frame", 
    "frameset", "head", "hn", "hr", "html", "i", "img", "input", 
    "isindex", "kbd", "li", "link", "listing", "map", "marquee", 
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p", 
    "param", "plaintext", "pre", "samp", "select", "small", 
    "strike", "strong", "sub", "sup", "table", "td", "textarea", 
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  static int (*funs[])(char *, int, int) = {
    Fa, Faddress, Fapplet, Farea, Fb, Fbase, Fbasefont, 
    Fbgsound, Fbig, Fblink, Fblockquote, Fbody, Fbr, 
    Fcaption, Fcenter, Fcite, Fcode, Fcomment, Fdd, Fdfn, 
    Fdir, Fdiv, Fdl, Fdt, Fem, Ffont, Fform, Fframe, 
    Fframeset, Fhead, Fhn, Fhr, Fhtml, Fi, Fimg, Finput, 
    Fisindex, Fkbd, Fli, Flink, Flisting, Fmap, Fmarquee, 
    Fmenu, Fmeta, Fnextid, Fnobr, Fnoframes, Fol, Fp, 
    Fparam, Fplaintext, Fpre, Fsamp, Fselect, Fsmall, 
    Fstrike, Fstrong, 	Fsub, Fsup, Ftable, Ftd, Ftextarea, 
    Fth, Ftitle, Ftr, 	Ftt, Ful, Fvar, Fwbr, Fxmp
  };
  int k = 36,               //Start search half way up array
      j = 18,               //first index shift amount
      x = 0;                //default return from tag function
  while(k >= 0 && k < 71) { //while index within array range
    int c,                  //character within captured tag name
        d,                  //character within tag name in array
        f = 1;              //indicates whether a match or not
    char *p = t,            //ptr to a char in the captured tag
         *q = *(tags + k);  //ptr to a char in the array tag
    do {                    //for each char of the shorter tag
      if(( c = *p++) != (d = *q++)) {       //if chars not same 
        if(c > d)           //if char in tag > char in array tag
          k -= j;           //abort test, shift down the array
        else
          k += j;           //otherwise shift up the array
        if(j > 1)           //if the shift increment > 1 then
          j >>= 1;          //halve it, otherwise leave it = 1
        f = 0;              //set flag to indicate a mis-match
        break;              //go and return the flag value
      }
    } while(c != 0 && d != 0);    //end loop if either char null
    if(f) {                       //if whole tag matched
      x = (*(funs + k))(p, o, e); //call tag function
      break;                      //break out of outer while loop
    }
  }
  return(x);                //pass back the tag function's return
}

This page's parent within this Web Site. About this Web Site. Its home page. Email its Author.