HtmlZap sample program

This little code snippet opens all the files that match the filespec j:\intro\page*.htm. It steps through each such file and:

  • removes all the <font>, </font>, and <p> tags.
  • converts all </p> tags to <p> tags.
  • for each <img> tag, copies the source bitmap file to the j:\intro\copy subdirectory, then uses an invisible Picture Box control to find the width and height of the picture and add the correct dimensions to the <img> tag in the output file.
  • writes the modified file to the same name in the j:\intro\copy subdirectory.

Anybody who wonders why I would be doing things like this has never worked with HTML files created by Microsoft Word's "Save as HTML" command... <g>


Fetching web pages

HtmlZap users frequently ask me how to use the component to parse pages retrieved directly from the Internet. HtmlZap itself doesn't have this functionality: it can only parse HTML contained in a local disk file or a string.

However, there are many components (commercial and otherwise) that can fetch web pages and return them in a form HtmlZap can use. The simplest approach may be to use the Microsoft Internet Transfer Control, which is included with Visual Studio. If you were to place an Internet Transfer Control named ITC on your VB form, you could execute the statement:

     HZ.LoadBuffer ITC.OpenURL("", icByteArray)

This will cause the Internet Transfer Control to connect to the target URL (Google's main page in the example), retrieve its contents as a Byte Array, then pass the array to the HtmlZap component named HZ via the LoadBuffer method. Using a Byte Array is a tiny bit faster than a String (though a String will also work) because it saves a conversion to and from Unicode.
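
In a real program it can help to wrap the call in a small routine with an error handler, since a request can time out. What follows is only a sketch, assuming the same ITC and HZ controls as above; the routine name LoadFromWeb and the 30-second timeout are illustrative choices, not part of HtmlZap.

    Private Sub LoadFromWeb(ByVal url As String)
    ' Hypothetical helper: fetch a page with the Internet Transfer
    ' Control (ITC) and hand the bytes to the HtmlZap control (HZ)
    Dim page() As Byte

    On Error GoTo FetchFailed               ' A timed-out request raises an error

    ITC.RequestTimeout = 30                 ' Seconds to wait (illustrative value)
    page = ITC.OpenURL(url, icByteArray)    ' Retrieve the page as a Byte Array
    HZ.LoadBuffer page                      ' Parse it just like a local file
    Exit Sub

    FetchFailed:
    MsgBox "Could not fetch " & url & ": " & Err.Description, _
        vbExclamation, "LoadFromWeb"

    End Sub

Call it with the URL of your choice, then step through the page with the same HZ.EOF / HZ.Next loop used for local files.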

If you'd rather not use the Internet Transfer Control, which doesn't work from scripting languages, you can try a freeware component such as AspTear, though I've never used it myself. Several commercial IP component suites include tools that can retrieve pages over HTTP; I've had really good luck with the IP*Works suite from /n software.


The source

    Private Sub Demo()
    ' Parse the intro
    Dim of As Integer
    Dim sf As String, pfn As String
    Dim pict As String

    Const SrcDir = "j:\intro\"          ' Source directory

    HZ.CompressWS = False               ' Don't compress whitespace

    sf = Dir(SrcDir + "page*.htm")      ' Get all page*.htm files

    While sf <> ""

        HZ.Load SrcDir + sf             ' Load a file

        of = FreeFile                   ' Open the copy
        Open SrcDir + "copy\" + sf For Output As #of

        While Not HZ.EOF                ' Loop through the entire source file

            If HZ.IsTag Then            ' This is a tag
                Select Case HZ.TagName  ' Which one?

                    Case "font", "/font", "p"
                        ' Do nothing... just remove these

                    Case "/p"
                        Print #of, "<p>"; ' Convert </p> to <p>

                    Case "img"
                        pfn = LCase$(HZ.Param("src"))
                        FileCopy SrcDir + pfn, SrcDir + "copy\" + pfn
                        pict = SrcDir + pfn
                        ' Use an invisible picture control to get picture sizes
                        ' (PicSizer needs AutoSize = True and ScaleMode = vbPixels)
                        PicSizer.Picture = LoadPicture(pict)
                        Print #of, "<img src="""; pfn; """ width="; _
                            Format$(PicSizer.ScaleWidth); " height="; _
                            Format$(PicSizer.ScaleHeight); ">";

                    Case Else               ' All other tags
                        Print #of, "<"; HZ.ToString; ">";

                End Select
            Else                            ' Not a tag
                Print #of, HZ.Text;         ' Just transcribe text unchanged
            End If

            HZ.Next                         ' Get the next slice

        Wend                                ' End of the loop through this file

        Close #of                           ' Close the copy
        HZ.Reset                            ' Reset the Html Zapper


        sf = Dir                            ' Get next filename

    Wend                                    ' End of the loop through all files

    MsgBox "Done", vbOKOnly, "Demo Copy"

    End Sub

Last revised: 24 September 2002