David Mathog (mathog@seqaxp.bio.caltech.edu) has upgraded the TRAVERSAL
code from old versions of Lynx so that it works again.  It is now enabled
via a command line switch (-traversal) instead of via a compilation symbol
for creating a separate Lynx executable, as in those previous versions.
It can also be used in conjunction with a -crawl switch to make Lynx a
front end for a Web Crawler.
 

Usage:

   lynx [-traversal] [-realm] [-crawl] ["startpage"]


Added switches are:

  -traversal      Follow all http links derived from startpage that are
                  on the same server as startpage.  If startpage isn't
                  specified then the traversal begins with the default
                  startfile or WWW_HOME.

  -realm          Further restrict http links to ones in the same realm
                  (having a matching base URI) as the startpage (e.g.,
                  http://host/~user/ will restrict the traversal to that
                  user's public html tree).

  -crawl          With -traversal, outputs each unique hypertext page as
                  an lnk########.dat file in the format specified below.
                  With -dump, outputs only the startpage, in the same
                  format, to stdout.
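
The two link filters described above can be sketched in Python.  This is
only an illustration of the documented behaviour, not the Lynx source: the
function names are hypothetical, and the reading of "matching base URI" as
"everything up to the last slash of the startpage" is an assumption.

```python
def access_and_host(url):
    """Split a URL into its access scheme and host[:port] fields."""
    access, sep, rest = url.partition("://")
    if not sep:
        return "", ""
    return access, rest.split("/", 1)[0]


def same_server(startpage, link):
    """-traversal: keep only links on the same server as the startpage."""
    # Compared without case folding, following the note on startpage.
    return access_and_host(startpage) == access_and_host(link)


def base_uri(url):
    """Assumed definition of the realm: the URL up to its last '/'."""
    _, _, rest = url.partition("://")
    if "/" not in rest:
        return url + "/"
    return url[: url.rfind("/") + 1]


def same_realm(startpage, link):
    """-realm: additionally require the link to share the base URI."""
    return link.startswith(base_uri(startpage))
```

With a startpage of http://host/~user/, same_realm() accepts
http://host/~user/notes.html and rejects http://host/~other/.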


Note on startpage:

                  If a startpage is specified and contains any uppercase
                  characters, on VMS it should be enclosed in double-quotes.
                  Without the double-quotes, the startpage will go to all
                  lowercase on VMS.  Two comparisons are case sensitive:
                  the check of each link's access and host fields against
                  those extracted from startpage (to ensure the link is
                  not on another server), and the check against already
                  traversed links.  A lowercased startpage might therefore
                  be treated as a new link if it is encountered again with
                  uppercase letters.


Files created and/or used with the -traversal switch, based on definitions
in userdefs.h:

TRAVERSE_FILE (traverse.dat):
                  Contains a list of all URLs that were traversed.  A URL
                  that appears in this file will not be traversed again
                  (important if runs are started and stopped), so placing
                  an entry in this file BEFORE the run will block traversal
                  of that URL.  Unlike in reject.dat, a final * has no
                  effect here (see below).  Note that Lynx internal
                  client-side image MAP URLs will be included in this file
                  (e.g., LYNXIMGMAP:http://server/foo.html#map1), in
                  addition to the "real" (external) http URLs.

TRAVERSE_FOUND_FILE (traverse2.dat):
                  Contains a list of all URLs that were traversed, in the
                  order encountered or re-encountered (but not re-traversed)
                  during a traversal run, and the TITLEs of the documents
                  (separated from the URLs by TABs).  A URL and TITLE may be
                  present in this list many times.  To simplify the list,
                  on VMS use:  sort/nodups traverse2.dat;1 ;2
                  Note that the Lynx internal client-side image MAP
                  pseudo-documents, though "traversed", are not listed in
                  this file.  Only the http URLs and TITLEs derived from
                  the MAP's AREA tag HREFs that were traversed are
                  included.
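
On systems without the VMS sort command, the same simplification of
traverse2.dat can be sketched in Python (on Unix, sort -u gives a similar
result).  The function names are hypothetical; the only format assumed is
the one described above, one URL and TITLE per line separated by a TAB.

```python
def unique_entries(lines):
    """Return the distinct non-empty lines, sorted, like sort/nodups."""
    entries = {line.rstrip("\r\n") for line in lines if line.strip()}
    return sorted(entries)


def split_entry(entry):
    """Split one entry into (url, title) at the first TAB."""
    url, _, title = entry.partition("\t")
    return url, title


def simplify_file(source, target):
    """Write the de-duplicated entries of source to target."""
    with open(source) as infile:
        entries = unique_entries(infile)
    with open(target, "w") as outfile:
        for entry in entries:
            outfile.write(entry + "\n")
```

For example, simplify_file("traverse2.dat", "traverse2.sorted") writes the
simplified list to a new file and leaves the original untouched.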

TRAVERSE_REJECT_FILE (reject.dat):
                  Contains a list of URLs that have been rejected from the
                  traversal.  Once a URL has been entered in this list, it
                  will not be traversed.  URLs that end in a * will cause
                  rejection of all URLs that match up to the character before
                  the *. So for instance, to reject all htbin references on a
                  site put this line in the reject.dat file BEFORE starting
                  the run:  http://www.site.wherever:8000/htbin*

TRAVERSE_ERRORS (traverse.errors):
                  A list of links that could not be accessed or had an
                  unknown status returned by the http server.  If the
                  owner of the document containing the link is known via
                  a LINK REV="made" HREF="mailto:foo" in it and
                  MAIL_SYSTEM_ERROR_LOGGING was set true in userdefs.h
                  or lynx.cfg (not recommended!!!), a message about the
                  problem will be mailed to the owner as well.


Files created during traversals if the -crawl switch is included with the
-traversal switch:

lnk########.dat   Numbered output files containing the contents of traversed
                  hypertext documents in text format.  All hypertext links
                  within the document have been stripped, and the URL and
                  TITLE of the document are recorded as the first two lines,
                  e.g., for the seqaxp.bio.caltech.edu home page the first
                  two lines will be:

                  THE_URL:http://seqaxp.bio.caltech.edu:8000/
                  THE_TITLE:SAF Web server home page

                  The VMSIndex software is being adapted to use this
                  information to extract the corresponding URL and TITLE
                  for use in indexing the lnk########.dat files, e.g.:

                  $ build_index -
                    /url=(text="THE_URL:") -
                    /topic=(text="THE_TITLE:",EXCLUDE) -
                    /output=INDEX_NAME -
                    lnk*.dat

                  A clever person should be able to figure out a way to
                  index the lnk########.dat files on Unix as well.

                  If you want the hypertext links in the document to be
                  numbered, include the -number_links switch.  By default,
                  this will cause the list of References (URLs for the
                  numbered links) to be appended as well.  If you want
                  numbered links but not the References list, include the
                  -nolist switch as well.

                  Note that any client-side image MAP pseudo documents
                  that were "traversed" will not have lnk########.dat
                  output files created for them, but output files will
                  be created for "real" documents that were traversed
                  based on the HREFs of the MAP's AREA tags.
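
As a starting point for indexing the output files on Unix, the two header
lines can be read back with a few lines of Python.  This is a minimal
sketch, not a replacement for the VMSIndex setup: the function names are
hypothetical, and decoding the files as UTF-8 with replacement of
undecodable bytes is an assumption.

```python
from itertools import islice
from pathlib import Path

URL_TAG = "THE_URL:"
TITLE_TAG = "THE_TITLE:"


def read_header(lines):
    """Return (url, title) from the first two lines, or None."""
    first_two = [line.rstrip("\r\n") for line in islice(lines, 2)]
    if len(first_two) < 2:
        return None
    url_line, title_line = first_two
    if not (url_line.startswith(URL_TAG) and title_line.startswith(TITLE_TAG)):
        return None
    return url_line[len(URL_TAG):], title_line[len(TITLE_TAG):]


def build_catalog(directory="."):
    """Map each lnk*.dat file name in directory to its (url, title)."""
    catalog = {}
    for path in sorted(Path(directory).glob("lnk*.dat")):
        # Assumed encoding; undecodable bytes are replaced, not fatal.
        with path.open(encoding="utf-8", errors="replace") as handle:
            header = read_header(handle)
        if header is not None:
            catalog[path.name] = header
    return catalog
```

The resulting dictionary can be handed to whatever indexing tool is in use,
which then only needs to index the text that follows the two header lines.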

This functionality is still under development.  Feedback and suggestions
are welcome.
