lyndex.sh - Lynx-based site indexer
I was impressed by the Lynx’s crawling ability so much, that I write this indexer script to process it’s result. Not much theory, just a glimpse
at the CRAWL.announce then a nice job of text processing. Fortunately Lynx does all the hard job, of
extracting URL and stripping the HTML tags, but sadly, some useful information is going away too. Like keywords and description meta
tag values and the distinction of
heading elements. So this is just a simple solution, not a good one.
Currently my site is quite small, but the ratio says to me that is effective enough. Ok, I know, my English is poor and I use a limited amount of words, so is not a big deal to index them. But I hope this ratio will worsen in time.
Stage | Files | Size |
---|---|---|
full site | 62 | 594 Kb |
crawled text | 48 | 242 Kb |
temporary data | 1 | 95 Kb |
resulted info | 5 | 44 Kb |
lyndex.sh spent only a few seconds on my site, but may work for long time. During the work it writes something about it’s activity :
Usage
lyndex.sh is expects no instructions directly. Usually there should be one configuration file for each site you want to index, then you just pass the configuration file’s name to work with :
Lyndex version 1.0b october 2005 written by Feherke create a word index from the Lynx traversal crawl files Syntax : lyndex.sh [inifile] Parameter : inifile - crawling settings file ( lyndex.ini )
lyndex.sh only builds some indexes then lets the user to do whatever he wants with them :
Lyndex version 1.0b october 2005 written by Feherke Reading settings... Ok ( ./lyndex.ini ) Creating work directory... Ok ( lnkyLVdys ) Setting up restrictions... Ok ( 1 ) Composing user-agent... Ok ( Lyndex/1.0 Lynx/2.8.4rel.1 (Linux 2.4.19-4GB i686; console) ) Crawling http://rootshell/ ... Ok ( 234K ) Removing old data... Ok ( 5 files ) Generating checksums... Ok ( 44 entries ) Excluding duplicates... Ok ( 2 pieces ) Creating file list... Ok ( 42 files ) Extracting words... Ok ( 5252 entries ) Calculating maximum relevance... Ok ( 22 ) Parsing words... Ok ( 1351 words ) Creating table of contents... OK ( 26 letters ) Creating general file... OK ( 2 items ) Checking for spelling errors... OK ( 268 occurences ) Removing temporary data... Ok ( 322K ) Removing work directory... Ok ( lnkyLVdys ) Done ( 10 seconds )
Configuration
The configuration file
- is mandatory
- can be specified in command line
- by default is located in the same directory with the script
- by default has the same name with the script, with .sh extension removed and .ini extension added
- is sourceable shell script
# Lyndex version 1.0b october 2005 written by Feherke
# create a word index from the Lynx traversal crawl files
# settings file
# base url and starting point for crawl from
# should be the highest level directory from the structure
starturl="http://example.com/"
# minimum word length
# words with less then this many characters will be excluded
minword=3
# maximum ASCII-art length
# words containing one character more then this many time will be excluded
maxaart=3
# maximum percent of pages, including a relevant word
# if more then this many pages contains it, the word is not relevant
# the word could be irrelevant only if all pages contain it in same count
maxrele=90
# decrement almost irrelevant word scoring
# for the words above relevance limit, but with different count per page
# the count could be
decirel="yes"
# size of page sample in characters
# will be displayed in the page list
sample=250
# directory to exclude from crawling
# more then one directories can be specified in an array :
# nocrawl=( "http://example.com/sample/" "http://example.com/temp/" )
nocrawl="http://example.com/private/"
# checksum calculation tool
# used to detect multiple instances of a page
# some possible values :
# cksum | md5sum | sha1sum | sum
# du -b | wc -c
checksum="md5sum"
# spelling verification tool
# if specified, will be used to check the word list
# some possible values :
# ispell -a | aspell -a
shellcheck="ispell -a"
Versions
- 1.0 - September 2005
- Initial release.
- 1.0b - October 2005
- Exclude duplications when more URLs are pointing to the same content.
- Exclude too short words ( option
minword
). - Poor ASCII art recognition by watching characters repeated multiple times ( option
maxaart
). - Spell checking ( option
spellcheck
).
Plans
- Documentation.
Download
You can find the related files on GitHub in my Bash-script repository’s lyndex directory :
- lyndex.sh - script
- lyndex.ini - configuration