Search Performance Statistics
From CommerceNet Wiki
by Kragen Sitaker; see also Publications
2004-09-28
This regards an indexer very much inspired by Lucene.
Here's what I get running my current indexer on my 150-million-byte current mailbox:
kragen@pragmatic:~/sdc1/mail$ time ./reindexmail
real 4m58.906s user 0m31.430s sys 0m2.580s
real 0m58.883s user 0m29.380s sys 0m0.770s
real 5m59.752s user 1m0.820s sys 0m3.360s
Reindexmail is as follows:
time ~/devel/maildex tmp.mboxtail > tmp.mboxtail.idx/contents
time PYTHONPATH=~/devel python -c '__import__("maildex").lindb("tmp.mboxtail.idx").headwords("contents")'
That is to say, it takes six minutes to fully index the mail, of which the first 4:59 are spent indexing it, and the remaining 59 seconds are spent making headwords for the index. This is using the "fast mbox inverted indexing" thing I posted to kragen-hacks in April to do the mail indexing, and the Python stuff that preceded it to build the headwords. And all of this is on my 500MHz PIII laptop.
This adds up to only 416 thousand bytes per wallclock second, but you may notice that the CPU times are much less --- totaling 64 seconds, or 2.3 million bytes per CPU second. This discrepancy is largely because this indexer uses nearly all of memory for a large hash table. These numbers get marginally better when there is nothing else happening on the machine, but not enormously --- apparently even idle applications usually create significant memory pressure.
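As a rough check on the arithmetic above, here is a minimal sketch that recomputes both throughput figures from the `time` output. It assumes the mailbox is exactly the rounded 150 million bytes, so the results are approximate; the variable and function names are just for illustration.

```python
# Figures copied from the `time` output above. MAILBOX_BYTES is the rounded
# "150-million-byte" size from the text, so the results are approximate.
MAILBOX_BYTES = 150_000_000


def seconds(minutes, secs):
    """Convert a `time`-style XmY.YYYs reading to seconds."""
    return minutes * 60 + secs


wall = seconds(5, 59.752)                    # real 5m59.752s (whole run)
cpu = seconds(1, 0.820) + seconds(0, 3.360)  # user + sys for the whole run

bytes_per_wall_second = MAILBOX_BYTES / wall
bytes_per_cpu_second = MAILBOX_BYTES / cpu

print(f"wallclock: {bytes_per_wall_second:,.0f} bytes/s")
print(f"CPU:       {bytes_per_cpu_second:,.0f} bytes/s")
```

The gap between the two numbers is the time the process spent waiting rather than computing.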
You can ameliorate this problem by indexing the mailbox in segments, as I did in the original Python indexer, and merging the segments together --- an approach very much inspired by Lucene. The Python indexer's raw speed, before merging, is only about 220 thousand bytes per second on this same machine, and a substantial part of the Python indexer's run time is in the multiway merging phase --- indexing the same mailbox took 1068 seconds, of which 337 were spent in merging. So using the Python merging phase on the C indexer's output would defeat the purpose of efficiency, since the Python merging phase would take far longer than the C indexing phase. Consequently I would need to speed up the merging phase substantially, probably by rewriting it, too, in C, for the whole thing to be reasonably fast --- I see no reason to expect merging to take a larger or smaller part of the time than in the Python world.
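To make the segment-and-merge idea concrete, here is a minimal sketch in Python. It is not the code from maildex or from Lucene: the function names, the whitespace tokenizer, and the in-memory segments are all assumptions for illustration. A real indexer would write each sorted segment to disk before merging, which is what keeps the hash table from growing to fill memory.

```python
import heapq
from collections import defaultdict


def index_segment(docs):
    """Build one small segment: a term-sorted list of (term, [doc_id, ...])."""
    postings = defaultdict(list)
    for doc_id, text in docs:
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    return sorted(postings.items())


def merge_segments(segments):
    """Multiway-merge term-sorted segments, joining postings for equal terms."""
    merged = []
    # heapq.merge is stable, so postings from earlier segments come first.
    for term, doc_ids in heapq.merge(*segments, key=lambda pair: pair[0]):
        if merged and merged[-1][0] == term:
            merged[-1][1].extend(doc_ids)
        else:
            merged.append((term, list(doc_ids)))
    return merged


def index_in_segments(docs, segment_size):
    """Index docs in fixed-size batches, then merge the batches."""
    docs = list(docs)
    segments = [index_segment(docs[i:i + segment_size])
                for i in range(0, len(docs), segment_size)]
    return merge_segments(segments)


messages = [(0, "lucene index"), (1, "index merge"),
            (2, "merge lucene"), (3, "nutch")]
for term, doc_ids in index_in_segments(messages, segment_size=2):
    print(term, doc_ids)
```

Each segment only needs a hash table big enough for its own batch, and the merge reads the sorted segments sequentially, so neither phase needs the whole index in memory at once.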
That's why I haven't done it so far, preferring to just use machines with more RAM.
By contrast, running Nutch over mail from the same mailbox, split out into one HTML file per original message plus one HTML file per email address, yields speeds in the status: lines around 480 kilobits per second --- or 60 kilobytes per second --- on a 2GHz Athlon, which CPU is roughly four times as fast as the one that yielded the above measurements. I'm not sure why my Python is ten times as fast as Nutch, although the C indexer is perhaps another ten times as fast as the Python version. But here's a list of things Nutch is doing that my toy indexers aren't:
- saving a copy of the content
- parsing HTML
- omitting stopwords
- converting character sets
- doing dynamic dispatch to select tokenizers
- handling multiple fields per document
- handling a new file per message
- inserting extra tokens to facilitate phrase searching
- stemming
- saving off HTML link structure
- following broken links (there are lots of them in the mboxburst'ed output)
- logging progress messages
Still, I'm not sure how all these add up to a factor of 100 over just reading and indexing all the text.
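For what it's worth, here is the back-of-the-envelope arithmetic behind those factors, as a small sketch. Every input is one of the approximate figures quoted above, the constant names are mine, and the comparison mixes wallclock throughput (Nutch, Python) with per-CPU-second throughput (C), so only the orders of magnitude mean anything.

```python
# All inputs are the approximate figures quoted in the text.
NUTCH_BYTES_PER_SEC = 480_000 / 8   # 480 kilobits/s on the 2GHz Athlon
ATHLON_SPEEDUP = 4                  # Athlon is "roughly four times as fast"
PYTHON_BYTES_PER_SEC = 220_000      # Python indexer, before merging
C_BYTES_PER_CPU_SEC = 2_300_000     # C indexer, per CPU second

# Scale Nutch's speed down to what the 500MHz laptop might manage.
nutch_on_laptop = NUTCH_BYTES_PER_SEC / ATHLON_SPEEDUP

python_vs_nutch = PYTHON_BYTES_PER_SEC / nutch_on_laptop
c_vs_python = C_BYTES_PER_CPU_SEC / PYTHON_BYTES_PER_SEC
c_vs_nutch = C_BYTES_PER_CPU_SEC / nutch_on_laptop

print(f"Python vs Nutch: {python_vs_nutch:.1f}x")
print(f"C vs Python:     {c_vs_python:.1f}x")
print(f"C vs Nutch:      {c_vs_nutch:.0f}x")
```

The ratios land in the same range as the round factors of ten and a hundred used above.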
