I've added an entry to the FAQ about the many Python wrappers now
available for Xapian, in an attempt to help new users work out which
Python interface to Xapian would be sensible for them to use.
It's at http://trac.xapian.org/wiki/FAQ/PythonWrappers and could do with
having some information on ore.xapian and mbxap added. Also, if you
know of any Python wrappers around Xapian which aren't listed there,
please add them (or reply to this message with details). Finally, it's
probably biased towards the Xappy wrappers, since I have a vested
interest in them, so if you feel it appropriate to edit out such bias, I
won't be offended!
I have a feeling there might be a similar number of PHP wrappers around
out there, but I haven't used them - if there are, it would be great if
someone could start a FAQ/PhpWrappers page listing them...
This is search-xapian in the Xapian SVN user, so it's already in SVN,
just not in the main repo. It's possible some permissions in there are
wrong, but I don't remember who set that up (certainly wasn't me), so
I'm not sure what to look for.
Everything looks broadly right to me, but I'm not used to playing with
the internal SVN files yet...
J
I've linked to the definition in the glossary, which seems better than
duplicating text, or writing another definition of RSet.
The whole document would benefit from an overhaul - it's a bit formal
in places where it doesn't help, and somewhat opinionated. I've made
a few changes, but feel free to improve it further.
Cheers,
Olly
Looks like a permissions problem on the new server - is this easy to
fix, James? If not, I should sort out migrating Search-Xapian into
the main SVN tree with a bit more urgency...
Meanwhile, you can find these in the examples subdirectory of the
Search::Xapian source distribution.
Cheers,
Olly
On Jul 1, 2008, at 5:49 PM, James Aylett wrote:
Dual quad core Intel Xeon 2.33GHz
Linux 2.6.24-16-server (x86_64)
Ubuntu 8.04
Two, in a RAID-1 configuration.
RAID Controller: AMCC 9650SE-2LP DISK
ATA WDC WD2500YS-01S (250GB 7200 RPM 16MB Cache SATA 3.0Gb/s)
ATA WDC WD2500YS-01S (250GB 7200 RPM 16MB Cache SATA 3.0Gb/s)
Let me know if there is more info you'd like...
--
--ruaok Somewhere in Texas a village is *still* missing its idiot.
Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net
On Jul 1, 2008, at 6:14 PM, Olly Betts wrote:
Ah ha -- that explains it -- thanks.
OK, I see how that can be really useful. Since I am providing an end
user search service, should I write my own parser and generate my own
queries or should I post-process the results from QueryParser to tack
on the fields that would give the user better search results?
--
--ruaok Somewhere in Texas a village is *still* missing its idiot.
Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net
My saga of tweaking Xapian to work right for me continues -- last
night I figured out the core issue that I am having and I'm hoping you
folks can direct me how to tweak xapian to make it behave right.
Consider the search for an artist. Let's consider "Duran Duran" as an
example right now. Without any weighting tricks, when I search for
Duran Duran I get:
100  Duran Duran Duran
 88  Duran Duran
 66  Mike Duran
 66  Duran Y Garcia
 66  Duran
(FYI, Duran Duran Duran is a valid band, sigh)
As a user I would expect to see "Duran Duran" at 100% since my query
matches one document in the database EXACTLY. In text searching terms,
I understand the result since more occurrences of the word ought to
yield a higher score. But for my round-peg-into-a-square-hole approach
of searching my SQL database with Xapian, this isn't the best result.
Is there any way I can tweak Xapian to move exact matches to 100% and
matches that have more/fewer terms lower?
--
--ruaok Somewhere in Texas a village is
Incidentally, if you want to see why the weighting schemes work like
this, consider the case of a database with two documents, one of which
contains all the text from the first twice. You probably want to give
these similar weight - certainly the doubled document shouldn't get
twice the weight for most applications.
For BM25 you can adjust a parameter to tune how much influence the
document length has.
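
To make the doubled-document argument concrete, here is a sketch using the
textbook BM25 term-frequency factor. This is an assumed form for illustration
(Xapian's own BM25 implementation differs in detail), and the average document
length is treated as fixed. The symbols f, L, \bar{L}, k_1 and b are my
notation, not from the thread.

```latex
% f = within-document frequency of the term, L = document length,
% \bar{L} = average document length (held fixed), k_1 > 0, 0 \le b \le 1.
\[
  w(f, L) = \frac{f\,(k_1 + 1)}{f + k_1\left(1 - b + b\,\frac{L}{\bar{L}}\right)}
\]
% Doubling the text doubles both f and L:
\[
  w(2f, 2L) = \frac{2f\,(k_1 + 1)}{2f + k_1\left(1 - b + 2b\,\frac{L}{\bar{L}}\right)}
\]
% With full length normalisation (b = 1) the doubled document scores the same:
\[
  w(2f, 2L)\Big|_{b=1}
    = \frac{2f\,(k_1 + 1)}{2f + 2k_1\,\frac{L}{\bar{L}}}
    = \frac{f\,(k_1 + 1)}{f + k_1\,\frac{L}{\bar{L}}}
    = w(f, L)\Big|_{b=1}
\]
% With no length normalisation (b = 0) it scores higher, but less than twice:
\[
  w(2f, 2L)\Big|_{b=0} = \frac{2f\,(k_1 + 1)}{2f + k_1}
    \;<\; \frac{2f\,(k_1 + 1)}{f + k_1} = 2\,w(f, L)\Big|_{b=0}
\]
```

Values of b between 0 and 1 sit between these two extremes, which is the
sense in which the parameter tunes the influence of document length.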
If you're happy using QueryParser, just apply the above to the Query
object it produces (i.e. query in the above code snippet comes from
QueryParser).
Cheers,
Olly
I would suggest that you take the artist name, normalise it (squash
punctuation and whitespace to a single space or nothing and casefold;
perhaps drop a leading "the" or trailing ", the") and add this as a
prefixed term.
Then do the same to the query string and postprocess the parsed query by
adding this term (probably with OP_OR, so that punctuation normalisation
works even if the parsed query doesn't match).
So in this case, the artist term would be "XARTISTduran duran" or
perhaps "XARTISTduranduran".
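
A minimal sketch of that normalisation, assuming ASCII-only casefolding
(non-ASCII bytes are copied through unchanged, so real artist names would
want proper Unicode handling). The function names normalise_artist and
artist_term are hypothetical, not part of Xapian:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase ASCII letters and squash every run of
// punctuation/whitespace into a single space, with none at either end.
std::string normalise_artist(const std::string& name) {
    std::string out;
    bool pending_space = false;
    for (unsigned char ch : name) {
        if (ch >= 0x80 || std::isalnum(ch)) {
            if (pending_space && !out.empty()) out += ' ';
            pending_space = false;
            // Only ASCII is casefolded; other bytes pass through as-is.
            out += static_cast<char>(ch < 0x80 ? std::tolower(ch) : ch);
        } else {
            pending_space = true;
        }
    }
    // Drop a leading "the" or a trailing ", the" (the comma is already gone).
    const std::string lead = "the ";
    const std::string trail = " the";
    if (out.size() > lead.size() && out.compare(0, lead.size(), lead) == 0) {
        out.erase(0, lead.size());
    } else if (out.size() > trail.size() &&
               out.compare(out.size() - trail.size(), trail.size(), trail) == 0) {
        out.erase(out.size() - trail.size());
    }
    return out;
}

// Build the prefixed term, using the XARTIST prefix from the text above.
std::string artist_term(const std::string& name) {
    return "XARTIST" + normalise_artist(name);
}
```

The same function would be applied at index time (adding the result to the
document as a term) and at search time (combining a Query for the term with
the parsed query), so that both sides agree on the normalised form.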
Cheers,
Olly
Thanks for the link.
There are some interesting ideas there. One issue they don't consider
though (which is important for us) is being able to efficiently splice
new entries into existing posting lists (they don't look at indexing
speed at all saying it's a "one-time operation" so it seems they're
only thinking about the non-incremental case). That's the big benefit
of the encoding we currently use for posting lists (which for flint is
what they call "vbyte").
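
For anyone unfamiliar with the term, here is a rough sketch of textbook
variable-byte coding of docid gaps. It is an illustration only: the byte
layout flint actually uses may differ, and the function names are
hypothetical. It assumes docids are stored in increasing order.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append one value: 7 data bits per byte, low bits first; the high bit of
// a byte is set when more bytes follow.
void vbyte_append(std::string& out, std::uint32_t value) {
    while (value >= 0x80) {
        out += static_cast<char>((value & 0x7f) | 0x80);
        value >>= 7;
    }
    out += static_cast<char>(value);
}

// Encode increasing docids as the gaps between consecutive entries.
std::string encode_postings(const std::vector<std::uint32_t>& docids) {
    std::string out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docids) {
        vbyte_append(out, id - prev);
        prev = id;
    }
    return out;
}

// Decode by accumulating the gaps back into absolute docids.
std::vector<std::uint32_t> decode_postings(const std::string& data) {
    std::vector<std::uint32_t> ids;
    std::uint32_t value = 0, prev = 0;
    int shift = 0;
    for (unsigned char byte : data) {
        value |= static_cast<std::uint32_t>(byte & 0x7f) << shift;
        if (byte & 0x80) {
            shift += 7;          // continuation byte: more bits to come
        } else {
            prev += value;       // final byte of this gap
            ids.push_back(prev);
            value = 0;
            shift = 0;
        }
    }
    return ids;
}
```

Because each gap is byte-aligned and encoded independently, adding a new
docid at the end is just one more vbyte_append of the gap from the previous
last entry; nothing already written needs to be re-encoded.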
Cheers,
Olly
Couple of detail questions:
* what processor?
* what OS?
* how many spindles behind the FS volume?
* what hard disks?
All hard data is good data, but obviously it's even better if there's
context as well -- apologies if you've already given any of these
details, but I didn't notice them recently in the thread.
(By the way, slamming the disks during indexing is what you want to do
unless you're also searching off the same database. A breakdown of the
type of CPU usage will help analysis here -- iowait versus sys/user
will tell you when you're starting to become IO bound. 4-6 processes
to max out your storage is pretty good :-)
J
I assume you're using scriptindex, and are turning that scriptindex
input field into a term (probably a boolean term). Are you applying
the termcount *solely* to the type field? That won't do much here, and
other emergent properties are giving you the result you see.
If you're bumping the termcount to 100 for all terms you generate for
a Xapian document with type=album, that's a different matter. From
your description, I don't think you are doing - can you confirm one
way or the other?
J
Are you adding this type term to queries? If not, the effect of
indexing the type term with those termcounts will be to increase the
document length of albums. That will tend to decrease the importance of
each occurrence of "love" in the album title, so albums will indeed tend
to rank lower.
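
To put rough numbers on that effect, here is a small sketch using the
textbook BM25 term-frequency factor. This is an assumption for illustration
(it is not Xapian's exact code), and the lengths and parameter values used
in the note below are made up:

```cpp
// Textbook BM25 term-frequency factor (assumed form, not Xapian's exact code).
//   wdf    - occurrences of the query term in the document
//   doclen - document length (sum of all wdfs in the document)
//   avglen - average document length in the database
//   k1, b  - the usual BM25 tuning parameters (values here are illustrative)
double bm25_tf(double wdf, double doclen, double avglen,
               double k1 = 1.0, double b = 0.5) {
    // Length normalisation: longer-than-average documents get norm > 1,
    // which shrinks the contribution of each occurrence of the term.
    double norm = 1.0 - b + b * (doclen / avglen);
    return wdf * (k1 + 1.0) / (wdf + k1 * norm);
}
```

Take a hypothetical two-word title containing "love" once, plus the type
term. With wdf 1 on the type term the document length is 3; with wdf 100 it
is 102. Against an assumed average length of 20, bm25_tf(1, 102, 20) comes
out lower than bm25_tf(1, 3, 20), so the album's "love" counts for less.
With b set to 0 the two are equal, since length is then ignored.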
Perhaps a better approach would be to keep the type term with wdf 1
regardless of the type, and then take your query and adjust it like so:
Xapian::Query album_boost("XTYPEalbum");
album_boost = Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, album_boost, 4.2);
query = Xapian::Query(Xapian::Query::OP_AND_MAYBE, query, album_boost);
You can adjust the 4.2 factor to alter how much albums are boosted, and
you can also search "fairly", or boost individual tracks instead if you
prefer - and none of this requires a reindex.
Cheers,
Olly
Hi!
Every time I think I've got the Xapian search for MusicBrainz licked I
ask for more feedback and my community finds yet another test case
that throws a monkey-wrench into my project. And the more I try to
understand Xapian's weighting system, the less I really understand it.
Let me ask a specific question -- in my release index (an index of CD
titles, essentially) I have a field called type. When the value of
this field is "album" I give it a termcount of 100. All other values
for this field and all other fields get a termcount of 1.
For the enquire, I use a stock object. I do not define a weighting
system, do not tinker with doc order or sort order. When I search for
the term "love" in the release title (very common term), the top hits
are the ones that contain the word "love" twice. Good.
But, for all the hits that have the word "love" in them once, I would
expect to see the releases of type "album" to be near the top. But
they are not:
http://musicbrainz.homeip.net/search/tex
On Jun 30, 2008, at 11:57 PM, Olly Betts wrote:
Ok, I've tinkered with the setup a bit. I've found that if I give
xapian loads and loads of RAM, it doesn't even get around to using all
the RAM I give it -- at most each process used 5% of 8G of RAM.
I measured disk access with:
iostat -x 10 (10 second disk usage average window)
And CPU util with top. I've found:
3 processes: 95% - 96% CPU usage for each process, 40%-60% disk usage
4 processes: 95% - 96% CPU usage for each process, 60%-90% disk usage
5 processes: 92% - 94% CPU usage for each process, 80%-100% disk usage
6 processes: 91% - 93% CPU usage for each process, 100% disk usage sustained
It looks like 4 processes is the sweet spot that doesn't utterly slam
the machine. This is much better than I had anticipated -- well done
Xapian team!
--
--ruaok Somewhere in Texas a village is *still* missing its idiot.
Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net