Xapian


Xapian is a full-text search engine. It is written in C++, but has all the bindings you would probably want:

  • SWIG bindings, tested for Python, PHP, Tcl, C#, and Ruby
  • Java JNI bindings are included in the xapian-bindings module
  • Perl XS bindings are available from CPAN in module Search::Xapian

Bindings

  • Xapwrap – a Python binding for Xapian

See Also

  • Lucene – a Java Apache search engine
www.xapian.org
GNU General Public License

Content Tagged Xapian

FAQ page about python wrappers

I've added an entry to the FAQ about the many Python wrappers now available for Xapian, in an attempt to help new users work out which Python interface to Xapian would be sensible for them to use. It's at http://trac.xapian.org/wiki/FAQ/PythonWrappers and could do with having some information on ore.xapian and mbxap added. Also, if you know of any Python wrappers around Xapian which aren't listed there, please add them (or reply to this message with details). Finally, it's probably biased towards the Xappy wrappers, since I have a vested interest in them, so if you feel it appropriate to edit out such bias, I won't be offended! I have a feeling there might be a similar number of PHP wrappers around out there, but I haven't used them - if there are, it would be great if someone could start a FAQ/PhpWrappers page listing them...

Xapian: Xapian General Mailing List

Re: Getting started with Xapian in Perl

This is search-xapian in the Xapian SVN user, so it's already in SVN, just not in the main repo. It's possible some permissions in there are wrong, but I don't remember who set that up (certainly wasn't me), so I'm not sure what to look for. Everything looks broadly right to me, but I'm not used to playing with the internal SVN files yet... J

Xapian: Xapian General Mailing List

Re: Can someone please explain this paragraph?

I've linked to the definition in the glossary, which seems better than duplicating text, or writing another definition of RSet. The whole document would benefit from an overhaul - it's a bit formal in places where it doesn't help, and somewhat opinionated. I've made a few changes, but feel free to improve it further. Cheers, Olly

Xapian: Xapian General Mailing List

Re: Getting started with Xapian in Perl

Looks like a permissions problem on the new server - is this easy to fix, James? If not, I should sort out migrating Search-Xapian into the main SVN tree with a bit more urgency... Meanwhile, you can find these in the examples subdirectory of the Search::Xapian source distribution. Cheers, Olly

Xapian: Xapian General Mailing List

Re: "Similar documents"

I've just added an FAQ entry for this: http://trac.xapian.org/wiki/FAQ/FindSimilar Cheers, Olly

Xapian: Xapian General Mailing List

Re: Improving indexing speed

On Jul 1, 2008, at 5:49 PM, James Aylett wrote:

  • Processor: Dual quad core Intel Xeon 2.33GHz
  • OS: Linux 2.6.24-16-server (x86_64), Ubuntu 8.04
  • Spindles: Two, in a RAID-1 configuration. RAID Controller: AMCC 9650SE-2LP
  • Disks:
      ATA WDC WD2500YS-01S (250GB 7200 RPM 16MB Cache SATA 3.0Gb/s)
      ATA WDC WD2500YS-01S (250GB 7200 RPM 16MB Cache SATA 3.0Gb/s)

Let me know if there is more info you'd like... -- --ruaok Somewhere in Texas a village is *still* missing its idiot. Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net

Xapian: Xapian General Mailing List

Re: Help with weights

On Jul 1, 2008, at 6:14 PM, Olly Betts wrote: Ah ha -- that explains it -- thanks. OK, I see how that can be really useful. Since I am providing an end user search service, should I write my own parser and generate my own queries or should I post-process the results from QueryParser to tack on the fields that would give the user better search results? -- --ruaok Somewhere in Texas a village is *still* missing its idiot. Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net

Xapian: Xapian General Mailing List

Exact matches

My saga of tweaking Xapian to work right for me continues -- last night I figured out the core issue that I am having and I'm hoping you folks can direct me how to tweak Xapian to make it behave right. Consider the search for an artist. Let's consider "Duran Duran" as an example right now. Without any weighting tricks, when I search for Duran Duran I get:

  100  Duran Duran Duran
   88  Duran Duran
   66  Mike Duran
   66  Duran Y Garcia
   66  Duran

(FYI, Duran Duran Duran is a valid band, sigh)

As a user I would expect to see "Duran Duran" at 100% since my query matches one document in the database EXACTLY. In text searching terms, I understand the result, since more occurrences of the word ought to yield a higher score. But for my round-peg-into-a-square-hole approach of searching my SQL database with Xapian, this isn't the best result. Is there any way I can tweak Xapian to move exact matches to 100% and rank matches that have more or fewer terms lower? -- --ruaok Somewhere in Texas a village is

Xapian: Xapian General Mailing List

Re: Help with weights

Incidentally, if you want to see why the weighting schemes work like this, consider the case of a database with two documents, one of which contains all the text from the first twice. You probably want to give these similar weight - certainly the doubled document shouldn't get twice the weight for most applications. For BM25 you can adjust a parameter to tune how much influence the document length has. If you're happy using QueryParser, just apply the above to the Query object it produces (i.e. query in the above code snippet comes from QueryParser). Cheers, Olly
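The doubled-document case can be checked numerically. The following is a minimal sketch using the textbook BM25 term-frequency component, not Xapian's own implementation; the function name, the document lengths, and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration, and the idf factor is left out because it is the same for both documents.

```python
def bm25_tf(wdf, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (idf factor omitted).

    b controls how strongly document length is normalised:
    b=0 ignores length entirely, b=1 normalises fully.
    """
    norm = (1.0 - b) + b * (doc_len / avg_len)
    return wdf * (k1 + 1.0) / (k1 * norm + wdf)


# Hypothetical two-document database: A holds the term once in 10 terms,
# B holds all of A's text twice (term twice in 20 terms).
avg_len = (10 + 20) / 2

for b in (0.0, 0.75, 1.0):
    single = bm25_tf(1, 10, avg_len, b=b)
    doubled = bm25_tf(2, 20, avg_len, b=b)
    print(f"b={b}: single={single:.3f} doubled={doubled:.3f}")
```

In this sketch the doubled document never scores twice the single one, and with b=1 the two scores coincide, which is the "similar weight" behaviour described above.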

Xapian: Xapian General Mailing List

Re: Exact matches

I would suggest that you take the artist name, normalise it (squash punctuation and whitespace to a single space or nothing, and casefold; perhaps drop a leading "the" or trailing ", the") and add this as a prefixed term. Then do the same to the query string and postprocess the parsed query by adding this term (probably with OP_OR, so that punctuation normalisation works even if the parsed query doesn't match). So in this case, the artist term would be "XARTISTduran duran" or perhaps "XARTISTduranduran". Cheers, Olly
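The normalisation step might look something like the sketch below. It is plain Python with no Xapian calls; the function name artist_term and its parameters are hypothetical, and the exact rules (what counts as punctuation, when to drop "the") are assumptions you would tune for your data.

```python
import re


def artist_term(name, prefix="XARTIST", joiner=" "):
    """Build a prefixed exact-match term from an artist name.

    joiner=" " gives "XARTISTduran duran"; joiner="" gives "XARTISTduranduran".
    """
    text = name.casefold()
    # Squash every run of punctuation and whitespace to a single space.
    text = re.sub(r"[\W_]+", " ", text).strip()
    words = text.split()
    # Drop a leading "the" or a trailing ", the" (the comma is already gone),
    # but never reduce the name to nothing.
    if len(words) > 1 and words[0] == "the":
        words = words[1:]
    elif len(words) > 1 and words[-1] == "the":
        words = words[:-1]
    return prefix + joiner.join(words)


print(artist_term("Duran Duran"))
print(artist_term("Beatles, The"))
```

The same function would be applied to the artist name at index time and to the raw query string at search time, so that both sides produce an identical term.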

Xapian: Xapian General Mailing List

Re: OT: index compression

Thanks for the link. There are some interesting ideas there. One issue they don't consider though (which is important for us) is being able to efficiently splice new entries into existing posting lists (they don't look at indexing speed at all saying it's a "one-time operation" so it seems they're only thinking about the non-incremental case). That's the big benefit of the encoding we currently use for posting lists (which for flint is what they call "vbyte"). Cheers, Olly
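For readers unfamiliar with "vbyte", here is a minimal sketch of generic variable-byte coding applied to document-id gaps. It is not flint's actual on-disk format: the byte order, the use of the high bit as a continuation flag, and all function names are assumptions made for illustration.

```python
def vbyte_encode(n):
    """Encode a non-negative int using 7 data bits per byte, low group first."""
    if n < 0:
        raise ValueError("vbyte encodes non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def vbyte_decode_all(data):
    """Decode a byte string holding consecutive vbyte-coded integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Encode an ascending list of document ids as vbyte-coded gaps."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the document ids by summing the decoded gaps."""
    ids, total = [], 0
    for gap in vbyte_decode_all(data):
        total += gap
        ids.append(total)
    return ids


postings = encode_postings([3, 7, 300, 70000])
print(decode_postings(postings))
```

Because each gap is coded independently of the others, adding a new entry only touches the bytes around the insertion point: in this sketch, appending a document id means appending one more encoded gap, with no need to re-encode the rest of the list.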

Xapian: Xapian General Mailing List

Re: Improving indexing speed

Couple of detail questions:

  • what processor?
  • what OS?
  • how many spindles behind the FS volume?
  • what hard disks?

All hard data is good data, but obviously it's even better if there's context as well -- apologies if you've already given any of these details, but I didn't notice them recently in the thread. (By the way, slamming the disks during indexing is what you want to do unless you're also searching off the same database. A breakdown of the type of CPU usage will help analysis here -- iowait versus sys/user will tell you when you're starting to become IO bound. 4-6 processes to max out your storage is pretty good :-) J

Xapian: Xapian General Mailing List

Re: Help with weights

I assume you're using scriptindex, and are turning that scriptindex input field into a term (probably a boolean term). Are you applying the termcount *solely* to the type field? That won't do much here, and other emergent properties are giving you the result you see. If you're bumping the termcount to 100 for all terms you generate for a Xapian document with type=album, that's a different matter. From your description, I don't think you are doing - can you confirm one way or the other? J

Xapian: Xapian General Mailing List

Re: Help with weights

Are you adding this type term to queries? If not, the effect of indexing the type term with those termcounts will be to increase the document length of albums. That will tend to decrease the importance of each occurrence of "love" in the album title, so albums will indeed tend to rank lower. Perhaps a better approach would be to keep the type term with wdf 1 regardless of the type, and then take your query and adjust it like so:

    Xapian::Query album_boost("XTYPEalbum");
    album_boost = Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, album_boost, 4.2);
    query = Xapian::Query(Xapian::Query::OP_AND_MAYBE, query, album_boost);

You can adjust the 4.2 factor to alter how much albums are boosted, and you can also search "fairly", or boost individual tracks instead if you prefer - and none of this requires a reindex. Cheers, Olly
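To see what that combination does to the ranking, here is a toy model in plain Python. It does not call Xapian; it only mimics the documented behaviour of the two operators (OP_SCALE_WEIGHT multiplies a subquery's weight, OP_AND_MAYBE keeps the left side's matches and adds the right side's weight where it also matches). The document names and weights are made up for illustration.

```python
def scale_weight(scores, factor):
    """Toy OP_SCALE_WEIGHT: multiply every weight of a subquery by factor."""
    return {doc: weight * factor for doc, weight in scores.items()}


def and_maybe(left, right):
    """Toy OP_AND_MAYBE: only documents matching `left` are returned;
    a document that also matches `right` gets that weight added."""
    return {doc: weight + right.get(doc, 0.0) for doc, weight in left.items()}


# Hypothetical weights for a title search on "love".
title_query = {"album_1": 1.0, "single_1": 1.1, "single_2": 1.0}
# Hypothetical weights for the XTYPEalbum term; album_2 has no "love" in its title.
album_boost = {"album_1": 0.5, "album_2": 0.5}

combined = and_maybe(title_query, scale_weight(album_boost, 4.2))
ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
for doc, weight in ranked:
    print(doc, round(weight, 2))
```

In this model the album that matches the title query moves to the top, while an album that does not match the title query is not pulled into the results by the boost alone.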

Xapian: Xapian General Mailing List

Help with weights

Hi! Every time I think I've got the Xapian search for MusicBrainz licked, I ask for more feedback and my community finds yet another test case that throws a monkey-wrench into my project. And the more I try to understand Xapian's weighting system, the less I really understand it. Let me ask a specific question -- in my release index (an index of CD titles, essentially) I have a field called type. When the value of this field is "album" I give it a termcount of 100. All other values for this field and all other fields get a termcount of 1. For the enquire, I use a stock object. I do not define a weighting system, and do not tinker with doc order or sort order. When I search for the term "love" in the release title (a very common term), the top hits are the ones that contain the word "love" twice. Good. But, for all the hits that have the word "love" in them once, I would expect to see the releases of type "album" to be near the top. But they are not: http://musicbrainz.homeip.net/search/tex

Xapian: Xapian General Mailing List

Re: Improving indexing speed

On Jun 30, 2008, at 11:57 PM, Olly Betts wrote:

Ok, I've tinkered with the setup a bit. I've found that if I give xapian loads and loads of RAM, it doesn't even get around to using all the RAM I give it -- at most each process used 5% of 8G of RAM. I measured disk access with:

    iostat -x 10

(10 second disk usage average window) And CPU util with top. I've found:

  • 3 processes: 95% - 96% CPU usage for each process, 40%-60% disk usage
  • 4 processes: 95% - 96% CPU usage for each process, 60%-90% disk usage
  • 5 processes: 92% - 94% CPU usage for each process, 80%-100% disk usage
  • 6 processes: 91% - 93% CPU usage for each process, 100% disk usage sustained

It looks like 4 processes is the sweet spot that doesn't utterly slam the machine. This is much better than I had anticipated -- well done Xapian team! -- --ruaok Somewhere in Texas a village is *still* missing its idiot. Robert Kaye -- robeorbit.net -- http://mayhem-chaos.net

Xapian: Xapian General Mailing List
