Xapian General Mailing List

Monday, September 01, 2008

Making SORTAFTER useful in omega?

Hello List,
Our users keep asking for some more "logical" sorting of search results. Now the results are sorted on relevance, i.e. the raw weight, by default. But since the users only see the percentage, that results in a seemingly random secondary sorting.
According to the docs and earlier mails, omega has the 'SORTAFTER' (and docid sorting) functionality to allow date-based secondary sorting. But according to later mails and the documentation that's only useful if you don't use the default BM25-weighting. Unfortunately you can't alter the weighting scheme via Omega-calls. Nor does it seem to help to simply patch query.cc to use TradWeight rather than BM25.
Since we've built our set-up around omega, we'd rather not have to build something similar or patch omega just because it's missing a small but important feature. Is it somehow possible, within Omega, to make the newer results sort on top among results that appear similarly relevant?
Best regards,
Arjen
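
For illustration only, the ordering being asked for can be pictured as a two-level sort: first by the percentage the user actually sees, then by date with the newest first. The sketch below is plain post-processing in Python and not an Omega feature; the hit tuples and sample data are made up for the example.

```python
from datetime import date

# Hypothetical hits: (docid, displayed percentage, document date).
hits = [
    (11, 87, date(2008, 3, 14)),
    (12, 87, date(2008, 8, 30)),
    (13, 92, date(2007, 1, 2)),
    (14, 87, date(2008, 5, 1)),
]

def sort_by_percent_then_date(hits):
    """Order by displayed percentage (high first), then by date (newest first)."""
    return sorted(hits, key=lambda h: (-h[1], -h[2].toordinal()))

ordered = sort_by_percent_then_date(hits)
for docid, percent, when in ordered:
    print(docid, percent, when.isoformat())
```

Sorting on the rounded percentage rather than the raw weight is what makes the date act as a visible tie-breaker.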

Monday, September 01, 2008

Re: Makefile:379: .deps/xapian_wrap.Plo: No such file or directory

[...]
The way the system works, these files get used even if you aren't building those bindings. I can't remember details, but I'm pretty sure that you'll need to fix this before proceeding. tcl8/Makefile.in should be in the archive you downloaded, or if you're using a SVN snapshot it should be created by bootstrapping. Can you double-check where you got the source from?
J

Monday, September 01, 2008

Re: PHP error in index_text()

Is it definitely 1 and not "1"? That would trigger this, because the bindings (I believe) are fairly type strict here. (In general they have to be in order to get overloaded function calls to work sanely.)
Also try just dropping the $weight parameter, to ensure that $text is really a string and not something that compares == to "" in PHP, such as NULL or 0.
J
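
As an analogy for why 1 and "1" can behave differently under overloading, here is a toy dispatcher in Python that selects an overload by exact argument types. It is not the generated binding code; the overload table and the dispatch function are invented for the example, and only the error wording is borrowed from the report.

```python
# Toy model of type-strict overload resolution; this is NOT the generated
# binding code, only an analogy for why "1" and 1 are treated differently.
OVERLOADS = {
    (str,): "index_text(text)",
    (str, int): "index_text(text, weight)",
    (str, int, str): "index_text(text, weight, prefix)",
}

def dispatch(*args):
    """Pick an overload by exact argument types, or fail if none matches."""
    signature = tuple(type(a) for a in args)
    try:
        return OVERLOADS[signature]
    except KeyError:
        raise TypeError("No matching function for overloaded 'index_text'") from None

print(dispatch("some text", 1))      # an int weight matches an overload
try:
    dispatch("some text", "1")       # a string weight matches none
except TypeError as err:
    print(err)
```

In this model a NULL-like first argument fails in the same way as a string weight, because neither fits any listed signature.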

Monday, September 01, 2008

Makefile:379: .deps/xapian_wrap.Plo: No such file or directory

Briefly: when I try to build xapian-bindings the make command fails with: "Makefile:379: .deps/xapian_wrap.Plo: No such file or directory"
When I run configure these are the last few lines:
I tried running "./configure" and "./configure --without-java --without-tcl --without-tcl8 --without-csharp --without-perl --with-python" with no difference. Note the tcl msg; I'm not sure whether it's spurious or not.

Monday, September 01, 2008

Re: using xapian for indexing mails [SOLVED]

n-gram analysis works pretty well.
In a nutshell it works like this:
Step 1. Training: Produce n-grams from sample texts in various languages, and keep the most popular N n-grams for each language, where N is sufficiently large.
Step 2. Analysis: Compare the n-grams from the text in the unknown language against the n-gram samples for each language and count the matches. The language with the most matches is probably the language of that text.
See:
http://www.rubyinside.com/whatlanguage-ruby-language-detection-library-1085.html
http://code.activestate.com/recipes/326576/
Regards,
Rusty -- Rusty Conover InfoGears Inc. / www.GearBuyer.com / www.FootwearBuyer.com http://www.infogears.com
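
The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the code behind either of the linked libraries: it assumes character trigrams, a profile size of 300, and two tiny made-up training samples, whereas real use would need far more text per language.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return all character n-grams of the lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples, top_n=300, n=3):
    """Step 1: keep the top_n most frequent n-grams per language."""
    profiles = {}
    for lang, sample in samples.items():
        counts = Counter(ngrams(sample, n))
        profiles[lang] = {gram for gram, _ in counts.most_common(top_n)}
    return profiles

def guess_language(text, profiles, n=3):
    """Step 2: the language whose profile shares the most n-grams wins."""
    grams = ngrams(text, n)
    scores = {lang: sum(g in profile for g in grams)
              for lang, profile in profiles.items()}
    return max(scores, key=scores.get)

# Tiny made-up training samples (an assumption for the example).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "nl": "de snelle bruine vos springt over de luie hond en dan slaapt de hond",
}
profiles = train(samples)
print(guess_language("the dog jumps over the fox", profiles))
print(guess_language("de hond springt over de vos", profiles))
```

The guessed language code could then be used to pick a stemmer per message, trading some accuracy for speed.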

Sunday, August 31, 2008

PHP error in index_text()

Hi there,
I'm using the Debian php5-xapian package version 1.0.7-3~bpo40+1 and I'm getting this error:
Fatal error: No matching function for overloaded 'TermGenerator_index_text' in /usr/share/php/xapian.php on line 1498
The code around it (in xapian.php) is this:
function index_text($text,$weight=1,$prefix=null) {
    switch (func_num_args()) {
    case 1: case 2: TermGenerator_index_text($this->_cPtr,$text,$weight); break;
    default: TermGenerator_index_text($this->_cPtr,$text,$weight,$prefix);
    }
}
Line 1498 is the one with "case 1: case 2:".
The call to index_text() in my PHP script is made (in this case) with 2 parameters:
- one empty string
- 1
I understand that it's wrong to send an empty string as the first parameter if that's supposed to be the text we want to index, but is it necessary to trigger a fatal error? (The message is not very explicit either...)
Cheers,
Yannick

Sunday, August 31, 2008

Re: using xapian for indexing mails [SOLVED]

On Sat, 30 Aug 2008, djcb wrote:
Thanks all for the quick replies!
Matthew Somerville (mysociety.org) wrote:
Ah, that did the trick, great! I now integrated Xapian with my code, and it seems to work nicely. I'll take a look at some of the other indexers that were mentioned.
I noticed that the stemming is language-specific (understandably); is there some recommended way to guess the language of a blob of text? For me, speed is more important than 100% accuracy (which would be hard anyway, and consider multi-language text etc...)
BTW, my little maildir indexer/searcher 'mu': http://www.djcbsoftware.nl/code/mu/
Version 0.1 does not have Xapian-search yet, but 0.2 will :-)
Best wishes, Dirk.

Saturday, August 30, 2008

Re: using xapian for indexing mails

I'd also recommend looking at the GMANE indexer (it is targeted at mboxes though). The Debian list archives use a derivative of that (together with the omega search engine). The Debian code should also be available somewhere [1].
Kind regards
T.
1. http://people.debian.org/~tviehmann/list-search/ but if that's at all interesting, I'll put up the current stuff, too.

Saturday, August 30, 2008

Re: using xapian for indexing mails

Since you're looking at indexing email, you may like to take a look at my (unreleased) proof-of-concept email search in Python:
It's based around mboxes, not maildirs, so you'd need to make some changes if you wanted to use it. However it's probably more useful in giving a possible way of laying out your document data and term prefixes. All under GPL.
J

Saturday, August 30, 2008

Re: using xapian for indexing mails

You want XapianTermGenerator, which takes a blob of text and adds all the words in it to Xapian. e.g. (snippet of the written-in-PHP http://sandwich.ukcod.org.uk/~matthew/subtitles/?source=1#indexer ):
$indexer = new XapianTermGenerator();
$indexer->set_flags(128);
$indexer->set_database($db); # For spelling
[... then for each document ... ]
$doc = new XapianDocument();
$indexer->set_document($doc);
$doc->set_data( [...] );
$doc->add_term( [...] );
$doc->add_value( [...] );
$indexer->index_text($text);
$db->add_document($doc);
ATB, Matthew

Saturday, August 30, 2008

using xapian for indexing mails

Dear Xapian,
I am writing a little tool for indexing/searching email messages in maildirs.
For indexing the message bodies, Xapian looks like an interesting option, but I have some newbie questions. What I would *like* to do is to be able to add the email bodies to the Xapian database, and then be able to search for some words.
I am looking at the Quickstart (http://xapian.org/docs/quickstart.html), and it seems I have to create a Xapian::Document instance, then (1) add document data with set_data and (2) add some search terms with add_posting.
I could use the message path as the document data, but what about the search terms? Should I split my body text into words, and add every single one of them as a search term? That does not sound very attractive... It seems that 'recoll' (which uses Xapian) is doing that though.
Or is there some easier way to simply provide blobs of text, and be able to search for them later? I have the feeling I am misunderstanding something....
Hope someone can give me some h

Friday, August 29, 2008

Get matched fields

Hey all,
I'm building a user search and want to display the fields matched by the given expression. I'm searching through 2 prefix-defined fields (user_interessts, XUSER_INTERESSTS and user_food, XUSER_FOOD). Here is my query:
user_interessts:(disco) OR user_food:(pizza)
I want to display the matched field on the search result page:
user: macguyver
matched fields: interessts
Any suggestions?
regards, sven
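
One possible approach, sketched under assumptions: if the list of terms that matched a given hit is already available, the matched fields can be derived by checking which term prefix each matched term starts with. The prefixes below are the ones named in the message; the function name and the sample term are hypothetical.

```python
# Prefix -> field label, using the prefixes named in the message.
PREFIX_TO_FIELD = {
    "XUSER_INTERESSTS": "interessts",
    "XUSER_FOOD": "food",
}

def matched_fields(matching_terms):
    """Return the labels of fields whose prefix starts any matched term."""
    fields = []
    # Check longer prefixes first so one prefix cannot shadow another.
    prefixes = sorted(PREFIX_TO_FIELD, key=len, reverse=True)
    for term in matching_terms:
        for prefix in prefixes:
            if term.startswith(prefix):
                field = PREFIX_TO_FIELD[prefix]
                if field not in fields:
                    fields.append(field)
                break
    return fields

# Hypothetical matched terms for the hit "macguyver".
print(matched_fields(["XUSER_INTERESSTSdisco"]))
```

Terms without a known prefix are simply ignored, so unprefixed free-text matches would not show up as a field.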

Thursday, August 28, 2008

Re: Best way of showing what matched the search

Hi,
But I still need to identify which "smaller documents" match the search. That's the reason I added the documents twice: all together and separately.
Also, I would love to know which terms matched the documents I searched (to avoid having to search again inside the document).
Is there any way to get which terms from that query matched a returned document efficiently?
Thanks!
. A .
On Wed, 2008-08-20 at 00:18 +0100, Olly Betts wrote:
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss

Thursday, August 28, 2008

Re: Best way of showing what matched the search

Or perhaps I could instead somehow put some info on the terms of the "virtual bigger document" indicating the origin. Could that be done?
On Wed, 2008-08-20 at 00:18 +0100, Olly Betts wrote:

Thursday, August 28, 2008

Re: Best way of showing what matched the search

This should be pretty efficient:
Xapian::Enquire::get_matching_terms_begin()
If you're finding that's too slow, a testcase would be useful.
Cheers, Olly

Thursday, August 28, 2008

Re: n-gram / cjk serializer

The guy in question is Yung-Chung Lin.
I am using a slightly modified version of his CJKV tokenizer in Pinot to pre-process queries before feeding them to the QueryParser. I chose this route because I didn't want to implement my own query parser and wanted something that works with "mixed" queries.
Look for the QueryModifier class here: http://svn.berlios.de/wsvn/pinot/trunk/IndexSearch/Xapian/XapianEngine.cpp
The CJKVTokenizer class is here: http://svn.berlios.de/wsvn/dijon/trunk/cjkv/CJKVTokenizer.cc
For instance, the query "你身体好吗 title:妈妈" will become this : (你 你身 身 身体 体 体好 好 好吗 吗) title:妈 title:妈妈 title:妈
Altogether it seems to work quite well. Of course, any bug is mine not Yung-Chung's :-)
I hope this helps.
Fabrice
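
The query rewriting shown in the example above can be approximated with a short Python sketch. This is not the Pinot or CJKVTokenizer code: it assumes that only characters in the CJK Unified Ideographs block count as CJK, that words are separated by spaces, and that each run is expanded into interleaved unigrams and bigrams. All function names are invented for the example.

```python
def is_cjk(ch):
    """Rough check: CJK Unified Ideographs block only (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"

def cjk_ngrams(run):
    """Interleave unigrams and bigrams for one run of CJK characters."""
    tokens = []
    for i, ch in enumerate(run):
        tokens.append(ch)
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])
    return tokens

def rewrite_query(query):
    """Expand CJK words in a query, keeping any 'field:' prefix on each token."""
    parts = []
    for word in query.split():
        field, sep, text = word.rpartition(":")
        if text and all(is_cjk(ch) for ch in text):
            tokens = [field + sep + tok for tok in cjk_ngrams(text)]
            # Group unprefixed expansions in parentheses, as in the example.
            parts.append(" ".join(tokens) if sep else "(" + " ".join(tokens) + ")")
        else:
            parts.append(word)
    return " ".join(parts)

print(rewrite_query("你身体好吗 title:妈妈"))
```

Non-CJK words pass through unchanged, which is what lets "mixed" queries still reach the query parser in their original form.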

Thursday, August 28, 2008

wildcards and prefix searches

Hey all,
I'm using Xapian through acts_as_xapian and would like to know why trailing wildcards do not work on prefix searches.
Specifically, I have a Merchant model with a Description field, mapped to the prefix "desc".
If I then search for "desc:goods" I get 4 matches, but if I search for "desc:goo*" I get 0.
However, when searching normal, non-prefixed fields, "goo*" returns at least as many matches as "goods".
Anyone know how I can configure Xapian to support trailing wildcards on prefixes?
thanks, donncha
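
One way to picture a trailing wildcard on a prefixed field is as an expansion against the terms actually present in the index, with the field's term prefix prepended to the stem. The sketch below is a toy model of that idea, not Xapian's or acts_as_xapian's code. The term prefix "XDESC" and the term list are assumptions; the thread does not say which term prefix "desc" maps to.

```python
# Hypothetical term list from an index; "XDESC" is an assumed term prefix
# for the user-facing field name "desc".
INDEX_TERMS = ["XDESCgood", "XDESCgoods", "XDESCgreat", "goods", "goose"]
FIELD_PREFIXES = {"desc": "XDESC"}

def expand_wildcard(query, index_terms, field_prefixes):
    """Expand 'field:stem*' or 'stem*' into the index terms it would match."""
    field, sep, pattern = query.rpartition(":")
    if not pattern.endswith("*"):
        raise ValueError("expected a trailing wildcard")
    term_prefix = field_prefixes[field] if sep else ""
    start = term_prefix + pattern[:-1]
    matches = [t for t in index_terms if t.startswith(start)]
    if not sep:
        # Unprefixed queries should not pick up prefixed (uppercase-led) terms.
        matches = [t for t in matches if not t[:1].isupper()]
    return sorted(matches)

print(expand_wildcard("desc:goo*", INDEX_TERMS, FIELD_PREFIXES))
print(expand_wildcard("goo*", INDEX_TERMS, FIELD_PREFIXES))
```

In this model a prefixed wildcard returns nothing whenever the expansion is done without the term prefix, or without access to the term list at all.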

Thursday, August 28, 2008

How to speed up indexing ?

I'm new to Xapian & need some help, many thanks if anyone replies.
I did a release build from xapian-core-1.0.7 with VS2008 by using Charlie Hull's makefiles.
I'm trying to test-index my dataset -- some 200'000 docs, each document being (on average) 50 bytes long and having 6 words.
I tried (a) not to use stemmer, (b) commit_transaction() on every 50/100/etc. docs, (c) not to use transactions at all -- but in all scenarios indexing goes at ~10 doc/sec or 500 bytes per second.
This should probably be ~400 times faster, I'm clearly doing something wrong. Can anyone give me a hint or direct me to a source on the net to do some reading?
Regards Celto

Thursday, August 28, 2008

Re: How to speed up indexing ?

If you could let us know the platform you're using, and how you're accessing Xapian (which bindings for example, or directly using C/C++?), and even post the code you're using for your indexer, that would help hugely.
Cheers
Charlie

Thursday, August 28, 2008

Re: Patches

It seems worth investigating, though if a lot of files are changing in a short length of time it might struggle to keep up if you only reindex them individually rather than in batches.
An alternative would be to have a "keep scanning" mode for omindex where it uses inotify or similar to monitor indexed files/directories for changes.
Also, how portable is inotify-tools? It seems inotify is Linux-specific and requires Linux >= 2.6.13. So if inotify-tools is inotify-specific, it might be better to use a library which wraps similar features available on other platforms.
No, but it wouldn't be hard to add that.
Cheers, Olly
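
The batching concern raised above can be illustrated with a small sketch: change notifications are collected in a set and handed to the indexer in groups rather than one file at a time. The ChangeBatcher class, its parameters, and the sample paths are hypothetical; no inotify calls are made here, and a real monitor would also flush on a timer.

```python
class ChangeBatcher:
    """Collect changed paths and hand them to the indexer in batches."""

    def __init__(self, reindex, max_batch=100):
        self.reindex = reindex        # callable taking a sorted list of paths
        self.max_batch = max_batch
        self.pending = set()          # a set, so repeated events collapse

    def file_changed(self, path):
        """Record one change event; flush once enough paths are pending."""
        self.pending.add(path)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        """Pass all pending paths to the indexer as a single batch."""
        if self.pending:
            batch, self.pending = sorted(self.pending), set()
            self.reindex(batch)

batches = []
batcher = ChangeBatcher(batches.append, max_batch=3)
for path in ["a.txt", "b.txt", "a.txt", "c.txt", "d.txt"]:
    batcher.file_changed(path)
batcher.flush()   # in practice a timer would also trigger this
print(batches)
```

Because pending paths are held in a set, a file that changes several times in quick succession is reindexed only once per batch.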