Content Tagged with webir + open-source

Jeff's Search Engine Caffè: Current Open Source Search Engine Libraries

"Here is my short list of the most important open source [free] information retrieval libraries being used today that are undergoing active development as of writing."

open-source: del.icio.us tag/open-source

COSIN - WP5 - index

"The main topic of the COSIN project is to develop a series of theoretical, graphical, analytical and computational tools to describe the complex behaviour of networks."

Grub's Distributed Web Crawling Project

"Grub started back in 2000 with a simple concept of distributing part of the search process pipeline: crawling."

WebLA :: Web Linkage Analysis

"WebLA is a Java package for handling Web Graphs, implementing popular algorithms such as PageRank, HITS, CoCitation Similarity and SimRank. It is of particular interest for research in Information Retrieval, [...]"
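
For readers unfamiliar with the first algorithm named in that description, here is a minimal Python sketch of PageRank by power iteration. It is not WebLA's API or implementation (WebLA is a Java package and its interfaces are not shown here); the function name `pagerank`, the adjacency-dict input format, and the default damping factor of 0.85 are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores for a small link graph.

    `graph` maps each page to the list of pages it links to.
    This is an illustrative sketch, not WebLA's implementation.
    """
    # Collect every page, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        # Each page passes an equal share of its rank to every page it links to.
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(links).items()):
        print(page, round(score, 4))
```

The scores form a probability distribution over pages, so they sum to 1; a page with no in-links (here `d`) ends up with only the baseline share `(1 - damping) / n`.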

Swish-e :: Home Page

"Swish-e is a fast, flexible, and free open source system for indexing collections of Web pages or other files. Swish-e is ideally suited for collections of a million documents or smaller."

TCatNG Toolkit :: Text Categorization via N-Grams

"The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files. [Namely] categorizing documents by topic, detecting the author of a text, or recognizing the language [...]"
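
To make the idea of N-gram categorization concrete, the sketch below shows one well-known variant: ranking a text's most frequent character n-grams and comparing that ranking against per-category profiles with an "out-of-place" distance (the approach described by Cavnar and Trenkle). It is an assumption that this resembles what the toolkit does; the function names, the trigram size, the profile length of 300, and the tiny training samples are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Map the `top` most frequent character n-grams to their rank (0 = most frequent)."""
    # Normalise case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top])}


def out_of_place(doc_profile, category_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the category cost `max_penalty`."""
    return sum(
        abs(rank - category_profile[gram]) if gram in category_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, category_texts, n=3, top=300):
    """Return the label whose profile is closest to the profile of `text`."""
    doc_profile = ngram_profile(text, n, top)
    scores = {
        label: out_of_place(doc_profile, ngram_profile(sample, n, top), top)
        for label, sample in category_texts.items()
    }
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy training samples; real profiles need far more text.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat in the house",
        "spanish": "el veloz zorro salta sobre el perro perezoso y el gato duerme en la casa",
    }
    print(classify("the cat and the dog", samples))
    print(classify("el gato en la casa", samples))
```

The same profile-and-distance scheme applies to topic or authorship categorization by swapping the training samples; only the labels and the text behind each profile change.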

Focused crawler - Combine System Homepage

"Combine is an open system for crawling [harvesting and threshing (indexing)] Internet resources. It can be used both as a general and focused crawler."

Heritrix - Home Page

"Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project."

WIRE (Web Information Retrieval Environment)::Center for Web Research

"The WIRE project is an effort started by the Center for Web Research for creating an application for information retrieval, designed to be used on the Web."

Welcome to Nutch!

"Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc."
