The extract method is basically done. I’m sure it could be improved a bit more, but it seems to be fairly effective. I added a few extra features beyond the original URI class’s capabilities, such as supplying a base URI to resolve relative URIs against. You can also have it return the parsed URIs instead of the strings, at no extra processing cost, since it has to parse each URI internally anyway. I tried it out on Sam Ruby’s feed (as you may have noticed, currently my favorite chunk of text to try just about everything out on) and it seems to have gone OK:
>> GentleCMS::URI.extract(text,
:base => "http://www.intertwingly.net/blog/index.atom")
=> ["http://www.w3.org/2005/Atom",
"http://purl.org/syndication/thread/1.0",
"http://www.intertwingly.net/blog/index.atom",
"http://www.intertwingly.net/blog/index.atom",
"tag:intertwingly.net,2004:2340",
"http://www.w3.org/1999/xhtml",
"http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
"http://www.w3.org/1999/xhtml",
"http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
"http://www.w3.org/2000/svg",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.link",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.3.1.1",
"http://www.w3.org/TR/2001/REC-xmlbase-20010627/",
"http://www.bloglines.com/preview?siteid=235142",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.author",
"http://www.bloglines.com/preview?siteid=235142",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.source",
"http://www.bloglines.com/preview?siteid=5319444",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.updated",
"http://www.bloglines.com/preview?siteid=2375595",
"http://www.bloglines.com/preview?siteid=50",
"http://www.bloglines.com/preview?siteid=2438392",
"http://weblog.philringnalda.com/2005/12/18/who-knows-a-title-from-a-hole-in-the-ground",
"http://www.niallkennedy.com/blog/archives/2006/07/google-sitemaps-2.html",
"http://www.stephenduncanjr.com/2006/06/atom-10-and-blogger.shtml",
"tag:intertwingly.net,2004:2339",
"http://www.w3.org/1999/xhtml",
"http://www.1060.org/blogxter/entry?publicid=8A0DC194929914711F1C0470FFDB7B73",
"http://www.intertwingly.net/slides/2005/xmlconf/",
"http://www.intertwingly.net/slides/2005/etcon/",
"tag:intertwingly.net,2004:2338",
"http://www.w3.org/1999/xhtml",
"http://www.w3.org/2000/svg",
"http://en.wikipedia.org/wiki/Penrose_tiling",
"http://intertwingly.net/stories/2006/07/06/penroseTiling.svg",
"tag:intertwingly.net,2004:2337",
"http://www.w3.org/1999/xhtml",
"http://www.w3.org/2000/svg",
"http://www.unto.net/unto/work/on-rss-and-atom/",
"http://www.unto.net/unto/opensearch/more-on-rss-and-atom/",
"tag:intertwingly.net,2004:2336",
"http://www.w3.org/1999/xhtml",
"http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c",
"http://www.intertwingly.net/blog/",
"http://www.intertwingly.net/blog/2006/07/08/Bloglines-Edge-Cases",
"http://www.intertwingly.net/blog/2340.atom",
"http://www.intertwingly.net/blog/2006/07/06/Blame-Somebody",
"http://www.intertwingly.net/blog/2339.atom",
"http://www.intertwingly.net/blog/2006/07/06/Penrose-Tiling",
"http://www.intertwingly.net/blog/2338.atom",
"http://www.intertwingly.net/blog/2006/07/04/Just-a-Technical-Detail",
"http://www.intertwingly.net/blog/2337.atom",
"http://www.intertwingly.net/blog/2006/07/04/Clean-utf-8-for-XML",
"http://www.intertwingly.net/blog/2336.atom"]
The original’s output:
>> URI.extract(text)
=> ["http://www.w3.org/2005/Atom",
"xmlns:thr=",
"http://purl.org/syndication/thread/1.0",
"http://www.intertwingly.net/blog/index.atom",
"http://www.intertwingly.net/blog/index.atom",
"T20:30:05-04:00",
"tag:intertwingly.net,2004:2340",
"thr:count=",
"thr:when=",
"T20:30:01-04:00",
"http://www.w3.org/1999/xhtml",
"http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
"http://www.w3.org/1999/xhtml",
"http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
"http://www.w3.org/2000/svg",
"float:right",
"out:",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.link",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.3.1.1",
"http://www.w3.org/TR/2001/REC-xmlbase-20010627/",
"xml:base",
"http://www.bloglines.com/preview?siteid=235142",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.author",
"http://www.bloglines.com/preview?siteid=235142",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.source",
"http://www.bloglines.com/preview?siteid=5319444",
"http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.updated",
"http://www.bloglines.com/preview?siteid=2375595",
"http://www.bloglines.com/preview?siteid=50",
"http://www.bloglines.com/preview?siteid=2438392",
"http://weblog.philringnalda.com/2005/12/18/who-knows-a-title-from-a-hole-in-the-ground",
"http://www.niallkennedy.com/blog/archives/2006/07/google-sitemaps-2.html",
"http://www.stephenduncanjr.com/2006/06/atom-10-and-blogger.shtml",
"T18:06:55-04:00",
"tag:intertwingly.net,2004:2339",
"thr:count=",
"thr:when=",
"T12:45:00-04:00",
"http://www.w3.org/1999/xhtml",
"http://www.1060.org/blogxter/entry?publicid=8A0DC194929914711F1C0470FFDB7B73",
"http://www.intertwingly.net/slides/2005/xmlconf/",
"http://www.intertwingly.net/slides/2005/etcon/",
"T21:07:59-04:00",
"tag:intertwingly.net,2004:2338",
"thr:count=",
"thr:when=",
"T19:56:01-04:00",
"http://www.w3.org/1999/xhtml",
"http://www.w3.org/2000/svg'",
"float:right",
"http://en.wikipedia.org/wiki/Penrose_tiling",
"http://intertwingly.net/stories/2006/07/06/penroseTiling.svg",
"T17:55:35-04:00",
"tag:intertwingly.net,2004:2337",
"thr:count=",
"thr:when=",
"T08:45:19-04:00",
"http://www.w3.org/1999/xhtml",
"http://www.w3.org/2000/svg",
"float:right",
"http://www.unto.net/unto/work/on-rss-and-atom/",
"http://www.unto.net/unto/opensearch/more-on-rss-and-atom/",
"T12:15:13-04:00",
"T21:19:04-04:00",
"tag:intertwingly.net,2004:2336",
"thr:count=",
"thr:when=",
"T22:27:59-04:00",
"http://www.w3.org/1999/xhtml",
"http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c",
"T08:59:42-04:00"]
Here are the diffs:
>> (uri_result - gentle_uri_result)
=> ["xmlns:thr=",
"T20:30:05-04:00",
"thr:count=",
"thr:when=",
"T20:30:01-04:00",
"float:right",
"out:",
"xml:base",
"T18:06:55-04:00",
"thr:count=",
"thr:when=",
"T12:45:00-04:00",
"T21:07:59-04:00",
"thr:count=",
"thr:when=",
"T19:56:01-04:00",
"http://www.w3.org/2000/svg'",
"float:right",
"T17:55:35-04:00",
"thr:count=",
"thr:when=",
"T08:45:19-04:00",
"float:right",
"T12:15:13-04:00",
"T21:19:04-04:00",
"thr:count=",
"thr:when=",
"T22:27:59-04:00",
"T08:59:42-04:00"]
>> (gentle_uri_result - uri_result)
=> [".",
"2006/07/08/Bloglines-Edge-Cases",
"2340.atom",
"2006/07/06/Blame-Somebody",
"2339.atom",
"2006/07/06/Penrose-Tiling",
"2338.atom",
"2006/07/04/Just-a-Technical-Detail",
"2337.atom",
"2006/07/04/Clean-utf-8-for-XML",
"2336.atom"]
The extract code was designed to work especially well with SGMLish text and Textile-formatted text. The regular expressions should work just as well with BBCode and Markdown, though I haven’t tried either.
I do admit that I totally cheated and threw out basically all of those false positives specifically for this example, but I’ll probably also be expanding the rejection list as time goes on, since it’s a fairly lightweight check. Good enough for my purposes anyhow.
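For anyone curious what a rejection list like that could look like, here is a minimal sketch. It is not the actual GentleCMS code: `REJECT_PATTERNS` and `filter_candidates` are made-up names, the patterns are only guessed from the false positives in the diff above, and the base resolution simply leans on the standard library's `URI.join`. It assumes Ruby 2.7 or later for `filter_map`.

```ruby
require "uri"

# Hypothetical rejection list: strings that look like "scheme:rest"
# but are not URIs worth keeping. Patterns are illustrative only.
REJECT_PATTERNS = [
  /\AT\d{2}:\d{2}:\d{2}/,   # time half of an ISO 8601 timestamp
  /\A(?:xmlns|xml|thr):/,   # XML namespace prefixes and attributes
  /\A(?:float|out):/        # CSS declarations and stray words
].freeze

# Takes already-extracted candidate strings, drops the rejected ones,
# and optionally resolves the rest against a base URI.
# With parsed: true it returns URI objects instead of strings.
def filter_candidates(candidates, base: nil, parsed: false)
  candidates.filter_map do |candidate|
    # Cheap check first: skip anything on the rejection list.
    next if REJECT_PATTERNS.any? { |pattern| pattern.match?(candidate) }

    begin
      # URI.join returns absolute URIs unchanged and resolves relative ones.
      uri = base ? URI.join(base, candidate) : URI.parse(candidate)
    rescue URI::InvalidURIError
      next # not parseable as a URI at all, so drop it
    end

    parsed ? uri : uri.to_s
  end
end

candidates = ["http://www.w3.org/2005/Atom", "xmlns:thr=",
              "T20:30:05-04:00", "float:right", "2340.atom"]
p filter_candidates(candidates,
                    base: "http://www.intertwingly.net/blog/index.atom")
# => ["http://www.w3.org/2005/Atom", "http://www.intertwingly.net/blog/2340.atom"]
```

Running the pattern check before parsing keeps it lightweight, since rejected strings never reach the parser. Returning the parsed objects is free in the same sense as above: each candidate has to be parsed anyway to be validated or resolved.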