my favorite

2011/06/18

NodeJS + YUI3: Crawling the Web via CSS Selectors

Story:

YQL is a killer tool for crawling web data: we can fetch remote pages and get JSON/JSONP/XML output via the YQL console.
The magic behind it is XPath, just like crawling the web in the traditional way.
However, for a frontend engineer, "selecting" the desired DOM elements is dead simple with CSS selectors, and that is what inspired me so much in Dav's talks.


YUI3 + NodeJS

For details about Node.js, I strongly recommend reading http://nodejs.org
To me it's really interesting to deal with DOM stuff on the server side. YUI3 (and jsdom) empower NodeJS to fetch pages in an easy (maybe frontend-friendly :P) way.


Demo:

1. Crawl all link titles on Hacker News
http://hapi.nodester.com/api?host=http://news.ycombinator.com&rule=table%20td.title%20a&callback=hackernews

2. Crawl Lady Gaga result information from Yahoo Search
http://hapi.nodester.com/api?host=http://search.yahoo.com/search?p=lady+gaga&fr=sfp&fr2=&iscqry=&rule=%23web%20.res%20.sm-media%20img&callback=ladygaga

3. Crawl iOS/Android apps and get all the app icon images from Yahoo App Search
http://hapi.nodester.com/api?host=http://apps.search.yahoo.com/search?p=plants&fr=apps_sfp&fr2=&iscqry=&rule=%23main%20.app-res%20.left%20img&callback=yappsearch

Each link above returns a JSONP object that you can use in your web page.
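
The demo links all follow the same pattern: a host, a rule (the CSS selector) and a callback. Here is a minimal sketch of how such a link could be put together. The helper name buildCrawlUrl is hypothetical and not part of the prototype; only the endpoint and the three parameter names come from the links above.

```javascript
// Endpoint taken from the demo links in this post.
const API = 'http://hapi.nodester.com/api';

// Hypothetical helper (not part of the prototype): builds a request URL
// from the page to crawl, a CSS selector, and a JSONP callback name.
function buildCrawlUrl(host, rule, callback) {
  // encodeURIComponent turns ' ' into %20 and '#' into %23,
  // which matches how the rules are written in the demo links.
  // The host is passed through unchanged, as in the demos.
  return API + '?host=' + host +
    '&rule=' + encodeURIComponent(rule) +
    '&callback=' + callback;
}

console.log(buildCrawlUrl('http://news.ycombinator.com', 'table td.title a', 'hackernews'));
// http://hapi.nodester.com/api?host=http://news.ycombinator.com&rule=table%20td.title%20a&callback=hackernews
```

With the Hacker News inputs this reproduces demo link 1 exactly, so the selector can be written in its normal CSS form and encoded in one place.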


Future work:

There is still a lot of room to improve this prototype.
For example, some websites need additional HTTP headers such as Referer or Cookie, which are not yet supported.
Moreover, to avoid confusion we have to encode '#' (hash) as %23, since it is a reserved character in URLs (it starts the fragment) and it is also the ID selector in CSS rules.
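
To see why the '#' has to be encoded, here is a small sketch that uses the standard URL class in Node.js to show what a URL parser sees in each case. The inspect helper is hypothetical and only for illustration; the selector and callback name are borrowed from demo 3.

```javascript
// Hypothetical helper: shows what a URL parser sees in a request URL.
function inspect(url) {
  const parsed = new URL(url);
  return {
    rule: parsed.searchParams.get('rule'),
    callback: parsed.searchParams.get('callback'),
    fragment: parsed.hash
  };
}

const base = 'http://hapi.nodester.com/api?host=http://news.ycombinator.com';

// Raw '#': everything after it becomes the fragment,
// so the rule is empty and the callback is lost.
const raw = inspect(base + '&rule=#main .app-res .left img&callback=yappsearch');
console.log(raw.rule === '', raw.callback === null); // true true

// Encoded as %23: the rule and the callback both survive.
const encoded = inspect(base + '&rule=%23main%20.app-res%20.left%20img&callback=yappsearch');
console.log(encoded.rule);     // #main .app-res .left img
console.log(encoded.callback); // yappsearch
```

A browser also never sends the fragment to the server, so with a raw '#' the service would not receive the selector at all.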

References

1. Dav Glass — Using Node.js and YUI 3
http://developer.yahoo.com/yui/theater/video.php?v=glass-node

2. Dav Glass — Node.js + YUI 3 (YUIConf 2010)
http://developer.yahoo.com/yui/theater/video.php?v=yuiconf2010-glass


BIO

Taipei, GuTing, Taiwan

huang47 | personal
