NodeJS + YUI3: Crawling Web via CSS selector


YQL is a killer tool for crawling web data: we can fetch remote pages and get JSON/JSONP/XML output via the YQL console.
The magic behind it is XPath, the same thing we use when crawling the web the traditional way.
However, for a frontend engineer, "selecting" the desired DOM elements with a CSS selector is dead simple, and that is the part of Dav's talk that inspired me the most.

YUI3 + NodeJS

For details about node.js, I strongly recommend reading about it here:
To me it's really interesting to deal with DOM stuff on the server side. YUI3 (and jsdom) empower NodeJS to fetch pages in an easy (maybe frontend-friendly :P) way.


1. Crawl all link titles on Hacker News

2. Crawl the Lady Gaga information title in Yahoo Search

3. Crawl iOS/Android apps and get all app icon images from Yahoo app search

From the links above you can get a JSONP object and use it in your web page.
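As a rough illustration of what "getting a JSONP object" means on the server side, here is a minimal sketch that wraps a crawl result in a callback call. The helper name `toJsonp`, the callback name `handleTitles`, and the sample data are all made up for this example; the prototype's real response format is not shown in this post.

```javascript
// Hypothetical helper (not from the prototype): wrap data in a JSONP body.
function toJsonp(callbackName, data) {
  // Accept only plain (optionally dotted) identifiers so the callback
  // parameter cannot be used to inject arbitrary script.
  if (!/^[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*$/.test(callbackName)) {
    throw new Error("Invalid JSONP callback name: " + callbackName);
  }
  // JSONP is just a function call whose argument is the JSON payload.
  return callbackName + "(" + JSON.stringify(data) + ");";
}

// Sample data standing in for crawled link titles.
const body = toJsonp("handleTitles", { titles: ["First story", "Second story"] });
console.log(body);
```

A page would then define `handleTitles` and load the service URL through a `script` tag, so the browser runs the returned call with the crawled data.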

Future work:

There is still a lot of room to improve this prototype.
For example, websites that need additional HTTP headers such as Referer or Cookie are not yet supported.
Moreover, to avoid confusion we have to encode '#' (the hash sign) as %23, since it starts the fragment part of a URL and is also the ID selector in CSS rules.
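The '#' problem can be sketched like this. The endpoint `http://example.com/crawl` and the parameter names `url` and `selector` are placeholders I picked for illustration, not the prototype's actual interface; `encodeURIComponent` and `URL` are standard JavaScript.

```javascript
// Placeholder endpoint; the real service URL is not given in this post.
const ENDPOINT = "http://example.com/crawl";

// Build a request URL with both the target page and the CSS selector encoded.
// The parameter names "url" and "selector" are assumptions for this sketch.
function buildCrawlUrl(pageUrl, selector) {
  return ENDPOINT +
    "?url=" + encodeURIComponent(pageUrl) +
    "&selector=" + encodeURIComponent(selector);
}

// Without encoding, everything after '#' is parsed as the URL fragment,
// which the browser never sends to the server.
const raw = new URL(ENDPOINT + "?selector=#main a.title");
console.log(raw.search); // the selector value is empty here
console.log(raw.hash);   // the selector ended up in the fragment

// With encoding, '#' becomes %23 and stays inside the query string.
const safe = new URL(buildCrawlUrl("http://news.ycombinator.com/", "#main a.title"));
console.log(safe.searchParams.get("selector"));
```

Encoding the whole selector rather than only the '#' also takes care of spaces and other characters that are common in CSS selectors.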


1. Dav Glass — Using Node.js and YUI 3

2. Dav Glass — Node.js + YUI 3 (YUIConf 2010)

