Web Crawler Program

Java tutorial   Published: 2005-9-9   Source: Gzu521.com

A friend and I recently started looking into how search engines are implemented. Below is a Java-based spider that my friend adapted from JoBo.

PS: feel free to skip the English inside; his English really is poor.

      sosoo 1.0 Web Crawler
--- User Development Manual
Author: Wang Jianhua (rimen/jerry)
Audience: programmers who build custom web spider programs on top of sosoo.
                             Contents
 

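I: Installing sosoo
II: Customizing features
1. Setting the basic parameters
2. Configuring the robot's URL checks
3. Implementing document management
4. Customizing the HTML document download rules
5. Filtering downloaded HTTP documents
6. Enabling runtime monitoring of the robot
7. Enabling monitoring of HTTP protocol handling
III: About sosoo
IV: Application development guide
1. The roboter class, the spider's main class
2. The tasklist interface, which stores the tasks to process
3. The HTTP client implementation
4. Storing or processing web pages
5. Runtime monitoring
V. Example program
VI. References and dependencies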
 


I: Installing sosoo
sosoo ships as a single sosoo-core.jar package with no runnable entry point of its own. To try it out, run the examples under src. Developers can build UI-based, J2EE-based, or other spider programs on top of it.
- Add the sosoo-core jar to your application's classpath.
- sosoo provides a thread class that represents one robot: com.sosoo.robot.spider.roboter.
- Start it from a test class:
   public static void main(String[] args)
    throws Exception
   {
    roboter robby = new roboter();
    robby.setstarturl(new URL("http://10.25.101.173:7001/pa18web/framework/images/framevork_04.gif"));
    robby.setmaxdepth(0);   // maximum link depth to follow
    robby.setsleeptime(0);
    robby.setwalktootherhosts(true);
    robby.run();      // start the spider
   }
- This starts a spider; sosoo then fetches HTML from the network according to the JavaBean parameters you have set.
- Stop the robot:
robby.stoprobot();
II: Customizing features
sosoo implements AOP through callbacks: you inject JavaBean objects as callbacks to plug in external functionality.
1. Setting the basic parameters
Basic parameters are roboter's parameters of primitive (or string) type, such as starturl and maxdepth.

Start URL [starturl]: required. roboter starts crawling the network from this address.
robby.setstarturl(URL url);

Maximum depth from the start path [maxdepth]: the program checks whether the depth of the link being processed exceeds the maximum link depth and, if it does, ignores the link. You can switch this check off through depthiseffect. The default is 1.
robby.setmaxdepth(0);

Interval between documents [sleeptime]: after one URL has been processed, the spider waits this long before processing the next. The unit is seconds and the default is 60; 5 s is recommended.
robby.setsleeptime(0);

HTTP connection timeout: connecting to a server may time out because of the network or the service itself. The timeout caps how long the spider waits on any one URL, which speeds up processing, but if the value is too small many connections will fail. A value above 30 is recommended; the default is 60 s.
robby.seturlconnecttimeout(30);

Stop the robby thread [stopit]: stop the current thread by calling the robby object.
robby.stoprobot();

Pause the robby thread [sleep]: pause the currently running robby thread.
robby.setsleep(true);
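To make the connection timeout above concrete, here is a minimal sketch using the JDK's own HttpURLConnection. It says nothing about how sosoo applies its setting internally; the class name TimeoutSketch and the method open are hypothetical, and the only assumption carried over from the manual is that the value is given in seconds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Illustration only: applies one timeout, given in seconds, to a JDK HTTP connection. */
final class TimeoutSketch {
    private TimeoutSketch() {}

    static HttpURLConnection open(String address, int timeoutSeconds) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setConnectTimeout(timeoutSeconds * 1000); // limit on establishing the connection (ms)
            conn.setReadTimeout(timeoutSeconds * 1000);    // limit on waiting for data (ms)
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The JDK takes both timeouts in milliseconds, hence the multiplication; a value of 0 would mean no timeout at all.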
2. Configuring the robot's URL checks
  The sosoo spider filters the URLs it collects according to your settings and drops those that do not meet the conditions.
  Filters are applied in this order: walktootherhosts, allowwholehost, flexiblehostcheck, allowwholedomain, then the user-supplied URL list.

Visit other hosts [walktootherhosts]: set this to true to crawl the whole Internet rather than only the host in your start URL. The default is false.
robby.setwalktootherhosts(true);

Visit the whole start host [allowwholehost]: set this to restrict the crawl to the host named in the start URL.
robby.setallowwholehost(true);

Hosts that do not start with www [flexiblehostcheck]: if your start URL does not begin with www, set flexiblehostcheck so that the spider can still visit this host.
robby.setflexiblehostcheck(true);

Visit the whole start domain [allowwholedomain]: set this to restrict the crawl to the domain of the start URL.
robby.setallowwholedomain(true);

List of URLs to visit [allowedurls]: a Vector. You can keep these URLs in a configuration file or supply them at run time.
robby.setallowedurls(allowed);

Custom URL check [urlcheck]: besides the rules above, you can implement the urlcheck interface to test URLs yourself. The system provides a regexpurlcheck implementation, which supports properties files.
robby.seturlcheck(check);

Look for /robots.txt in the site root [ignorerobotstxt]: set this to true to ignore the robot rules a site publishes. The default is false.
robby.setignorerobotstxt(true);

URLs that may be visited repeatedly [visitmany]: the system keeps a cache of visited URLs, and the spider skips any URL it finds there. Use this parameter, a Vector, to list the URLs that may be visited more than once.
robby.setvisitmany(visitmany);

Proxy for the spider client [proxy]: the spider can reach the Internet through a proxy, but currently only anonymous proxy servers are supported.
robby.setproxy("10.16.111.5:80");
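The host and domain checks described in this section can be pictured with a small self-contained sketch. It does not use sosoo's classes: HostScope and its methods are hypothetical names, and the matching rules (ignoring a leading "www.", comparing the last two host labels) are assumptions about what such checks typically do, not a description of sosoo's code.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical illustration of host and domain scoping; not sosoo's implementation. */
final class HostScope {
    private HostScope() {}

    /** Same host, compared without regard to case. */
    static boolean sameHost(URI start, URI candidate) {
        String a = start.getHost();
        String b = candidate.getHost();
        return a != null && b != null && a.equalsIgnoreCase(b);
    }

    /** Same host once an optional leading "www." is dropped (assumed meaning of a flexible check). */
    static boolean sameHostFlexible(URI start, URI candidate) {
        String a = stripWww(start.getHost());
        String b = stripWww(candidate.getHost());
        return a != null && a.equals(b);
    }

    /** Same domain: the last two host labels match. A simplification that mishandles e.g. ".com.cn". */
    static boolean sameDomain(URI start, URI candidate) {
        String a = lastTwoLabels(start.getHost());
        String b = lastTwoLabels(candidate.getHost());
        return a != null && a.equals(b);
    }

    private static String stripWww(String host) {
        if (host == null) {
            return null;
        }
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    private static String lastTwoLabels(String host) {
        if (host == null) {
            return null;
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        return n < 2 ? labels[0] : labels[n - 2] + "." + labels[n - 1];
    }
}
```

With a start URL on www.example.com, a link to news.example.com fails both host checks but passes the domain check, which is the distinction between restricting a crawl to a host and restricting it to a domain.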
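3. Implementing document management
When the spider downloads the HTML document for a URL, the system calls back through the httpdocmanager interface. By implementing this interface you decide how the HTML data is stored, for example as a text stream in a database or as files on the file system. The system provides the httpdoctofile implementation, which saves downloaded files to the file system. Inject the manager object when you start the spider with robby.setdocmanager(dm);
4. Customizing the HTML document download rules
  When you need control over which kinds of files are fetched, for example because downloading exe or rar files is slow, you can define download rules to suit your needs by implementing the httpdownloadcheck interface.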
downloadruleset rules=new downloadruleset("downrules.properties");
robby.setdownloadruleset(rules);

The system provides the downloadruleset implementation, which reads the download rules from a properties file on the classpath.
Contents of the file:
# each rule has two required fields: allow or deny, and a MIME type/subtype
# allow: download the document if it matches the conditions
# deny: do not download the document if it matches the conditions
# <size: the document's content size in bytes is smaller than the value
# >size: the document's content size in bytes is larger than the value
# >= and <= are not supported
# the size conditions are optional
allow image/gif  <100000 >10000000
deny image/gif  <100000 >10000000
You can also write your own implementation; it only needs to implement the httpdownloadcheck method boolean downloadallowed(Vector httpheaders);
Note: if a document is not downloaded, the links inside it cannot be processed, so filtering out text/html is generally not recommended.
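As an illustration of the rule format above, the following sketch parses allow/deny lines and decides whether a document may be downloaded. It is not sosoo's downloadruleset: the class DownloadRules, its methods, and two behaviors in particular are assumptions made for the example, namely that the first matching rule wins and that a document matching no rule is downloaded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for rule lines such as "deny image/gif >100000"; not sosoo's class. */
final class DownloadRules {
    /** One parsed rule: action, MIME type, and optional size bounds in bytes. */
    static final class Rule {
        final boolean allow;
        final String mimeType;
        final long smallerThan; // Long.MAX_VALUE when the rule has no "<" condition
        final long largerThan;  // -1 when the rule has no ">" condition

        Rule(boolean allow, String mimeType, long smallerThan, long largerThan) {
            this.allow = allow;
            this.mimeType = mimeType;
            this.smallerThan = smallerThan;
            this.largerThan = largerThan;
        }

        boolean matches(String mime, long size) {
            return mimeType.equalsIgnoreCase(mime) && size < smallerThan && size > largerThan;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses one line; blank lines and "#" comments are skipped. */
    void addLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return;
        }
        String[] parts = trimmed.split("\\s+");
        if (parts.length < 2 || !(parts[0].equals("allow") || parts[0].equals("deny"))) {
            throw new IllegalArgumentException("Bad rule: " + line);
        }
        long smallerThan = Long.MAX_VALUE;
        long largerThan = -1L;
        for (int i = 2; i < parts.length; i++) {
            char op = parts[i].charAt(0);
            if (op != '<' && op != '>') {
                throw new IllegalArgumentException("Bad size condition: " + parts[i]);
            }
            long value = Long.parseLong(parts[i].substring(1));
            if (op == '<') {
                smallerThan = value;
            } else {
                largerThan = value;
            }
        }
        rules.add(new Rule(parts[0].equals("allow"), parts[1], smallerThan, largerThan));
    }

    /** First matching rule decides; with no match the document is downloaded (assumption). */
    boolean downloadAllowed(String mime, long size) {
        for (Rule rule : rules) {
            if (rule.matches(mime, size)) {
                return rule.allow;
            }
        }
        return true;
    }
}
```

Feeding it the line "deny application/octet-stream >1000000" blocks a 2 MB binary while leaving small binaries and all text/html documents downloadable, in line with the note above about not filtering out text/html.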

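5. Filtering downloaded HTTP documents
  After a document is downloaded, you can run the doc object through a series of processing steps. The spider provides a filterchain class to which you add your own filters. Implement the documentfilter interface to add your own behavior; the system provides one implementation, linklocalizer, which replaces relative links.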
    filterchain filters=new filterchain();
    documentfilter filter=new linklocalizer();
    filters.add(filter);
    robby.setfilters(filters); 
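The chain-of-filters idea can be sketched independently of sosoo as follows. TextFilterChain is a hypothetical class that works on plain strings rather than on sosoo's document objects, and the two filters in the usage note are made-up examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical filter chain over plain strings; each filter receives the previous filter's output. */
final class TextFilterChain {
    private final List<UnaryOperator<String>> filters = new ArrayList<>();

    /** Appends a filter; filters run in the order they were added. */
    void add(UnaryOperator<String> filter) {
        filters.add(filter);
    }

    /** Runs the document text through every filter in turn and returns the final result. */
    String process(String html) {
        String result = html;
        for (UnaryOperator<String> filter : filters) {
            result = filter.apply(result);
        }
        return result;
    }
}
```

A caller might add one filter that rewrites site-relative links, such as s -> s.replace("href=\"/", "href=\"http://example.com/"), followed by String::trim. Order matters, because each filter sees the text as the previous one left it.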
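6. Enabling runtime monitoring of the robot
Once the spider is running, it reports its runtime state through a callback interface, for example so that you can display the robot's completed tasks, the tasks in progress, and the processing status. Implement the robotcallback interface to receive these reports; you can also read the roboter object's properties directly.
The system provides the robotmonitor implementation, which prints the runtime state to the console.

robotcallback monitor=new robotmonitor();
robby.setwebrobotcallback(monitor);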

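7. Enabling monitoring of HTTP protocol handling
  The spider loops over the unprocessed URLs in its cache, and the system offers monitoring hooks for the handling of each URL. To use them, implement the httptoolcallback interface. The system provides the systemouthttptoolcallback implementation.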
httptoolcallback toolmonitor=new systemouthttptoolcallback();
robby.sethttptoolcallback(toolmonitor);
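III: About sosoo
 sosoo takes its core algorithm from JoBo. Instead of a recursive spider algorithm, it stores the history of visited URLs. This makes processing efficient, but at the cost of storage: at startup the system creates two Vectors that record the URLs. sosoo is therefore not suited to crawling very large volumes of data, but it is ample for industry-specific sites and small to medium-sized businesses.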
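The trade described above, stored lists instead of recursion, can be sketched over an in-memory link graph. IterativeCrawl, Task, and the map of links are hypothetical stand-ins for illustration; the sketch shows the general technique of a to-do queue plus a visited set, not sosoo's actual data structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical iterative crawl: a to-do queue and a visited set replace recursive calls. */
final class IterativeCrawl {
    private IterativeCrawl() {}

    /** A URL paired with its link depth from the start page. */
    private static final class Task {
        final String url;
        final int depth;

        Task(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /** Returns the pages visited from start, in visit order, following links up to maxDepth. */
    static List<String> crawl(Map<String, List<String>> links, String start, int maxDepth) {
        Deque<Task> todo = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        todo.add(new Task(start, 0));
        while (!todo.isEmpty()) {
            Task task = todo.poll();
            if (!visited.add(task.url)) {
                continue; // already processed
            }
            if (task.depth >= maxDepth) {
                continue; // page is visited, but its links are not followed
            }
            for (String next : links.getOrDefault(task.url, Collections.emptyList())) {
                if (!visited.contains(next)) {
                    todo.add(new Task(next, task.depth + 1));
                }
            }
        }
        return new ArrayList<>(visited);
    }
}
```

Both collections grow with the number of URLs seen, which is the storage cost mentioned above; in exchange, the call stack never deepens however long a chain of links is.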

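Features sosoo currently provides:
Access across domains and hosts
Download of many file formats
Recursive processing of the links in HTML
HTTP 1.1 support, but not HTTP 1.0
Anonymous proxies (HTTP), but not proxies that require authentication

Features still to be added:
Full-featured HTTP protocol handling
JavaScript support
Handling of HTML forms
FTP protocol support
Comprehensive proxy support (HTTP, SOCKS, and so on)
Better system monitoring
Stronger information processing for HTML documents
Processing tools for various file types
RSS support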

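IV: Application development guide
 sosoo is highly extensible and easy to integrate into your J2EE project. For small and medium-sized search engines, especially when analyzing data from sites in one particular industry, sosoo offers a convenient and safe solution.
As the customization section shows, an application touches only a few of sosoo's programming interfaces, and all callback objects are injected through setters, so sosoo integrates easily with setter-based dependency injection (IoC) frameworks such as Spring.
1. The roboter class, the spider's main class
roboter is a thread-based utility class for using sosoo inside your application. It lets a program start, pause, and exit a spider. The class is provided by sosoo, is not meant to be extended, and represents the spider itself. It is the entry point to everything the spider does; all features, callbacks included, are injected into roboter through setters.
com.sosoo.robot.spider.roboter
For example, to start a spider thread: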

    roboter robby = new roboter();
    robby.setstarturl(new URL("http://10.25.101.173:7001/pa18web/framework/images/framevork_04.gif"));
    robby.setmaxdepth(0);   // maximum link depth to follow
    robby.setsleeptime(0);
    robby.setwalktootherhosts(true);
    robby.run();      // start the spider
2. The tasklist interface, which stores the tasks to process
In sosoo, each URL corresponds to one task. The system provides a default implementation, and you can implement the interface yourself to suit your needs, then register it with the register methods when you start the spider.
  robby.registervisitedlist(new hashedmemorytasklist(false));
  robby.registertodolist(new hashedmemorytasklist());
  com.sosoo.robot.spider.tasklist
  The interface mainly covers storing com.sosoo.robot.spider.robottask objects and the common operations on them, such as remove, add, and find. See the Javadoc for details.
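For a feel of what an in-memory task store with add, remove, and find involves, here is a minimal sketch. MemoryTaskList is a hypothetical class that stores URL strings; it does not implement sosoo's tasklist interface, whose actual method signatures are documented in the Javadoc.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical in-memory task store: hashed lookup that also keeps insertion order. */
final class MemoryTaskList {
    private final Set<String> urls = new LinkedHashSet<>();

    /** Adds a task; returns false if the URL was already stored. */
    boolean add(String url) {
        return urls.add(url);
    }

    /** Removes a task; returns false if the URL was not stored. */
    boolean remove(String url) {
        return urls.remove(url);
    }

    boolean contains(String url) {
        return urls.contains(url);
    }

    int size() {
        return urls.size();
    }

    /** Removes and returns the oldest task, or null when the list is empty. */
    String removeFirst() {
        Iterator<String> it = urls.iterator();
        if (!it.hasNext()) {
            return null;
        }
        String first = it.next();
        it.remove();
        return first;
    }
}
```

A hashed structure keeps the "have we seen this URL?" lookup cheap as the list grows, which matters because that check runs for every link found.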
3. The HTTP client implementation
In sosoo 1.0 the client's main job is to fetch HTTP documents the way a browser would and convert them into httpdoc objects. It also manages the resources tied to HTTP requests, such as cookies.
sosoo provides this through the utility class com.sosoo.robot.http.httptool. You can substitute your own, better-tuned implementation, again registering it through a register method. Because replacing this tool requires a deep understanding of the HTTP protocol, doing so is generally not recommended.
robby.registerhttpparser(new httptool());
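4. Storing or processing web pages
After a download finishes, the spider uses the docmanager and the filters to process the HTML document and its content.
The httpdocmanager interface manages httpdoc objects, for example by storing them on the file system or in a database. The system provides the httpdoctofile implementation, which stores httpdoc objects on the file system.
filterchain runs a series of filters over the content of an httpdoc, for example to extract certain information or to replace certain content. It stores its filters in an array, so you can add as many objects implementing the documentfilter interface as you need. The system provides one implementation, linklocalizer, which replaces relative links.

The system runs the filters first and then the httpdocmanager.

See the Javadoc for programming details.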

5. Runtime monitoring
sosoo provides two monitoring interfaces. Implement some or all of their methods to monitor particular states at run time.
Spider monitoring: com.sosoo.robot.spider.robotcallback
It reports on document retrieval, spider sleep, and the spider's current tasks.
  void webrobotretrieveddoc(String url, int size);
 // called when the httpdoc object for a URL has been retrieved
  void webrobotupdatequeuestatus(int length);
 // monitors the tasks currently being processed
  void webrobotdone();
 // processing has finished
  void webrobotsleeping(boolean sleeping);
 // the spider has paused
HTTP monitoring: com.sosoo.robot.http.httptoolcallback
/**
   * after initiating a download, this method will be called to
   * inform about the url that will be retrieved
   *  @param url url that will be retrieved now
   */
  void sethttptooldocurl(String url);
 
  /**
   * after httptool got a content-length header
   * this method will be called to inform about the size of
   * the document to retrieve
   * @param size document size in
   */
  void sethttptooldocsize(int size);
 

  /**
   * after a block of bytes was read (by default after every 1024 bytes),
   * this method will be called
   * @param size the number of bytes that were retrieved
   */
  void sethttptooldoccurrentsize(int size);

  /**
   * informs about the current status of the httptool
   * @param status an integer describing the current status
   * constants defined in httptool
   * @see httptool
   */
  void sethttptoolstatus(int status);
The system provides systemouthttptoolcallback as the default implementation.
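A monitoring callback is usually just an object that accumulates what it is told. The sketch below shows the shape of such an implementation. CrawlListener is a hypothetical interface modelled loosely on the methods listed above, declared here so that the example compiles on its own; it is not sosoo's robotcallback.

```java
/** Hypothetical listener interface, loosely mirroring the callbacks described above. */
interface CrawlListener {
    void retrievedDoc(String url, int size);

    void queueStatus(int length);

    void done();
}

/** Collects simple statistics from the callbacks so a UI or log can display them. */
final class CountingListener implements CrawlListener {
    private int documents;
    private long bytes;
    private int queueLength;
    private boolean finished;

    @Override
    public void retrievedDoc(String url, int size) {
        documents++;
        bytes += size;
    }

    @Override
    public void queueStatus(int length) {
        queueLength = length;
    }

    @Override
    public void done() {
        finished = true;
    }

    /** One-line summary of everything reported so far. */
    String summary() {
        return documents + " docs, " + bytes + " bytes, queue=" + queueLength
                + (finished ? ", done" : "");
    }
}
```

If the spider runs on its own thread while another thread reads the summary, the fields would need synchronization; the sketch leaves that out for brevity.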
V. Example program
package com.sosoo.robot.examples;

/*********************************************
    copyright (c) 2005 by rimen sosoo
*********************************************/

import java.net.URL;

import com.sosoo.robot.http.downloadruleset;
import com.sosoo.robot.http.httpdocmanager;
import com.sosoo.robot.http.httpdoctobean;
import com.sosoo.robot.http.httptoolcallback;
import com.sosoo.robot.http.systemouthttptoolcallback;
import com.sosoo.robot.spider.robotcallback;
import com.sosoo.robot.spider.robotmonitor;
import com.sosoo.robot.spider.roboter;
import com.sosoo.robot.spider.docfilter.documentfilter;
import com.sosoo.robot.spider.docfilter.filterchain;
import com.sosoo.robot.spider.docfilter.linklocalizer;

/**
 * This example program downloads a web page. It does not
 * store the documents but only logs the visited URLs.
 *
 * @author jerry[wangjianhua] sosoo
 * @version $revision: 1.1 $
 */
public class spidermain {

  public static void main(String[] args)
    throws Exception
  {
    System.out.println("urls will be logged to urls.txt\n\n");

    roboter robby =new roboter();
    System.out.println(robby);
    robby.setstarturl(new URL("http://www.sina.com.cn/"));
    robby.setmaxdepth(0);
    robby.setdepthiseffect(true);
    robby.setsleeptime(0);
    robby.setignorerobotstxt(true);  
    robby.setwalktootherhosts(true);
   

   
    filterchain filters=new filterchain();
    documentfilter filter=new linklocalizer();
    filters.add(filter);
    // HTML stream filter
   
    downloadruleset rules=new downloadruleset("downrules.properties");
    httpdocmanager dm = new httpdoctobean();
    // document management: documents can be stored in a database or locally
   
    robotcallback monitor=new robotmonitor();
    httptoolcallback toolmonitor=new systemouthttptoolcallback();
   
    robby.setdocmanager(dm);   
    robby.setdownloadruleset(rules);
    robby.setfilters(filters);
    robby.setwebrobotcallback(monitor);
    robby.sethttptoolcallback(toolmonitor);
  
    robby.run(); // start the spider

  }
}
 
VI. References and dependencies
- JoBo spider implementation
- Tidy HTML parser
- log4j logger
- Regular expression library from Apache
 
    Contact:   Wang Jianhua
    qq:  31580681
    qq group:  1327744 (mention sosoo or j2ee when you join)
    mail:  yityt@hotmail.com 
    blog:  http://jerry_blog.blogcn.com
    <Suggestions are welcome, as is anyone interested in helping to extend the features not yet implemented>
The source code is available on request.

Editor: gzu521
