How to Scrape a Website's Sitemap with Scrapy in Python

1107 views  |  Published 5 years ago

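This article shows, by example, how to use Scrapy in Python to read a website's sitemap and crawl the pages it lists. The spider downloads sitemap.xml, pulls the URL out of every `<loc>` node with a regular expression, and schedules a request for each one. The details are as follows: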


    import re

    import scrapy

    # Uses the current Scrapy API (scrapy.Spider, response.xpath) in place of
    # the legacy BaseSpider / HtmlXPathSelector classes.


    class PageItem(scrapy.Item):
        # Mock item; replace with the fields you actually need
        divText = scrapy.Field()


    class SitemapSpider(scrapy.Spider):
        name = "SitemapSpider"
        start_urls = ["http://www.domain.com/sitemap.xml"]

        def parse(self, response):
            # Extract the contents of every <loc>...</loc> node in the sitemap
            nodename = "loc"
            text = response.text
            r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
            for match in r.finditer(text):
                url = match.group(2).strip()
                yield scrapy.Request(url, callback=self.parse_page)

        def parse_page(self, response):
            item = PageItem()
            # Do all your page parsing here and select the elements you want
            # (.get() returns None when nothing matches)
            item["divText"] = response.xpath("//div/text()").get()
            yield item
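
The URL extraction step in `parse()` can be tried out without running a crawl. The following is a minimal sketch using only the standard library. The sample sitemap text and the helper names `extract_locs_regex` and `extract_locs_xml` are made up for illustration and are not part of the original spider. The first helper reuses the spider's regular expression (with whitespace stripped from each match), and the second shows one possible alternative that parses the XML instead of pattern matching.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample sitemap, used only for illustration
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.domain.com/</loc>
  </url>
  <url>
    <loc>
      http://www.domain.com/about
    </loc>
  </url>
</urlset>
"""


def extract_locs_regex(text, nodename="loc"):
    """Collect <loc> contents with the same regular expression as parse()."""
    pattern = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
    # strip() removes whitespace and line breaks around the URL
    return [m.group(2).strip() for m in pattern.finditer(text)]


def extract_locs_xml(text):
    """Alternative: parse the XML and collect <loc> elements in any namespace."""
    root = ET.fromstring(text.encode("utf-8"))
    urls = []
    for el in root.iter():
        # Namespaced tags look like "{namespace-uri}loc"
        if el.tag == "loc" or el.tag.endswith("}loc"):
            urls.append((el.text or "").strip())
    return urls


if __name__ == "__main__":
    print(extract_locs_regex(SAMPLE_SITEMAP))
    print(extract_locs_xml(SAMPLE_SITEMAP))
```

Both helpers return the same list for this sample. The regular expression is shorter, but it would also capture attribute text if a `<loc>` tag ever carried attributes, so the XML-based version may be the more robust choice for irregular sitemaps.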

I hope this article is helpful for your Python programming.
