Example of fetching HTML page resources with Python's urllib2 module

First, list the web addresses you want to fetch in a separate list file:


    http://www.jb51.net/article/83440.html
    http://www.jb51.net/article/83437.html
    http://www.jb51.net/article/83430.html
    http://www.jb51.net/article/83449.html

Then let's look at how the program works; the code is as follows:


    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import os
    import urllib2

    def Cdown_data(fileurl, fpath, dpath):
        # Create the target directory if it does not exist yet
        if not os.path.exists(dpath):
            os.makedirs(dpath)
        try:
            # Fetch the page and write the raw HTML to the local file
            getfile = urllib2.urlopen(fileurl)
            data = getfile.read()
            f = open(fpath, 'w')
            f.write(data)
            f.close()
        except Exception as e:
            print fileurl, 'download failed:', e

    with open('u1.list') as lines:
        for line in lines:
            URI = line.strip()
            # Skip URLs that carry query strings or escaped characters
            if '?' in URI or '%' in URI:
                continue
            # Skip bare domains such as http://www.jb51.net (only two slashes)
            elif URI.count('/') == 2:
                continue
            elif URI.count('/') > 2:
                #print URI, URI.count('/')
                try:
                    # Directory part: strip the scheme, drop the file name
                    dirpath = URI.rpartition('/')[0].split('//')[1]
                    #filepath = URI.split('//')[1].split('/')[1]
                    filepath = URI.split('//')[1]
                    if filepath:
                        print URI, filepath, dirpath
                        Cdown_data(URI, filepath, dirpath)
                except:
                    print URI, 'error'
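
For reference, here is a quick interactive check (not part of the original script) of how dirpath and filepath are derived from one of the listed URLs:


    >>> URI = 'http://www.jb51.net/article/83440.html'
    >>> URI.rpartition('/')[0].split('//')[1]   # dirpath: scheme stripped, file name dropped
    'www.jb51.net/article'
    >>> URI.split('//')[1]                      # filepath: scheme stripped, full path kept
    'www.jb51.net/article/83440.html'

So each page is saved as www.jb51.net/article/83440.html inside a newly created www.jb51.net/article/ directory.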

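Note that urllib2 exists only in Python 2. As a minimal sketch (not from the original article), the same fetch-and-save loop could be written for Python 3 with urllib.request, reusing the u1.list file and the same path-derivation logic:


    # Minimal Python 3 sketch of the same loop; urllib2 became urllib.request.
    import os
    from urllib.request import urlopen

    def cdown_data(fileurl, fpath, dpath):
        if not os.path.exists(dpath):
            os.makedirs(dpath)
        try:
            data = urlopen(fileurl).read()     # raw bytes of the HTML page
            with open(fpath, 'wb') as f:       # write bytes, not text
                f.write(data)
        except Exception as e:
            print(fileurl, 'download failed:', e)

    with open('u1.list') as lines:
        for line in lines:
            uri = line.strip()
            if '?' in uri or '%' in uri or uri.count('/') <= 2:
                continue
            dirpath = uri.rpartition('/')[0].split('//')[1]
            filepath = uri.split('//')[1]
            print(uri, filepath, dirpath)
            cdown_data(uri, filepath, dirpath)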