The Python 3 urllib module (Python's urllib module)

25-02-05 21

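This article covers the Python 3 urllib module and Python's urllib module in general, and is meant as a detailed reference. It walks through the Python 3 urllib module, answers common questions about it, and collects practical notes under these headings: Python urllib URL handling modules, the Python urllib2 module, the Python urllib module, and Python core modules: the urllib module.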

Contents:

The Python 3 urllib module (Python's urllib module)
Python urllib: URL handling modules
The Python urllib2 module
The Python urllib module
Python core modules: the urllib module

The Python 3 urllib module (Python's urllib module)

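In Python 3.0, urllib2, urlparse, and robotparser were merged into urllib, and the urllib package was reorganized into five submodules: the five names listed by help().

The Python 2 urllib module was split in Python 3 into the following modules: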

20.5. urllib.request — Extensible library for opening URLs
20.6. urllib.response — Response classes used by urllib
20.7. urllib.parse — Parse URLs into components
20.8. urllib.error — Exception classes raised by urllib.request
20.9. urllib.robotparser — Parser for robots.txt

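As a result, the commonly used urllib.urlopen() function became urllib.request.urlopen(). For the other changes, consult the Python 3 documentation.

The "Internet Protocols and Support" section of the Python 3 documentation: http://docs.python.org/py3k/library/internet.html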
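As a rough illustration of the renaming described above, the sketch below shows where a few commonly used Python 2 names live in Python 3. The URL is a placeholder chosen for illustration and is only parsed; no network request is made.

```python
# Python 2 name                      ->  Python 3 location
# urllib.urlopen / urllib2.urlopen   ->  urllib.request.urlopen
# urllib.urlencode, urllib.quote     ->  urllib.parse.urlencode, urllib.parse.quote
# urlparse.urlsplit                  ->  urllib.parse.urlsplit
# urllib2.URLError                   ->  urllib.error.URLError
from urllib.request import urlopen  # imported only to show the new location
from urllib.parse import urlsplit, urlencode, quote
from urllib.error import URLError

# Placeholder URL, used for parsing only
parts = urlsplit("http://www.example.com/search?q=urllib")
query = urlencode({"q": "urllib", "page": 1})

print(parts.netloc)   # www.example.com
print(parts.path)     # /search
print(query)          # q=urllib&page=1
```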

Libraries used in Python 2:

urllib http://docs.python.org/library/urllib.html [downloading]

urllib2 http://docs.python.org/library/urllib2.html [fetching pages]

urlparse http://docs.python.org/library/urlparse.html [splitting URLs]

sgmllib http://docs.python.org/library/sgmllib.html [parsing HTML]

#!/usr/bin/python
# -*- coding:utf-8 -*-
# Recursively download the images found on a website with urllib2 (Python 2 only)
# author: wklken
# 2012-03-17 wklken@yeah.net
# 1. URL parsing  2. image download  3. refactoring
# 4. multithreading: not added yet

import os,sys,urllib,urllib2,urlparse
from sgmllib import SGMLParser

img = []
class URLLister(SGMLParser):
  def reset(self):
    SGMLParser.reset(self)
    self.urls=[]
    self.imgs=[]
  def start_a(self, attrs):
    href = [ v for k,v in attrs if k=="href" and v.startswith("http")]
    if href:
      self.urls.extend(href)
  def start_img(self, attrs):
    src = [ v for k,v in attrs if k=="src" and v.startswith("http") ]
    if src:
      self.imgs.extend(src)


def get_url_of_page(url, if_img = False):
  urls = []
  try:
    f = urllib2.urlopen(url, timeout=1).read()
    url_listen = URLLister()
    url_listen.feed(f)
    if if_img:
      urls.extend(url_listen.imgs)
    else:
      urls.extend(url_listen.urls)
  except urllib2.URLError, e:
    print e.reason
  return urls

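# Process pages recursively
def get_page_html(begin_url, depth, ignore_outer, main_site_domain):
  # If external sites are excluded, skip URLs outside the main domain
  if ignore_outer:
    if not main_site_domain in begin_url:
      return

  if depth == 1:
    urls = get_url_of_page(begin_url, True)
    img.extend(urls)
  else:
    urls = get_url_of_page(begin_url)
    if urls:
      for url in urls:
        # Pass all four arguments through; the original call omitted the last two
        get_page_html(url, depth-1, ignore_outer, main_site_domain)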

# Download the images
def download_img(save_path, min_size):
  print "download begin..."
  for im in img:
    filename = im.split("/")[-1]
    dist = os.path.join(save_path, filename)
    # Checking the size by downloading the whole image is wasteful:
    #if len(urllib2.urlopen(im).read()) < min_size:
    #  continue
    # Reading the response headers first avoids downloading the body twice
    connection = urllib2.build_opener().open(urllib2.Request(im))
    # A missing Content-Length is treated as 0, so such images are skipped
    size = int(connection.headers.get('content-length', 0))
    connection.close()
    if size < min_size:
      continue
    urllib.urlretrieve(im, dist,None)
    print "Done: ", filename
  print "download end..."

if __name__ == "__main__":
  # First page to fetch images from
  url = "http://www.baidu.com/"
  # Directory where the images are saved
  save_path = os.path.abspath("./download")
  if not os.path.exists(save_path):
    os.mkdir(save_path)
  # Images must be larger than this threshold, in bytes
  min_size = 92
  # Crawl depth
  max_depth = 1
  # Whether to stay on the target site, i.e. ignore external sites
  ignore_outer = True
  main_site_domain = urlparse.urlsplit(url).netloc

  get_page_html(url, max_depth, ignore_outer, main_site_domain)

  download_img(save_path, min_size)
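The script above depends on sgmllib, which is not available in Python 3. As a minimal sketch of the same link and image collection step, assuming html.parser.HTMLParser as the replacement parser (a substitution, not the original author's method), the parsing part could look like this. The HTML string is made up so that the example runs without a network connection.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect absolute http(s) links and image sources from an HTML page."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.imgs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if not value or not value.startswith("http"):
                continue
            if tag == "a" and name == "href":
                self.urls.append(value)
            elif tag == "img" and name == "src":
                self.imgs.append(value)


# Made-up HTML used instead of a live page
sample = (
    '<a href="http://example.com/page">page</a>'
    '<a href="/relative">skipped</a>'
    '<img src="http://example.com/a.jpg">'
)
lister = URLLister()
lister.feed(sample)
print(lister.urls)
print(lister.imgs)
```

As in the original class, relative links are ignored because only values starting with "http" are kept.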

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time
import sys
import gzip
import socket
import urllib.request, urllib.parse, urllib.error
import http.cookiejar


class HttpTester:
    def __init__(self, timeout=10, addHeaders=True):
        socket.setdefaulttimeout(timeout)  # set the global socket timeout
        self.__opener = urllib.request.build_opener()
        urllib.request.install_opener(self.__opener)
        if addHeaders:
            self.__addHeaders()

    def __error(self, e):
        '''Error handling.'''
        print(e)

    def __addHeaders(self):
        '''Add the default headers.'''
        self.__opener.addheaders = [
            ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'),
            ('Connection', 'keep-alive'),
            ('Cache-Control', 'no-cache'),
            ('Accept-Language', 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3'),
            ('Accept-Encoding', 'gzip'),  # only gzip is decoded by __decode below
            ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
        ]

    def __decode(self, webPage, charset):
        '''Decompress gzip data if needed, then decode the page with the given charset.'''
        if webPage.startswith(b'\x1f\x8b'):  # gzip magic bytes
            return gzip.decompress(webPage).decode(charset)
        else:
            return webPage.decode(charset)

    def addCookiejar(self):
        '''Add a cookiejar handler to self.__opener.'''
        cj = http.cookiejar.CookieJar()
        self.__opener.add_handler(urllib.request.HTTPCookieProcessor(cj))

    def addProxy(self, host, type='http'):
        '''Set a proxy.'''
        proxy = urllib.request.ProxyHandler({type: host})
        self.__opener.add_handler(proxy)

    def addAuth(self, url, user, pwd):
        '''Add HTTP basic authentication.'''
        pwdMsg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        pwdMsg.add_password(None, url, user, pwd)
        auth = urllib.request.HTTPBasicAuthHandler(pwdMsg)
        self.__opener.add_handler(auth)

    def get(self, url, params={}, headers={}, charset='UTF-8'):
        '''HTTP GET.'''
        if params:
            url += '?' + urllib.parse.urlencode(params)
        request = urllib.request.Request(url)
        for k, v in headers.items():
            request.add_header(k, v)  # add headers to this specific request
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as e:
            self.__error(e)
        else:
            return self.__decode(response.read(), charset)

    def post(self, url, params={}, headers={}, charset='UTF-8'):
        '''HTTP POST.'''
        params = urllib.parse.urlencode(params)
        # A request that carries data is sent as POST
        request = urllib.request.Request(url, data=params.encode(charset))
        for k, v in headers.items():
            request.add_header(k, v)
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as e:
            self.__error(e)
        else:
            return self.__decode(response.read(), charset)

    def download(self, url, savefile):
        '''Download a file or web page.'''
        # Temporarily drop the Accept-Encoding header so the file is saved uncompressed
        saved_headers = self.__opener.addheaders
        self.__opener.addheaders = [h for h in saved_headers if h[0] != 'Accept-Encoding']
        perLen = 0

        def reporthook(a, b, c):  # a: blocks transferred so far; b: block size; c: remote file size
            nonlocal perLen
            if c > 1000000:
                per = (100.0 * a * b) / c
                if per > 100:
                    per = 100
                per = '{:.2f}%'.format(per)
                print('\b' * perLen, per, end='')  # overwrite the previous percentage
                sys.stdout.flush()
                perLen = len(per) + 1

        print('--> {}\t'.format(url), end='')
        try:
            # reporthook is the callback used to display download progress
            urllib.request.urlretrieve(url, savefile, reporthook)
        except urllib.error.HTTPError as e:
            self.__error(e)
        finally:
            self.__opener.addheaders = saved_headers
            print()
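The gzip handling in the class above relies on the fact that gzip streams begin with the two bytes 0x1f 0x8b. The following standalone sketch isolates that step. The function name decode_body and the sample body are chosen here for illustration, and the response is simulated with gzip.compress rather than fetched from a server.

```python
import gzip


def decode_body(raw, charset="utf-8"):
    """Return text from a response body that may or may not be gzip-compressed."""
    # gzip streams start with the two magic bytes 0x1f 0x8b
    if raw.startswith(b"\x1f\x8b"):
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulated response bodies: one plain, one gzip-compressed
plain = "<html>hello</html>".encode("utf-8")
compressed = gzip.compress(plain)

print(decode_body(plain))
print(decode_body(compressed))
```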

II. Usage examples
Posting a "tweet" (动弹) on OSChina
ht = HttpTester()

ht.addCookiejar()
# Some values are masked for privacy
ht.get('https://www.oschina.net/home/login?goto_page=http%3A%2F%2Fwww.oschina.net%2F')
ht.post(url = 'https://www.oschina.net/action/user/hash_login',
        params = {'email': '****@foxmail.com','pwd': 'e4a1425583d37fcd33b9*************','save_login': '1'})  # password hash, captured with the Firefox developer tools

ht.get('http://www.oschina.net/')
ht.post(url = 'http://www.oschina.net/action/tweet/pub',
        params = {'user_code': '8VZTqhkJOqhnuugHvzBtME4***********','user': '102*****','msg': '大家在动弹什么？ via:(python3, urllib) ->{t}'.format(t = time.time())})
Kingsoft Kuaipan: daily sign-in for bonus storage
ht = HttpTester()
ht.addCookiejar()
# Some values are masked for privacy
ht.get('https://www.kuaipan.cn/account_login.htm')
ht.post(url='https://www.kuaipan.cn/index.php?ac=account&op=login',params={'username': '****@qq.com','userpwd': 'lyb********','isajax': 'yes'})
ht.get('http://www.kuaipan.cn/index.php?ac=zone&op=taskdetail')
ht.get('http://www.kuaipan.cn/index.php?ac=common&op=usersign')

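Python urllib: URL handling modules

A Python package that covers web requests, response handling, proxy and cookie configuration, exception handling, URL parsing, and more.

Source code: Lib/urllib/

urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files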
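Of the four submodules, urllib.robotparser is the one not demonstrated elsewhere in this article. A minimal sketch, assuming a made-up robots.txt and a made-up crawler name, is shown below. The rules are fed in directly with parse() so that nothing is fetched; a real crawler would typically call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, split into lines for parse()
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "MyCrawler" is a hypothetical user agent name
allowed = parser.can_fetch("MyCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://www.example.com/private/data.html")
print(allowed, blocked)
```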

urllib.request

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

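url: the URL to open; it can be a string or a Request object.

data: an object holding additional data to send to the server (for example, the body of a POST request); defaults to None.

timeout: a timeout in seconds for blocking operations such as the connection attempt; it only applies to HTTP, HTTPS, and FTP connections.

cafile: a single file containing CA certificates. Deprecated since Python 3.6 and removed in Python 3.13; use context instead.

capath: a directory of hashed certificate files. Deprecated and removed together with cafile.

context: an ssl.SSLContext instance describing the various SSL options.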
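To show the basic calling pattern without depending on a network connection, the sketch below opens a data: URL, which carries its payload inline (urllib.request supports this scheme since Python 3.4). The payload is made up; with an http:// or https:// URL the call has the same shape.

```python
import urllib.request

# A data: URL carries its payload inline, so no network access is needed
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url, timeout=5) as response:
    raw = response.read()                               # always bytes
    content_type = response.headers.get_content_type()  # media type from the headers

text = raw.decode("utf-8")
print(text)
print(content_type)
```

Using the response as a context manager closes it automatically once the body has been read.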

urllib.request.install_opener(opener)

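Installs an OpenerDirector instance as the default global opener.

urllib.request.build_opener([handler, ...])

Returns an OpenerDirector instance that chains the handlers in the order given. Each handler can be either an instance of BaseHandler or a subclass of BaseHandler (in which case it must be possible to call the constructor without any arguments).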
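A small sketch of how build_opener() and install_opener() fit together follows. The User-Agent string is made up, and a data: URL is used so that the example runs offline.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()

# Handler instances and handler classes can be mixed; classes are
# instantiated by build_opener() with no arguments.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    urllib.request.HTTPRedirectHandler,
)
opener.addheaders = [("User-Agent", "example-client/0.1")]  # made-up agent string

# After install_opener(), plain urlopen() calls go through this opener
urllib.request.install_opener(opener)

with urllib.request.urlopen("data:text/plain,ok") as response:
    body = response.read()
print(body)
```

Calling opener.open(url) directly is an alternative when a global default is not wanted.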

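urllib.request.pathname2url(path)

Converts the pathname path from the local path syntax to the form used in the path component of a URL.

urllib.request.url2pathname(path)

Converts the path component path from a percent-encoded URL to the local path syntax.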
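The two functions are inverses of each other, which a quick round trip can illustrate. The path below is made up and assumes a POSIX-style system; the exact URL form printed can vary with the platform and Python version, so no expected output is given.

```python
from urllib.request import pathname2url, url2pathname

local_path = "/tmp/my file.txt"       # made-up POSIX-style path

url_path = pathname2url(local_path)   # the space is percent-encoded as %20
round_trip = url2pathname(url_path)   # back to the local path syntax

print(url_path)
print(round_trip)
```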

urllib.request.getproxies()

This helper function returns a dictionary mapping URL schemes to proxy server URLs.

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url should be a string containing a valid URL.
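Building Request objects does not send anything, so their behavior can be inspected offline. In the sketch below the endpoint and the User-Agent string are placeholders; it shows how data and method determine the HTTP method that would be used.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint; nothing is sent until the request is passed to urlopen()
url = "http://www.example.com/api"

get_req = urllib.request.Request(url, headers={"User-Agent": "example-client/0.1"})

# data must be bytes; supplying it switches the default method to POST
body = urllib.parse.urlencode({"name": "urllib"}).encode("utf-8")
post_req = urllib.request.Request(url, data=body)

# method= overrides the default choice
put_req = urllib.request.Request(url, data=body, method="PUT")

print(get_req.get_method(), post_req.get_method(), put_req.get_method())
print(get_req.get_header("User-agent"))  # header names are stored capitalized
```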

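class urllib.request.OpenerDirector: opens URLs via BaseHandler objects chained together. It manages the chaining of handlers and recovery from errors.

class urllib.request.BaseHandler: the base class for all registered handlers; it handles only the simple mechanics of registration.

class urllib.request.HTTPDefaultErrorHandler: defines the default handler for HTTP error responses; all such responses are turned into HTTPError exceptions.

class urllib.request.HTTPRedirectHandler: a class to handle redirections.

class urllib.request.HTTPCookieProcessor(cookiejar=None): a class to handle HTTP cookies.

class urllib.request.ProxyHandler(proxies=None)

Causes requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to proxy URLs.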
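A minimal sketch of configuring proxies is given below. The proxy address is hypothetical and no request is sent, so the example only shows how the handler is wired into an opener.

```python
import urllib.request

# Hypothetical proxy address; replace it with a real proxy before sending requests
proxies = {"http": "http://127.0.0.1:8080"}
proxy_handler = urllib.request.ProxyHandler(proxies)
proxied_opener = urllib.request.build_opener(proxy_handler)

# An empty dictionary turns proxies off, including those set in the environment
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# getproxies() reports the scheme-to-proxy mapping found in the environment
env_proxies = urllib.request.getproxies()
print(proxies)
print(type(env_proxies).__name__)
```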

Reference:

urllib --- URL handling modules

The Python urllib2 module

urllib2.urlopen(url, data=None, timeout=<object object>): opens a URL, which can be either a string or a Request object. data is a string of additional data to send to the server, and timeout sets the timeout for opening the URL.

In [1]: import urllib2
In [2]: response = urllib2.urlopen('http://www.baidu.com/')    # returns a file-like object
In [3]: data = response.read()    # read() returns the data; readline() and readlines() also work

In [1]: import urllib2
In [2]: url = 'http://www.baidu.com/'
In [3]: headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36'}
In [4]: request = urllib2.Request(url, headers=headers)    # alternatively, build a Request object first
In [5]: response = urllib2.urlopen(request)    # then open it with urlopen()
In [6]: data = response.read()

urllib2.Request(url, data, headers): builds a request object that is then opened with urllib2.urlopen(). data is a string of additional data to send to the server, and headers sets the request headers; you can see the headers a browser sends by pressing F12.

In [1]: import urllib2
In [2]: url = 'http://www.baidu.com/'
In [3]: headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36'}    # User-Agent identifies the browser making the request
In [4]: request = urllib2.Request(url, headers=headers)
In [5]: response = urllib2.urlopen(request)
In [6]: data = response.read()

urllib2.URLError: an exception class. urlopen() raises it when a URL cannot be opened; typical causes are no network connection, a failed connection to the server, or a server that cannot be found.
urllib2.HTTPError: a subclass of URLError. It is raised when the server does respond, but with an HTTP error status; it lets you read the HTTP status code of the response.
urllib2.URLError carries a reason attribute that describes the failure. urllib2.HTTPError additionally carries a code attribute holding the HTTP status code, such as 403 or 502. Because HTTPError is a subclass of URLError, catching urllib2.URLError alone is usually enough.

import urllib2

try:
    urllib2.urlopen('http://blog.csdn.net/cqcrek')
except urllib2.URLError as e:
    if hasattr(e, 'code'):
        print 'Failed to connect to the server, error code: %s' % e.code
    if hasattr(e, 'reason'):
        print 'Failed to connect to the server, reason: %s' % e.reason
    else:
        print 'Failed to connect to the server: %s' % e
else:
    print 'OK'
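In Python 3 the same exception classes live in urllib.error. The sketch below is one possible way to structure the handling, not the only one. The helper names describe and try_open are chosen here for illustration, the missing file: URL exists only to trigger URLError without network access, and the HTTPError is constructed by hand purely to show its attributes.

```python
import urllib.error
import urllib.request


def describe(exc):
    """Summarize a urllib exception; HTTPError is checked first because it is the subclass."""
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "failed to reach the server: {}".format(exc.reason)
    return "unexpected error: {}".format(exc)


def try_open(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            response.read()
    except urllib.error.URLError as exc:   # also catches HTTPError
        return describe(exc)
    return "OK"


# A file: URL that does not exist raises URLError without touching the network
missing = try_open("file:///no/such/file/for-this-example")

# An HTTPError built by hand, only to show its code and reason attributes
fake_404 = urllib.error.HTTPError("http://www.example.com/x", 404, "Not Found", None, None)
print(missing)
print(describe(fake_404))
```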

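The Python urllib module

In Python 2, the urllib module provides a high-level interface for downloading and reading data. As an example, the following fetches the HTML of the Sina home page and prints it. There are two ways to do this.

1. urlopen(url, data=None, proxies=None)
urlopen(url [, data]) -> open file-like object

Creates a file-like object representing the remote URL, which is then used like a local file to read the remote data. url is the location of the remote data, usually a web address; data is the data to submit to the URL via POST; proxies configures proxies. urlopen returns a file-like object.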

#!/usr/bin/python2.5
import urllib

url = "http://www.sina.com"
data = urllib.urlopen(url).read()
print data

root@10.1.6.200:~# python gethtml.py 
<!Doctype html>
<!--[30,131,1] published at 2013-04-11 23:15:33 from #150 by system-->
<html>
<head>
    <meta http-equiv="Content-type" content="text/html; charset=gb2312" />
    <title>新浪首页</title>

	<meta name="keywords" content="新浪,新浪网,SINA,sina,sina.com.cn,新浪首页,门户,资讯" />
....

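2. urlretrieve(url, filename=None, reporthook=None, data=None)

urlretrieve downloads the remote data straight to a local file. filename is the local path to save to (if it is omitted, urllib creates a temporary file to hold the data); reporthook is a callback that is invoked once when the connection to the server is established and again after each block of data has been transferred.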

#!/usr/bin/python2.5
import urllib

url = "http://www.sina.com"
path = "/root/sina.txt"
data = urllib.urlretrieve(url,path)

root@10.1.6.200:~# python getsina.py 
root@10.1.6.200:~# cat sina.txt 
<!Doctype html>
<!--[30,131,1] published at 2013-04-11 23:25:30 from #150 by system-->
<html>
<head>
    <meta http-equiv="Content-type" content="text/html; charset=gb2312" />
    <title>新浪首页</title>

	<meta name="keywords" content="新浪,新浪网,SINA,sina,sina.com.cn,新浪首页,门户,资讯" />
....
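The reporthook callback is easiest to understand by recording its arguments. The following Python 3 sketch does that with urllib.request.urlretrieve. As an assumption made for the demo, a temporary local file served through a file: URL stands in for the remote resource, so the example runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

calls = []


def reporthook(block_num, block_size, total_size):
    # Called once when the transfer starts and once after each block
    calls.append((block_num, block_size, total_size))


with tempfile.TemporaryDirectory() as tmp:
    # A local file stands in for the remote resource (assumption for the demo)
    source = Path(tmp) / "source.txt"
    source.write_bytes(b"x" * 20000)
    target = os.path.join(tmp, "copy.txt")

    filename, headers = urllib.request.urlretrieve(source.as_uri(), target, reporthook)
    copied = Path(filename).read_bytes()

print(len(copied), len(calls))
```

Multiplying block_num by block_size and dividing by total_size gives an approximate progress ratio, which is how download progress is usually displayed.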

Going further, here is a small crawler that downloads, one by one, the jpg images on the Baidu Tieba page http://tieba.baidu.com/p/2236567282.

root@10.1.6.200:~# cat getJpg.py 
#!/usr/bin/python2.5

import re
import urllib

def getHtml(url):
    html = urllib.urlopen(url).read()
    return html

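def getJpg(html):
    # Capture every absolute URL ending in .jpg inside a src="..." attribute
    reg = r'src="(http://.*?\.jpg)"'
    imgre = re.compile(reg)
    imgList = re.findall(imgre,html)
    x = 0
    for imgurl in imgList:
        # Save the images as 0.jpg, 1.jpg, ...
        urllib.urlretrieve(imgurl,'%s.jpg' % x)
        x += 1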

html = getHtml("http://tieba.baidu.com/p/2236567282")
getJpg(html)

root@10.1.6.200:~# python getJpg.py 
root@10.1.6.200:~# ls -l
total 1680
-rw-r--r-- 1 root root  38695 2013-04-11 23:32 0.jpg
-rw-r--r-- 1 root root  48829 2013-04-11 23:32 10.jpg
-rw-r--r-- 1 root root  51835 2013-04-11 23:32 11.jpg
-rw-r--r-- 1 root root  41688 2013-04-11 23:32 12.jpg
-rw-r--r-- 1 root root   1077 2013-04-11 23:32 13.jpg
-rw-r--r-- 1 root root  33989 2013-04-11 23:32 14.jpg
-rw-r--r-- 1 root root  41890 2013-04-11 23:32 15.jpg
-rw-r--r-- 1 root root  35728 2013-04-11 23:32 16.jpg
-rw-r--r-- 1 root root  44405 2013-04-11 23:32 17.jpg
-rw-r--r-- 1 root root  29847 2013-04-11 23:32 18.jpg
-rw-r--r-- 1 root root  44607 2013-04-11 23:32 19.jpg
-rw-r--r-- 1 root root  23939 2013-04-11 23:32 1.jpg
-rw-r--r-- 1 root root  45592 2013-04-11 23:32 20.jpg
-rw-r--r-- 1 root root  60910 2013-04-11 23:32 2.jpg
-rw-r--r-- 1 root root  39014 2013-04-11 23:32 3.jpg
-rw-r--r-- 1 root root  19057 2013-04-11 23:32 4.jpg
-rw-r--r-- 1 root root  64584 2013-04-11 23:32 5.jpg
-rw-r--r-- 1 root root  29297 2013-04-11 23:32 6.jpg
-rw-r--r-- 1 root root  39145 2013-04-11 23:32 7.jpg
-rw-r--r-- 1 root root   1059 2013-04-11 23:32 8.jpg
-rw-r--r-- 1 root root  44797 2013-04-11 23:32 9.jpg
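The extraction step of the crawler can be tried on its own in Python 3. The sketch below uses a made-up HTML fragment instead of a downloaded page, and a slightly stricter pattern than the script above: [^"] keeps a match from running across several attributes, which .*? can do when a non-jpg image comes first. In Python 3 the bytes returned by urlopen().read() would need to be decoded to str before matching with a str pattern.

```python
import re

# Capture absolute .jpg URLs inside src="..."; [^"] stops at the closing quote
IMG_RE = re.compile(r'src="(https?://[^"]+?\.jpg)"')

# Made-up HTML fragment standing in for a downloaded page
html = (
    '<img src="http://example.com/pics/a.jpg">'
    '<img src="http://example.com/pics/b.png">'
    '<img class="x" src="https://example.com/pics/c.jpg">'
)

img_urls = IMG_RE.findall(html)
# Number the target files the same way as the script: 0.jpg, 1.jpg, ...
targets = ["%d.jpg" % i for i, _ in enumerate(img_urls)]
print(img_urls)
print(targets)
```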

Python core modules: the urllib module

Functions in the urllib module

1. urllib.urlopen(url[,data[,proxies]])

Opens a URL and returns a file-like object, which supports the usual file operations. This example opens Google.

>>> import urllib
>>> f = urllib.urlopen('http://www.google.com.hk/')
>>> firstLine = f.readline()   # read the first line of the HTML page
>>> firstLine
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage"><head><meta content="/images/google_favicon_128.png" itemprop="image"><title>Google</title><script>(function(){\n'

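The object returned by urlopen provides these methods:

- read(), readline(), readlines(), fileno(), close(): used exactly like the same methods on file objects

- info(): returns an httplib.HTTPMessage object holding the headers sent by the remote server

- getcode(): returns the HTTP status code. For HTTP requests, 200 means the request succeeded and 404 means the URL was not found

- geturl(): returns the URL of the resource that was retrieved, which can differ from the requested URL after a redirect

2. urllib.urlretrieve(url[,filename[,reporthook[,data]]])

urlretrieve downloads the resource that url points to (an HTML file in this example) to your local disk. If filename is not given, the data is stored in a temporary file.

urlretrieve() returns a two-element tuple (filename, mime_hdrs).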

>>> filename = urllib.urlretrieve('http://www.google.com.hk/')
>>> type(filename)
<type 'tuple'>
>>> filename[0]
'/tmp/tmp8eVLjq'
>>> filename[1]
<httplib.HTTPMessage instance at 0xb6a363ec>

>>> filename = urllib.urlretrieve('http://www.google.com.hk/',filename='/home/dzhwen/python文件/Homework/urllib/google.html')
>>> type(filename)
<type 'tuple'>
>>> filename[0]
'/home/dzhwen/python\xe6\x96\x87\xe4\xbb\xb6/Homework/urllib/google.html'
>>> filename[1]
<httplib.HTTPMessage instance at 0xb6e2c38c>

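3. urllib.urlcleanup()

Clears the cache left behind by urllib.urlretrieve().

4. urllib.quote(url) and urllib.quote_plus(url)

Percent-encode a string so that it can be used safely inside a URL: the result is printable and accepted by web servers. quote() leaves / unescaped by default, while quote_plus() escapes it as well and replaces spaces with plus signs.

>>> urllib.quote('http://www.baidu.com')
'http%3A//www.baidu.com'
>>> urllib.quote_plus('http://www.baidu.com')
'http%3A%2F%2Fwww.baidu.com'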

5. urllib.unquote(url) and urllib.unquote_plus(url)

The inverse of the functions in item 4.
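In Python 3 these four functions moved to urllib.parse. The sketch below repeats the quoting examples there and adds a round trip; the string "a b&c=d" is made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote("http://www.baidu.com"))        # http%3A//www.baidu.com
print(quote_plus("http://www.baidu.com"))   # http%3A%2F%2Fwww.baidu.com

encoded = quote_plus("a b&c=d")             # space becomes +, & and = are escaped
print(encoded)                              # a+b%26c%3Dd
print(unquote_plus(encoded))                # a b&c=d
print(unquote("a%20b"))                     # a b
```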

6. urllib.urlencode(query)

Encodes a mapping of key-value pairs into a query string, joining the pairs with &.

This can be combined with urlopen to implement POST and GET requests:

GET:

>>> import urllib
>>> params=urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> params
'eggs=2&bacon=0&spam=1'
>>> f=urllib.urlopen("http://python.org/query?%s" % params)
>>> print f.read()

POST:

>>> import urllib
>>> params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> f=urllib.urlopen("http://python.org/query",params)
>>> f.read()
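For comparison, here is a minimal Python 3 sketch of the same GET and POST preparation with urllib.parse.urlencode. The requests are only constructed, not sent, so the example runs offline. Passing either Request to urllib.request.urlopen() would perform the actual call.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

params = urlencode({"spam": 1, "eggs": 2, "bacon": 0})
print(params)                                   # spam=1&eggs=2&bacon=0

# GET: the query string is appended to the URL
get_req = Request("http://python.org/query?%s" % params)

# POST: the same string is sent as the body, which must be bytes in Python 3
post_req = Request("http://python.org/query", data=params.encode("ascii"))

print(get_req.get_method(), post_req.get_method())   # GET POST
print(parse_qs(params))                         # decodes the string back into a dict of lists
```

Unlike the Python 2 session above, the key order is stable here because dictionaries preserve insertion order in Python 3.7 and later.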
