requests库+xpath+lxml使用 python中requests库+xpath+lxml简单使用

时间:2021-04-29 顾辞嘤嘤怪人气:0

想了解python中requests库+xpath+lxml简单使用的相关内容吗，顾辞嘤嘤怪在本文为您仔细讲解requests库+xpath+lxml使用的相关知识和一些Code实例，欢迎阅读和指正，我们先划重点：requests库,xpath,lxml使用,requests,,xpath,lxml，下面大家一起来学习吧。

python的requests

它是python的一个第三方库，处理URL比urllib这个库要方便的多，并且功能也很丰富。
【可以先看4，5表格形式的说明，再看前面的】

安装

直接用pip安装，anconda是自带这个库的。

pip install requests

简单使用

requests的文档

1.简单访问一个url：

import requests
url='http://www.baidu.com'
res = requests.get(url)
res.text
res.status_code

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
<meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title>
</head> 
<body link=#0000cc> 
<div id=wrapper> 
<div id=head> 
<div class=head_wrapper> 
<div class=s_form> 
<div class=s_form_wrapper>
 <div id=lg> 
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> 
</div>
 <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> 
<input type=hidden name=ie value=utf-8> 
<input type=hidden name=f value=8> 
<input type=hidden name=rsv_bp value=1>
 <input type=hidden name=rsv_idx value=1> 
<input type=hidden name=tn value=baidu>
<span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn"></span>
 </form>
 </div>
 </div>
 <div id=u1> 
<a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a>
 <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> 
<a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a> 
<a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> 
<a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> 
<noscript> 
<a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a> </noscript>
 <script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === " rel="external nofollow"  rel="external nofollow"  rel="external nofollow" " ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');</script> 
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a> 
</div> 
</div> 
</div> 
<div id=ftCon> 
<div id=ftConw> 
<p id=lh> 
<a href=http://home.baidu.com>å³äºŽç™¾åº¦</a>
 <a href=http://ir.baidu.com>About Baidu</a>
 </p> 
<p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿
è¯»</a>  
<a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a> äº¬ICPè¯030173å·  <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

200

乱码的，是由于没有转换字符，可以加入res.encoding='utf-8'解决，200是状态码。一般状态码是2xx都没什么问题的。

1xx：web服务器正确接收到请求了
2xx：处理成功，比如200表示正常，请求完成；204表示正常无响应等
3xx：重定向
4xx：客户端出现错误，比如著名的404找不到
5xx：服务器出现错误，比如500的内部错误

res.encoding='utf-8'
print(res.text)

<!DOCTYPE html>
<!--STATUS OK-->
<html> 
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>百度一下，你就知道</title>
</head> 
<body link=#0000cc>
 <div id=wrapper>
 <div id=head> 
<div class=head_wrapper> 
<div class=s_form>
 <div class=s_form_wrapper> 
<div id=lg>
 <img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> 
</div> 
<form id=form name=f action=//www.baidu.com/s class=fm>
 <input type=hidden name=bdorz_come value=1> 
<input type=hidden name=ie value=utf-8>
 <input type=hidden name=f value=8> 
<input type=hidden name=rsv_bp value=1>
 <input type=hidden name=rsv_idx value=1> 
<input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=w
d class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> 
</form>
 </div> 
</div>
 <div id=u1> 
<a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> 
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> 
<a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> 
<a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> 
<noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript>
 <script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === " rel="external nofollow"  rel="external nofollow"  rel="external nofollow" " ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
</script>
 <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> 
</div> 
</div>
 </div> 
<div id=ftCon>
 <div id=ftConw>
 <p id=lh> 
<a href=http://home.baidu.com>关于百度</a> 
<a href=http://ir.baidu.com>About Baidu</a> 
</p> 
<p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a>  
<a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号  <img src=//www.baidu.com/img/gs.gif> 
</p> 
</div>
 </div> 
</div> 
</body>
 </html>

主要的点

（1）.用get请求得到的数据是一个response对象，用response.text属性来查看。
（2）.修改编码形式用response.encoding='utf-8/gbk/...'，encoding是它的一个属性可以查看response.encoding

res.encoding
>>>:
>'utf-8'

（3）.无论响应是文本还是二进制内容，我们都可以用content属性获得bytes对象：

import requests
url='http://www.baidu.com'
res = requests.get(url)
print(res.content)
print("----------")
print(res.text)
print("----------")
print(type(res))

<!DOCTYPE html>\r\n<!--STATUS OK-->
<html> 
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === " rel="external nofollow" " ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>  <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a> \xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7  <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
----------
<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title><
/head> <body link=#0000cc> <div id=wrapper> 
<div id=head>
<div class=head_wrapper> <div class=s_form> 
<div class=s_form_wrapper> 
<div id=lg> 
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.jpg width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> 
<input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8>
<input type=hidden name=rsv_bp value=1> 
<input type=hidden name=rsv_idx value=1>
<input type=hidden name=tn value=baidu>
<span class="bg s_ipt_wr">
<input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr">
<input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn">
</span>
</form>
</div> 
</div>
<div id=u1> 
<a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> 
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> 
<a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a> 
<a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> 
<noscript> 
<a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a>
</noscript>
<script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === " rel="external nofollow"  rel="external nofollow"  rel="external nofollow" " ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');
</script> 
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a> 
</div>
</div>
</div> 
<div id=ftCon>
<div id=ftConw> 
<p id=lh> 
<a href=http://home.baidu.com>å³äºŽç™¾åº¦</a> 
<a href=http://ir.baidu.com>About Baidu</a>
</p>
<p id=cp>©2017 Baidu 
<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿è¯»</a> 
<a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>
 äº¬ICPè¯030173å·  <img src=//www.baidu.com/img/gs.gif> 
</p> 
</div>
</div>
</div> 
</body> 
</html>
<class 'requests.models.Response'>

（4）.status_code属性来查看该请求返回的状态码

2.带参数访问url

（1）.带http 的头去访问可以传入参数：headers={'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'} ，不至于很快就被判断惩恶爬虫，把你的IP给封了。
（2）.Cookie

# 获得指定cookie
r.cookies['cookie_name']
# 传入cookie 用dict来传递
cs = {'token':'密码','status':'状态'}
res = requests.get(url, cookies='cs')

3）.指定超时

res = requests.get(url, timeout=3) #3秒后超时

注意:一般用get方法就可以爬取一些比较简单容易的网站。

4.requests的一些常用方法和主要参数

方法	说明
requests.request()	构造一个请求，用于以下各种方法的处理
requests.get()	获取HTML网页的主要方法，对应于HTTP的GET
requests.head()	获取HTML网页头信息的方法，对应于HTTP的HEAD
requests.post()	向HTML提交POST请求的方法，对应于HTTP的POST
requests.put()	向HTML提交PUT请求的方法，对应于HTTP的PUT
requests.patch()	向HTML提交局部修改请求的方法，对应于HTTP的PATCH
requests.delete()	向HTML提交删除请求的方法，对应于HTTP的DELETE

requests.get()方法的参数：
格式：requests.get(url, params=None, **kwargs) 最前面介绍的几个常用的掌握就够用了。

#url:要访问的url地址
# params:url中的额外参数，可选的，字典或者字典或字节形式传递
# **kwargs:控制访问的参数，可选
## headers，timeout，cookies，data，json，proxies，allow_redirects，stream，veriftty，cert，files，auth

5.requests.Response对象的属性说明

属性	说明
res.status_code	HTTP请求返回的状态码，200表示连接成功，404表示失败
res.text	HTTP响应内容的字符串形式，即url对应的页面内容
res.encoding	从HTTP header中猜测的响应内容的编码形式，乱码可以修改防止乱码
res.content	从内容中分析出的响应内容的编码方式，备用
res.apparent_encoding	HTTP响应内容的二进制形式

xpath简介

述

Xpath是一门在xml文档中查找信息的语言。Xpath可用来在xml文档中对元素和属性进行遍历。由于html的层次结构与xml的层次结构天然一致，所以使用Xpath也能够进行html元素的定位。

定位方法 1.绝对路径定位：

顾名思义，将Xpath表达式从html的最外层节点，逐层填写，最后定位到操作元素，一般浏览器插件出来都是绝对定位
类似：/html/body/div[1]/div[2]/div[5]/div[1]/div[1]/form/span[2]/input

2.相对路径定位

通过相对路径定位元素，提取的是元素的部分特征，只要提取恰当，能够保证版本间稳定，是进行自动化测试的首选。
类似：//div[@class='e']/a/p/span/text() @后面是属性，最后的text()提取标签之间的文本数据

3.索引号定位

类似：/html/body/div[1]/div[2]/div[5]/div[1]/div[1]/form/span[last()-1]/input 表示form下倒数第二个span

4.属性定位

类似：//*[@id=“kw” and @name=‘wd'] 表示 id 属性为 kw 且 name 属性为 wd

5.其它定位方法

还要别的定位方法，不常用，不介绍

lxml简介

导入lxml 的 etree 库

from lxml import etree

简单使用

（1）.利用etree.HTML，将html字符串（bytes类型或str类型）转化为Element对象，Element对象具有xpath的方法，返回结果列表。

html = etree.HTML(text) 
ret_list = html.xpath("xpath语法规则字符串")

（2）.xpath方法返回列表的三种情况

返回空列表：根据xpath语法规则字符串，没有定位到任何元素
返回由字符串构成的列表：xpath字符串规则匹配的一定是文本内容或某属性的值
返回由Element对象构成的列表：xpath规则字符串匹配的是标签，列表中的Element对象可以继续进行xpath

注意：

（1）.lxml.etree.HTML(html_str)可以自动补全标签

（2）.lxml.etree.tostring函数可以将转换为Element对象再转换回html字符串

（3）.爬虫如果使用lxml来提取数据，应该以lxml.etree.tostring的返回结果作为提取数据的依据

实例：爬取51.job的大数据职业信息的第一页【requests+xpath】

分析：打开首页，搜索大数据，定位是兰州，F12调式查看，爬取工作名称和公司名就好了

51.job大数据页面地点定位兰州

位置

job_name

cname

import requests
from lxml import etree
url = "https://search.51job.com/list/270200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99°reefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
res = requests.get(url,headers=header)
res.encoding = "gbk"
#print(res.text)
data = etree.HTML(res.text)#加载成html树
job_name = data.xpath("//div[@class='e']/a/p/span/text()")
cname = data.xpath("/html/body/div[2]/div[3]/div/div[2]/div[4]/div[1]/div/div[2]/a/@title")

加载全部内容