How to find the text line of an XML tag in Python using bs4 or lxml? (python xml findall)

25-03-09 12

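This article explains how to find the text line of an XML tag in Python using bs4 or lxml, and discusses related python xml findall questions. It also covers why lxml cannot parse XML (is the other encoding utf-8?) [python], installing lxml for python 3.6 and notes on using etree, Python lxml usage, and why Python lxml cannot get all the text.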

Contents:

How to find the text line of an XML tag in Python using bs4 or lxml? (python xml findall)
lxml cannot parse XML (is the other encoding utf-8?) [python]
python 3.6: installing lxml and notes on using etree
Python lxml usage
Python lxml cannot get all the text

How to find the text line of an XML tag in Python using bs4 or lxml? (python xml findall)

For BeautifulSoup, the line number is stored in the sourceline attribute of the Tag class, which the parser populates while the document is being parsed.
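
As a minimal sketch of the BeautifulSoup side, the snippet below reads sourceline from a tag. It assumes a reasonably recent bs4 and the built-in html.parser backend; the sample document is only illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative document; note the string starts with a newline,
# so <!DOCTYPE html> sits on line 2 of the source.
html = """
<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# sourceline is the line of the opening tag in the original input
print(h1.name, h1.sourceline)
```

Not every backend records positions, so sourceline may be None with a different parser.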

For lxml, this is also available through the sourceline attribute. Here is an example:

#!/usr/bin/python3
from lxml import etree
xml = '''
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
'''
root = etree.fromstring(xml)

for e in root.iter():
    print(e.tag,e.sourceline)

Output:

a 2
b 3
c 4
d 7

If you want to see how the sourceline property is implemented, it calls xmlGetLineNo, which is a binding to xmlGetLineNo in libxml2, which in turn is a wrapper around xmlGetLineNoInternal (where the actual logic lives in libxml2).

You can also find the line number of the closing tag by counting how many line endings appear in the text representation of that tag's subtree.

xml.etree.ElementTree can also be extended to provide the line number at which the parser found an element (the parser is xmlparser in the module xml.parsers.expat).
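
One possible way to do this with only the standard library is sketched below. Rather than subclassing ElementTree's own parser, it drives an ET.TreeBuilder from an expat parser and records CurrentLineNumber for each start tag. parse_with_lines is a hypothetical helper name, and namespaces are not handled.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_with_lines(xml_text):
    """Return (root, lines), where lines is a list of (element, start_line) pairs."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()
    lines = []

    def start(name, attrs):
        # TreeBuilder.start returns the element it has just opened
        element = builder.start(name, attrs)
        lines.append((element, parser.CurrentLineNumber))

    parser.StartElementHandler = start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return builder.close(), lines


xml = """
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
"""
root, lines = parse_with_lines(xml)
for element, line in lines:
    print(element.tag, line)
```

The line numbers are kept in a separate list because Element objects may not accept arbitrary new attributes.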

Another approach is to use the enumerate() function.

For example, suppose we have the following HTML:

html = """
<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>"""

We want to find the line number of the <h1> tag (<h1>My Heading</h1>).

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Drop empty lines so that they are not part of the line count
lines = (x for x in str(soup).splitlines() if x != "")

for index, value in enumerate(lines, start=1):
    # Specify the tag you want to find.
    # str.find returns the index of the first match (-1 if absent),
    # so test for membership instead of comparing the result to 1.
    if "<h1" in value:
        print(f"Line: {index}.\t Found: '{value}' ")
        break

Output:

Line: 4.     Found: '<h1>My Heading</h1>'

lxml cannot parse XML (is the other encoding utf-8?) [python]

My code:

import re
import requests
from lxml import etree

url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'

r = requests.get(url)

items = r.json()['items']

Without encode('utf-8'):

etree.fromstring(items[0]) outputs:

ValueError                                
Traceback (most recent call last)
<ipython-input-69-cb8697498318> in <module>()
----> 1 etree.fromstring(items[0])

lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

With encode('utf-8'):

etree.fromstring(items[0].encode('utf-8')) outputs:

  File "<string>", line unknown
XMLSyntaxError: CData section not finished
鎶楀啺鎶㈤櫓鎹锋姤:闃冲寳I绾挎, line 1, column 281

I don't know how to parse this XML.

Solution:

As a workaround, you can remove the encoding attribute before passing the string to etree.fromstring:

xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)

Update after seeing @Lea's comment on the question:

Specify a parser with an explicit encoding:

xml = r.json()['items'][0].encode('utf-8')
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
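
The difference between str and bytes input can be reproduced without a network request. The sketch below uses a hypothetical stand-in for items[0] (the real response is not reproduced here) and does not cover the separate CData error from the question, which depends on the actual data.

```python
from lxml import etree

# Hypothetical stand-in for items[0]: a str that carries an encoding declaration
xml_text = '<?xml version="1.0" encoding="utf-8"?><item><title>hello</title></item>'

try:
    etree.fromstring(xml_text)
    str_result = "parsed"
except ValueError as exc:
    # lxml may reject str input that declares its own encoding
    str_result = f"ValueError: {exc}"
print(str_result)

# Bytes input lets the parser apply the declared encoding itself
root = etree.fromstring(xml_text.encode("utf-8"))
title = root.findtext("title")
print(title)
```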

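python 3.6: installing lxml and notes on using etree

As far as I could tell, with Python 3.5 and later, importing etree from lxml could fail. (etree has not been removed from lxml; it is a compiled extension module, so the import typically fails when the installed build does not match the Python version or platform.) How can this be resolved?

This concerns using etree from the lxml module. Some lxml builds did not let me import etree, so I had to find a way to get it. My Python version is 3.6, and installing lxml with pip gave me version 3.8.0 by default; importing etree in my program then failed.

I later found a workaround: install the lxml build that matches your Python version. For Python 3.6, for example, I installed lxml-3.7.3-cp36-cp36m-win_amd64.whl. First find this package on the official site, copy it to the relevant directory, then install it with pip. My install command was: pip install lxml-3.7.3-cp36-cp36m-win_amd64.whl

After that, etree could be used.

With Python 3.6.4, installing lxml 4.1.0 also allows etree to be imported: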

pip install lxml==4.1.0
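
After installing, a quick check along these lines can confirm that etree imports and show which versions are in use. This is only a sketch; the values printed depend on the local installation.

```python
import sys

from lxml import etree  # fails here if the installed build is unusable

# Version tuples exposed by lxml.etree
print("Python:", sys.version_info[:3])
print("lxml:", etree.LXML_VERSION)
print("libxml2:", etree.LIBXML_VERSION)
```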

Python lxml usage

lxml is the most feature-rich and easy-to-use Python library for processing XML and HTML.

from lxml import etree
from lxml import html



h = '''
<html>
  <head>
    <meta name="content-type" content="text/html; charset=utf-8" />
    <title>友情链接查询 - 站长工具</title>
    <!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
    <meta name="Keywords" content="友情链接查询" />
    <meta name="Description" content="友情链接查询" />
  </head>
  <body>
    <h1>Top News</h1>
    <p>World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>

    <a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年发展基金会</a>
    <a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王国</a>
    <a href="http://www.4399.com/flash/35538.htm" target="_blank">奥拉星</a>
    <a href="http://game.3533.com/game/" target="_blank">手机游戏</a>
    <a href="http://game.3533.com/tupian/" target="_blank">手机壁纸</a>
    <a href="http://www.4399.com/" target="_blank">4399小游戏</a>
    <a href="http://www.91wan.com/" target="_blank">91wan游戏</a>
  </body>
</html>
'''
# First approach: parse with etree.HTML
page = etree.HTML(h)
hrefs = page.xpath('//a')
# hrefs = page.cssselect('a')  # alternative; needs the cssselect package
for href in hrefs:
    print(href.attrib)

# Second approach: parse with lxml.html
def parse_from():
    tree = html.fromstring(h)
    # tree.xpath('//a') selects the same elements without cssselect
    for a in tree.cssselect('a'):
        print(a.text)
        print(a.attrib)


parse_from()

Python lxml cannot get all the text

You can use the element's .itertext() method:

from lxml.etree import HTML

text = '<p>FIRST PART<a href="THE LINK" target="_blank">LINK TEXT</a>SECOND PART</p>'
parsed = HTML(text)

parent = parsed.xpath('//a/parent::*')[0]
text = list(parent.itertext())
print(text[0])
print(text[-1])

Prints:

FIRST PART
SECOND PART
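
The reason a plain .text lookup misses part of the content is that lxml stores text following a child element in that child's .tail. As a minimal sketch using the same sample markup, the snippet below shows where each piece lives and how itertext() can be joined into one string:

```python
from lxml.etree import HTML

markup = '<p>FIRST PART<a href="THE LINK" target="_blank">LINK TEXT</a>SECOND PART</p>'
parsed = HTML(markup)

p = parsed.xpath('//p')[0]
a = p.find('a')

print(p.text)   # text before the first child of <p>
print(a.text)   # text inside <a>
print(a.tail)   # text after </a>, stored on the <a> element

# itertext() walks all of these pieces in document order
full_text = ''.join(p.itertext())
print(full_text)
```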

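That concludes this article on how to find the text line of an XML tag in Python using bs4 or lxml and on python xml findall, along with the related topics: lxml cannot parse XML (is the other encoding utf-8?) [python], installing lxml for python 3.6 and notes on using etree, Python lxml usage, and Python lxml cannot get all the text.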