BeautifulSoup: getting CSS classes from HTML (getting HTML values)

25-03-20 9

This post covers how to get CSS classes from HTML with BeautifulSoup (getting HTML values). It also touches on several related questions: getting src links from HTML, using recursion to find the children or longest path of HTML tags, the "ImportError: No module named BeautifulSoup" error that appears even though BeautifulSoup is installed, and getting all text while keeping link HTML.

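Contents:

BeautifulSoup: getting CSS classes from HTML (getting HTML values)
BeautifulSoup: getting src links from HTML
BeautifulSoup with recursion: getting the children / longest path of HTML tags
BeautifulSoup is installed but ImportError: No module named BeautifulSoup still appears
BeautifulSoup: getting all text but keeping link HTML?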

BeautifulSoup: getting CSS classes from HTML (getting HTML values)

Is there a way to get the CSS classes from an HTML file with BeautifulSoup? Example snippet:

<style type="text/css"> p.c3 {text-align: justify} p.c2 {text-align: left} p.c1 {text-align: center}</style>

The ideal output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'},
}

Although this would also be fine:

L = [
    ('p.c3', {'text-align': 'justify'}),
    ('p.c2', {'text-align': 'left'}),
    ('p.c1', {'text-align': 'center'}),
]

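Answer 1

BeautifulSoup itself does not parse CSS style declarations at all, but you can extract those sections and then parse them with a dedicated CSS parser.

Depending on your needs, several CSS parsers are available for Python. I would pick cssutils (requires Python 2.5 or later, including Python 3): it is the most complete in terms of support, and it handles inline styles as well.

Other options are css-py and tinycss.

To grab and parse all style sections (example using cssutils):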

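import cssutils

sheets = []
# `tree` is the BeautifulSoup object for the parsed page
for styletag in tree.find_all('style', type='text/css'):
    if not styletag.string:  # probably an external sheet
        continue
    # parseString parses a whole stylesheet; parseStyle only handles a single declaration block
    sheets.append(cssutils.parseString(styletag.string))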

With cssutils you can then combine these sheets, resolve imports, and even have it fetch external stylesheets.
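To make the target `cssdict` shape concrete, here is a minimal sketch that pulls the `<style>` text out with BeautifulSoup and splits it with a small regular expression. This is a simplified stand-in, not cssutils and not a real CSS parser: it assumes flat rules only (no comments, `@media` blocks, or nested braces). The names `RULE_RE`, `SAMPLE_HTML`, and `css_rules_to_dict` are made up for this illustration.

```python
import re

from bs4 import BeautifulSoup

# Matches "selector { declarations }" for flat rules only (assumption:
# no comments, no @-rules, no nested braces).
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")

SAMPLE_HTML = (
    '<style type="text/css"> p.c3 {text-align: justify} '
    "p.c2 {text-align: left} p.c1 {text-align: center}</style>"
)


def css_rules_to_dict(html):
    """Map each selector found in <style> tags to its declarations."""
    soup = BeautifulSoup(html, "html.parser")
    cssdict = {}
    for styletag in soup.find_all("style"):
        css = styletag.string
        if not css:  # empty tag, nothing to parse
            continue
        for selector, body in RULE_RE.findall(css):
            declarations = {}
            for declaration in body.split(";"):
                if ":" in declaration:
                    name, value = declaration.split(":", 1)
                    declarations[name.strip()] = value.strip()
            cssdict[selector.strip()] = declarations
    return cssdict


print(css_rules_to_dict(SAMPLE_HTML))
```

For real-world stylesheets a dedicated parser such as cssutils remains the safer choice; the regex here only illustrates the extraction step and the resulting data shape.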

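BeautifulSoup: getting src links from HTML

I am building a small web crawler with Python 3.5.1 and the requests module that downloads all the comics from a particular website. I am experimenting with a single page for now. I parse the page with BeautifulSoup4 as follows: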

import webbrowser
import sys
import requests
import re
import bs4

res = requests.get('http://mangapark.me/manga/berserk/s5/c342')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for link in soup.find_all("a", class_="img-link"):
    if link:
        print(link)
    else:
        print('ERROR')

When I do this, print(link) shows the correct part of the HTML that I am interested in, but when I try to get only the link in src with link.get('src'), it just prints None.

I tried getting the link with:

img = soup.find("img")["src"]

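That works, but I want all of the src links, not just the first one. I have very little experience with BeautifulSoup. Please point out what is going on. Thanks.

A sample of the HTML I am interested in from the website: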

<a class="img-link" href="#img2">
    <img id="img-1" rel="1" i="1" e="0" z="1"
         title="Berserk ch.342 page 1" src="http://2.p.mpcdn.net/352582/687224/1.jpg"
         width="960" _width="818" _heighth="1189"/>
</a>

Answer 1

I would use a CSS selector to do this in one go:

for img in soup.select("a.img-link img[src]"):
    print(img["src"])

Here we select every img element that has a src attribute and sits under an a element with the img-link class. It prints:

http://2.p.mpcdn.net/352582/687224/1.jpg
http://2.p.mpcdn.net/352582/687224/2.jpg
http://2.p.mpcdn.net/352582/687224/3.jpg
http://2.p.mpcdn.net/352582/687224/4.jpg
...
http://2.p.mpcdn.net/352582/687224/20.jpg

If you still want to use find_all(), you have to nest the calls:

for link in soup.find_all("a", class_="img-link"):
    for img in link.find_all("img", src=True):  # searching for img with src attribute
        print(img["src"])
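The reason link.get('src') returned None is that src lives on the nested img element, not on the a element itself. The sketch below shows this on a small inline document and collects the links into a list instead of printing them. The markup in `SAMPLE_HTML` and the function name `image_sources` are invented for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the sample in the question.
SAMPLE_HTML = """
<a class="img-link" href="#img2"><img id="img-1" src="http://2.p.mpcdn.net/352582/687224/1.jpg"/></a>
<a class="img-link" href="#img3"><img id="img-2" src="http://2.p.mpcdn.net/352582/687224/2.jpg"/></a>
<a class="other" href="#x"><img src="http://example.invalid/ignored.jpg"/></a>
"""


def image_sources(html):
    """Return the src of every <img> nested inside an <a class="img-link">."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.select("a.img-link img[src]")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_link = soup.find("a", class_="img-link")
# The <a> tag has href and class, but no src attribute of its own.
print(first_link.get("src"))
# The nested <img> is where src is defined.
print(first_link.find("img")["src"])
print(image_sources(SAMPLE_HTML))
```

Returning a list keeps the extraction separate from whatever comes next, such as downloading each image.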

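BeautifulSoup with recursion: getting the children / longest path of HTML tags

You can use recursion together with a generator. The HTML can be traversed by iterating over soup.contents and incrementing a counter at each level: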

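from bs4 import BeautifulSoup as soup, NavigableString as ns

def get_paths(d, p=(), c=0):
    # Keep child tags only; text nodes (NavigableString) are skipped
    if not (k := [i for i in getattr(d, 'contents', []) if not isinstance(i, ns)]):
        # Leaf tag: yield its depth and the full path from the root
        yield (c, ' > '.join([*p, d.name]))
    else:
        for i in k:
            yield from get_paths(i, p=[*p, d.name], c=c + 1)

# HTML is the markup string to analyse; max() picks the deepest leaf
_, path = max(get_paths(soup(HTML, 'html.parser').html), key=lambda x: x[0])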

Output:

'html > body > div > p > span'
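The snippet above depends on an `HTML` variable that is not shown. As a self-contained illustration of the same idea, the sketch below uses a made-up document (`SAMPLE_HTML`) and a plain recursive function (`deepest_path`, also a hypothetical name) that returns the longest chain of tag names. It is a variant without the generator, not the original answer's code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical input; the original answer's HTML string is not shown.
SAMPLE_HTML = (
    "<html><body>"
    "<div><p><span>deep</span></p></div>"
    "<h1>shallow</h1>"
    "</body></html>"
)


def deepest_path(tag):
    """Return the tag names along the longest root-to-leaf path under `tag`."""
    best = []
    for child in tag.children:
        if isinstance(child, Tag):  # ignore text nodes
            candidate = deepest_path(child)
            if len(candidate) > len(best):
                best = candidate
    return [tag.name] + best


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(" > ".join(deepest_path(soup.html)))
```

When two branches have the same depth, this version keeps the first one it encounters.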

BeautifulSoup is installed but ImportError: No module named BeautifulSoup still appears

How do I resolve "ImportError: No module named BeautifulSoup" when BeautifulSoup is already installed?

I installed BeautifulSoup successfully, and it is the latest version. But I still get "ImportError: No module named BeautifulSoup" when I run my code. Please help!

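Solution

BeautifulSoup 4 is imported from the bs4 package rather than from a module named BeautifulSoup. Try: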

from bs4 import BeautifulSoup
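If the import still fails, it can help to check which interpreter is running and which module names it can actually see. The following is a small diagnostic sketch using only the standard library; the helper names `module_available` and `describe_bs_install` are invented for this example.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate module `name`."""
    return importlib.util.find_spec(name) is not None


def describe_bs_install():
    """Report the interpreter path and whether each module name is importable."""
    lines = [f"interpreter: {sys.executable}"]
    # "bs4" is the BeautifulSoup 4 package; "BeautifulSoup" was the
    # module name used by the older BeautifulSoup 3 release.
    for name in ("bs4", "BeautifulSoup"):
        status = "found" if module_available(name) else "not found"
        lines.append(f"{name}: {status}")
    return lines


print("\n".join(describe_bs_install()))
```

If bs4 is reported as not found, the package was probably installed for a different Python interpreter than the one printed on the first line.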

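BeautifulSoup: getting all text but keeping link HTML?

I have to convert a large archive of extremely messy HTML, full of extraneous tables, spans, and inline styles, into markdown.

I am trying to use BeautifulSoup for this task. My goal is essentially the output of the get_text() function, except that anchor tags are kept with their href intact.

For example, I want to convert: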

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank">Google</a></span>
</td>

into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>

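My approach so far has been to simply grab all the tags and unwrap them if they are not anchors, but this causes the text to be repeated many times, because soup.find_all(True) returns recursively nested tags as separate elements: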

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank">Google</a></span></td>'

soup = BeautifulSoup(example_html,'lxml')
tags = soup.find_all(True)

for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'],tag.get_text()))
    else:
        print(tag.get_text())

As the parser moves down the tree, it returns multiple fragments/duplicates:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>
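No answer to this question is included here. One possible approach, offered only as a sketch, is to modify the tree in place instead of printing each tag: turn br tags into newlines, unwrap every tag that is not an anchor, and strip anchors down to their href. The function name `text_with_links` is hypothetical, and html.parser is used instead of lxml so that no html/body wrapper is added. Note that adjacent inline elements are joined without a space, so "Hello" and "World" come out as "HelloWorld" rather than the "Hello World" the question asks for.

```python
from bs4 import BeautifulSoup

EXAMPLE_HTML = (
    "<td><font><span>Hello</span><span>World</span></font><br>"
    "<span>Foo Bar <span>Baz</span></span><br>"
    '<span>Example Link: <a href="https://google.com" target="_blank">'
    "Google</a></span></td>"
)


def text_with_links(html):
    """Flatten markup to text, keeping only <a href="..."> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Line breaks become real newlines.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # find_all returns a list up front, so editing the tree while
    # looping over it is safe here.
    for tag in soup.find_all(True):
        if tag.name == "a":
            href = tag.get("href")
            # Drop every attribute except href (target, style, ...).
            tag.attrs = {"href": href} if href else {}
        else:
            # Remove the tag itself but keep its children in place.
            tag.unwrap()
    return str(soup)


print(text_with_links(EXAMPLE_HTML))
```

Because each tag is unwrapped once rather than printed along with all of its descendants, every piece of text appears only a single time in the result.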
