用户登录
用户注册

分享至

如何让 BeautifulSoup 4 尊重自闭标签?

  • 作者: 丶InMyHeart
  • 来源: 51数据库
  • 2022-10-28

问题描述

这个问题是针对 BeautifulSoup4 的问题,这使得它不同于以前的问题:

This question is specific to BeautifulSoup4, which makes it different from the previous questions:

BeautifulSoup 为什么要修改我的自闭合元素?

BeautifulSoup 中的 selfClosingTags

由于 BeautifulStoneSoup 已经消失(之前的 xml 解析器),我怎样才能让 bs4 尊重一个新的自闭合标签?例如:

Since BeautifulStoneSoup is gone (the previous xml parser), how can I get bs4 to respect a new self-closing tag? For example:

import bs4   
S = '''<foo> <bar a="3"/> </foo>'''
soup = bs4.BeautifulSoup(S, selfClosingTags=['bar'])

print soup.prettify()

不会自动关闭 bar 标签,但会给出提示.bs4 所指的这个树生成器是什么以及如何自我关闭标签?

Does not self-close the bar tag, but gives a hint. What is this tree builder that bs4 is referring to and how to I self-close the tag?

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
  "BS4 does not respect the selfClosingTags argument to the "
<html>
 <body>
  <foo>
   <bar a="3">
   </bar>
  </foo>
 </body>
</html>

推荐答案

解析你传入的XMLxml"作为 BeautifulSoup 构造函数的第二个参数.

soup = bs4.BeautifulSoup(S, 'xml')

您需要安装 lxml.

您不再需要传递 selfClosingTags:

In [1]: import bs4
In [2]: S = '''<foo> <bar a="3"/> </foo>'''
In [3]: soup = bs4.BeautifulSoup(S, 'xml')
In [4]: print soup.prettify()
<?xml version="1.0" encoding="utf-8"?>
<foo>
 <bar a="3"/>
</foo>
软件
前端设计
程序设计
Java相关