Convert character entities to their Unicode equivalents

Source: 51数据库
2023-02-13

Question

I have HTML-encoded strings in a database, and many of the character entities go beyond the standard &amp; and &lt;, for example &ldquo; and &mdash;. Unfortunately, we need to feed this data into a Flash-based RSS reader, and Flash doesn't read these entities, although it does read the numeric Unicode equivalents (e.g. &#8220;).

Using .NET 4.0, is there a utility method that will convert an HTML-encoded string so that it uses Unicode-encoded (numeric) character entities instead?

Here is a better example of what I need. The database has HTML strings like:

John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.

and what I need to output inside the <description> tag of the RSS/XML document is:

John &#38; Sarah went to see &#8220;Scream 4&#8221;.

I'm using an XmlTextWriter to create the XML document from the database records, similar to this example code: http://www.dotnettutorials.com/tutorials/advanced/rss-feed-asp-net-csharp.aspx

So I need to replace all of the character entities within the HTML string from the database with their Unicode equivalents, because the Flash-based RSS reader doesn't recognize any entities beyond the most common ones, such as &amp;.

Recommended answer

My first thought is: can your RSS reader accept the actual characters? If so, you can run the string through HtmlDecode and feed the result directly in.
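As a rough illustration of that first option, here is a minimal sketch. It assumes the feed is written in a Unicode encoding such as UTF-8 and that the reader copes with literal characters. The class and method names are made up for the example, and it uses WebUtility.HtmlDecode (System.Net, available since .NET 4.0) rather than HttpUtility so that no System.Web reference is needed.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

public static class DecodedDescriptionSketch
{
    // Hypothetical helper: decodes the stored HTML and writes it as a <description> element.
    public static string BuildDescription(string htmlEncoded)
    {
        // Turns &ldquo;, &mdash;, &amp; and so on into the characters themselves.
        string decoded = WebUtility.HtmlDecode(htmlEncoded);

        using (var text = new StringWriter())
        using (var writer = new XmlTextWriter(text))
        {
            // WriteElementString escapes markup characters such as '&' and '<' on output,
            // so the decoded '&' is written back out as &amp;.
            writer.WriteElementString("description", decoded);
            writer.Flush();
            return text.ToString();
        }
    }
}
```

With this approach the curly quotes end up in the feed as literal characters, and only the characters that XML itself reserves are escaped.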

If you do need to convert it to the numeric representation, you could parse out each entity, HtmlDecode it, and then cast the resulting character to an int to get its base-10 Unicode value. Then re-insert that value into the string as a numeric entity.

Here's some code to demonstrate what I mean (it is untested, but it gets the idea across):

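using System.Text;
using System.Web; // HttpUtility lives in System.Web.dll

string input = "Something with &mdash; or other character entities.";
StringBuilder output = new StringBuilder(input.Length);

for (int i = 0; i < input.Length; i++)
{
    // An entity runs from '&' to the next ';'.
    int endOfEntity = input[i] == '&' ? input.IndexOf(';', i) : -1;
    if (endOfEntity != -1)
    {
        // Include the trailing ';' so that HtmlDecode recognizes the entity.
        string entity = input.Substring(i, endOfEntity - i + 1);
        string decoded = HttpUtility.HtmlDecode(entity);
        // A real entity decodes to one code point (one char or a surrogate pair).
        if (decoded.Length == 1 || (decoded.Length == 2 && char.IsSurrogatePair(decoded, 0)))
        {
            int unicodeNumber = decoded.Length == 1 ? decoded[0] : char.ConvertToUtf32(decoded, 0);
            output.Append("&#" + unicodeNumber + ";");
            i = endOfEntity; // continue parsing after the end of the entity
            continue;
        }
    }
    output.Append(input[i]); // ordinary character, or a '&' that starts no entity
}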

I may have an off-by-one error somewhere in there, but it should be close.
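The same idea can also be sketched with a regular expression instead of a hand-written loop, which sidesteps the index arithmetic. This is only a sketch under a few assumptions: EntityConverter and ToNumericEntities are names invented for the example, the pattern only looks at well-formed references ending in ';', and WebUtility.HtmlDecode (System.Net) stands in for HttpUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityConverter
{
    // Matches named (&mdash;), decimal (&#8212;) and hexadecimal (&#x2014;) references.
    private static readonly Regex EntityPattern =
        new Regex(@"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Hypothetical helper: rewrites every recognized entity as a decimal reference.
    public static string ToNumericEntities(string html)
    {
        return EntityPattern.Replace(html, match =>
        {
            string decoded = WebUtility.HtmlDecode(match.Value);

            // Unknown names come back unchanged; leave those exactly as they were.
            if (decoded == match.Value) return match.Value;

            // Only rewrite results that are a single code point
            // (one char, or a surrogate pair for characters outside the BMP).
            if (decoded.Length == 1) return "&#" + (int)decoded[0] + ";";
            if (decoded.Length == 2 && char.IsSurrogatePair(decoded, 0))
                return "&#" + char.ConvertToUtf32(decoded, 0) + ";";
            return match.Value;
        });
    }
}
```

For the string from the question, EntityConverter.ToNumericEntities("John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.") should give "John &#38; Sarah went to see &#8220;Scream 4&#8221;.". One thing to watch when writing the result with an XmlTextWriter: WriteString and WriteElementString escape '&' again, so &#8220; would come out as &amp;#8220;. WriteRaw writes the text without escaping, at the cost of no well-formedness checking, so it is only safe if the stored HTML contains no stray markup.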