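Using WordPress Functions to Reduce HTML in Posts

There's a debate over whether you should use HTML classes in content. That is, classes that are tied directly to how the content is presented. Sometimes using them is unavoidable. A highlighted paragraph, a pull quote, a slideshow in the middle of a post… you need classes to style these things and add functionality.

While you sometimes need them, in my opinion, the fewer classes you write in the actual post content, the better.

Why avoid writing HTML with classes in content?

The main reason is that these HTML classes are fragile, because they are tied to your current theme. In the next redesign, those classes may change or require a different structure. Or at the very least, over time some classes get forgotten, new ones spring up, duplicates appear, and the whole thing ends up a mess.

Changing HTML in templates is easy, because one template is responsible for many pages. But changing HTML inside content is hard. Posts are independent of one another, there may be hundreds or even thousands of them, and you may need to update them by hand, one post at a time.

But I need those HTML classes!

Don't worry, WordPress is flexible enough to let us generate HTML and insert it in the right place. Your content stays pure. No more fragile HTML. Because it stays pure, you can easily transform the post content to fit your presentation needs.

All of these transformations can be done in code. The next time you update your design, you update the transformation functions so they generate the right HTML. Just like with templates, you make the update in one place and it immediately affects all content.

No more updating posts by hand.

Strategies for adjusting content

Of all the tools WordPress offers, we'll use

Shortcodes
The the_content filter

I'll briefly explain how both of them work and give some practical examples of using them.

Shortcodes

Shortcodes let you define a macro that expands to something of your choosing. They look a lot like HTML tags: they can wrap content and accept attributes. For example, you could put this in your post content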

[my-shortcode foo="bar"]Hello, World![/my-shortcode]

and then write code that turns it into

<aside data-foo="bar"><h3>Hello, World!</h3></aside>

and still be able to change that output at any time.

WordPress has extensive documentation on shortcodes, but here's a simple example.

function css_tricks_example_shortcode( $attrs, $content = null ) {

    extract( shortcode_atts( array(
        'twitter' => ''
    ), $attrs ) );

    $twitterURL = 'https://twitter.com/' . $twitter;

    return <<<HTML
<p>This post has been written by $content. Follow him on <a href="$twitterURL">Twitter</a></p>
HTML;

}
add_shortcode( 'author', 'css_tricks_example_shortcode' );

This is a contrived example, but if you include the code above in your `functions.php` file, you can create a post with this content

[author twitter="MisterJack"]Alessandro Vendruscolo[/author]

and it will render this HTML

<p>This post has been written by Alessandro Vendruscolo. Follow him on <a href="https://twitter.com/MisterJack">Twitter</a></p>

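Filters

WordPress has many filters available. A filter is a function that gets a chance to transform something before it is returned to whatever requested it. Filters are mostly used by plugins, and they are a big part of why WordPress is so customizable.

The filter we'll use is the_content, which has a page in the WordPress Codex.

Here's a basic example of how to use it.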

function css_tricks_example_the_content( $content ) {
    global $post;
    $title = get_the_title( $post );
    $site = get_bloginfo( 'name' );
    // get_the_author() reads the current post; its only parameter is deprecated
    $author = get_the_author();
    return $content . '<p>The post ' . $title . ' on ' . $site . ' is by ' . $author . '.</p>';
}
add_filter( 'the_content', 'css_tricks_example_the_content' );

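This appends text to the end of the post, which is useful when RSS scrapers republish your content.

Getting the most out of the_content

The documentation for the the_content filter offers examples like the one above, so let's do something different. After looking at the technique involved, we'll walk through some practical examples.

Say you've written clean posts and you transform them on the client side with JavaScript. That's a very common scenario. Say you write in Markdown and use triple-backtick fenced code blocks. They get converted to HTML like this…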

<pre><code lang="js">
</code></pre>

But suppose your syntax highlighting library requires code blocks to look like this

<pre><code class="language-javascript">
</code></pre>

You might be doing something like this…

$("code[lang='js']")
  .removeAttr("lang")
  .addClass("language-javascript");

// then do other languages

// then run syntax highlighter

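That works, but it requires a lot of DOM processing on every page load. It would be better to fix the HTML before it's sent to the browser. We'll cover a way to solve this in the examples below.

Combined with an HTML (strictly speaking, XML) parser such as libxml, we can move those DOM transformations back to the server and take the load off the browser. Cutting down the amount of JavaScript needed on the front end is certainly a worthy goal.

libxml has PHP bindings, which are usually available in a standard installation. You need to make sure your server has PHP > 5.4 and libxml > 2.6. You can check by looking at the output of phpinfo() or by using the command line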

php -v
php -i | grep libxml

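If your server doesn't meet these requirements, you should ask your system administrator to update the required packages.

Parsing the post

The filter we add will receive the raw HTML of the post and return the transformed content.

We'll use the DOMDocument class to load and transform the HTML. We'll use the loadHTML instance method to parse the post and saveHTML to serialize the transformed document back to a string.

There's a catch: this class automatically adds a doctype declaration and also wraps the content in <html> and <body> tags. That's because libxml is designed to parse complete pages, not a fragment of a page as we're doing.

One possible solution is to set some flags when loading the HTML, but that's not perfect either. When loading the HTML, libxml expects to find a single root element, but a post can have several (typically, you have lots of paragraphs). In that case, libxml throws some errors.

The better solution I came up with is to subclass DOMDocument and override the saveHTML function to strip out those html and body tags. When loading the HTML, I don't set the LIBXML_HTML_NOIMPLIED flag, so it doesn't throw any errors.

It's not an ideal solution, but it gets the job done.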

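class MSDOMDocument extends DOMDocument {
    // Serialize as usual, then strip the <html> and <body> wrapper that
    // libxml adds around our fragment. The attribute below silences the
    // return type deprecation notice on PHP 8.1+ and is a comment before 8.0.
    #[\ReturnTypeWillChange]
    public function saveHTML( $node = null ) {
        $string = parent::saveHTML( $node );

        return str_replace( array( '<html><body>', '</body></html>' ), '', $string );
    }
}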

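Now we need to use MSDOMDocument instead of DOMDocument in our filter function. If you're going to create several filters, it's a good idea to parse the post only once and pass the MSDOMDocument instance to each filter. When all the transformations are done, we get the HTML string.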

function css_tricks_example_the_content( $content ) {

    // First encode all characters to their HTML entities
    // (the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
    $encoded = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );

    // Load the content, suppressing warnings (libxml complains about not
    // having a single root element, because we have many paragraphs)
    $html = new MSDOMDocument();
    $ok = @$html->loadHTML( $encoded, LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS );

    // If it didn't parse the HTML correctly, do not proceed. Return the original, untransformed, post
    if ( !$ok ) {
        return $content;
    }

    // Pass the document to all filters
    css_tricks_content_filter_1( $html );
    css_tricks_content_filter_2( $html );

    // Filtering is done. Serialize the transformed post
    return $html->saveHTML();

}
add_filter( 'the_content', 'css_tricks_example_the_content' );
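
The two filters called in that function aren't shown, so here is a minimal sketch of the shape one might take and of the full round trip. Everything named `example_…` or `Example…` is made up for illustration, including the `lead` class. `ExampleFragmentDocument` simply repeats the subclass idea so the block runs on its own, outside WordPress.

```php
<?php
// Same idea as the subclass above, under a placeholder name
class ExampleFragmentDocument extends DOMDocument {
    #[\ReturnTypeWillChange]
    public function saveHTML( $node = null ) {
        $string = parent::saveHTML( $node );

        return str_replace( array( '<html><body>', '</body></html>' ), '', $string );
    }
}

// Hypothetical filter: it receives the parsed document, changes it in
// place, and returns nothing
function example_filter_mark_first_paragraph( DOMDocument $html ) {
    $first = $html->getElementsByTagName( 'p' )->item( 0 );
    if ( $first ) {
        $first->setAttribute( 'class', 'lead' );
    }
}

// Parse once, run the filter, serialize once
$html = new ExampleFragmentDocument();
$html->loadHTML( '<p>First paragraph</p><p>Second paragraph</p>', LIBXML_HTML_NODEFDTD );
example_filter_mark_first_paragraph( $html );
$output = trim( $html->saveHTML() );
```

Because each filter works on the same document object, adding another transformation only means adding another function call between parsing and serializing.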

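Examples of changing content

We've seen that we can use shortcodes and libxml to reduce the amount of HTML we have to put directly in posts. It may be hard to picture what results we can get, so let's look at some practical examples.

Many of the following examples come from the production version of MacStories. Others are ideas from Chris that could easily be added to CSS Tricks (or are already in use).

Pull quotes

Your site may have pull quotes. The ideal HTML might look something like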

<p>lorem ipsum dolor…</p>
<div class='pull-quote-wrapper'>
  <blockquote class='pull-quote-content'>This is the content of the pull quote</blockquote>
  <span class='pull-quote-author'>Author</span>
</div>
<p>lorem ipsum and the rest of the post</p>

To achieve something like this, I'd suggest a shortcode

function css_tricks_pull_quote_shortcode( $attrs, $content = null ) {

    extract( shortcode_atts( array(
        'author' => ''
    ), $attrs ) );

    $authorHTML = $author !== '' ? "<span class='pull-quote-author'>$author</span>" : '';

    return <<<HTML
<div class='pull-quote-wrapper'>
  <blockquote class='pull-quote-content'>$content</blockquote>
  $authorHTML
</div>
HTML;

}
add_shortcode( 'pullquote', 'css_tricks_pull_quote_shortcode' );

and then use it in your post like this

lorem ipsum dolor…

[pullquote author="Mr. Awesome"]This is the content of the pull quote[/pullquote]

lorem ipsum and the rest of the post

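The author is optional, and the function that handles it omits it from the HTML entirely when it isn't set.

There are many advantages to doing it this way

If you need different HTML or different HTML classes, you update the function's output in one place.
If you want to drop pull quotes entirely, you can return an empty string from the function.
If you want to add a feature (click-to-tweet, for example), you update the function's output.

Twitter/Instagram embeds

In my opinion, one of the best features of WordPress is auto-embedding. Say you want to insert external content into a post: chances are you only need to put the URL on its own line and you're done. No more hunting for the right embed code. Best of all, you don't have to keep it up to date.

This works through oEmbed, and WordPress publishes lists of the supported providers.

WordPress has hooks for customizing these embeds. If you want to wrap embedded content in a div, you can do the following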

function macstories_wrap_embeds ( $return, $url, $attr ) {
    return <<<HTML
        <div class='media-wrapper'>$return</div>
HTML;
}
add_filter( 'embed_handler_html', 'macstories_wrap_embeds', 10, 3 );

function macstories_wrap_oembeds ( $cache, $url, $attr, $id ) {
    return <<<HTML
        <div class='media-wrapper'>$cache</div>
HTML;
}
add_filter( 'embed_oembed_html', 'macstories_wrap_oembeds', 10, 4 );

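Syntax highlighting

You can process code blocks on the server to add line numbers to each line. That way you can just drop your code straight into pre and code blocks.

This is done with the the_content filter and libxml

Find all code blocks
Get all the lines by splitting on newlines
Wrap each line in a span
Apply CSS

The handler also changes the class as the syntax highlighter requires (as described in the earlier example).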

function css_tricks_code_blocks_add_line_numbers( $html ) {

    // Iterating a nodelist while manipulating it is not a good thing, because
    // the nodelist dynamically updates itself. Get all code elements and put
    // only the ones that are direct children of pre element in an array
    $codeBlocks = array();
    $nodes = $html->getElementsByTagName( 'code' );
    foreach ( $nodes as $node ) {
        if ( $node->parentNode->nodeName == 'pre' ) {
            $codeBlocks[] = $node;
        }
    }

    foreach ( $codeBlocks as $code ) {

        // Fix HTML classes
        $lang = $code->getAttribute( 'lang' );
        $code->removeAttribute( 'lang' );
        if ( $lang === 'js' ) {
            $code->setAttribute( 'class', 'language-javascript' );
        }
        // Probably add some more `else if` blocks...

        // Get the actual code snippet
        $snippet = $code->textContent;

        // Split in lines
        $lines = explode("\n", $snippet);

        // Remove all code
        $code->nodeValue = '';

        // Each line must be wrapped in its own element. Encode entities to be
        // sure that libxml doesn't complain
        foreach ( $lines as $line ) {
            $wrapper = $html->createElement('span');
            $wrapper->setAttribute( 'class', 'code-line' );

            // Create a text node, to have full escaping support
            $textNode = $html->createTextNode( $line . "\n" );

            // Add the text to span
            $wrapper->appendChild( $textNode );

            // Add the span to code
            $code->appendChild( $wrapper );
        }

        // Jetpack adds a newline at the end of the code block, which leaves a
        // trailing span holding nothing but whitespace. Remove that
        if ( $code->lastChild && trim( $code->lastChild->textContent ) === '' ) {
            $code->removeChild( $code->lastChild );
        }

    }

}

You can use CSS counters to generate the numbers

.code-line {
    display: block;
    counter-increment: line-number;

    &::before {
        content: counter(line-number);
        display: inline-block;
        width: 30px;
        margin-right: 10px;
    }
}

As a real-world example from MacStories, we can write Markdown like this

```js
// This is a JS code block
var string = "hello";
var what = "world";
var unusedVar = 3;
alert(string + " " + what); // Actually do something
```

It gets processed into HTML, then sent through this filter, and ends up like this

<pre><code class='language-javascript'><span class='code-line'>// This is a JS code block</span>
<span class='code-line'>var string = "hello";</span>
<span class='code-line'>var what = "world";</span>
<span class='code-line'>var unusedVar = 3;</span>
<span class='code-line'>alert(string + " " + what); // Actually do something</span></code></pre>

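In the browser, that markup renders with line numbers alongside our syntax highlighting.

Rewriting URLs

When we switched MacStories to HTTPS, we ran into mixed content warnings. Old posts linked to images hosted on Rackspace over HTTP. Ouch.

Fortunately, Rackspace also serves content over HTTPS, but with a slightly different URL.

We decided to add a filter to change those URLs. Editors would link images with the HTTPS URL, but this filter catches HTTP URLs inserted by mistake. Goodbye, mixed content warnings.

This is done by adding a the_content filter and running a regular expression replacement.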

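function macstories_rackspace_http_to_https( $content ) {
    // Rewrite http://<container>.rNN.cf1.rackcdn.com/ to the HTTPS host
    // https://<container>.ssl.cf1.rackcdn.com/
    return preg_replace(
        '/http:\/\/([a-z0-9]+-[a-z0-9]+\.)r[0-9]{1,2}(\.cf1\.rackcdn\.com\/)/i',
        'https://$1ssl$2',
        $content
    );
}
add_filter( 'the_content', 'macstories_rackspace_http_to_https' );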

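You can do something similar to move image links onto a CDN: if your image URLs follow a well-defined pattern (so that you don't change URLs that aren't images), a similar approach works. Otherwise, it's better to parse the HTML and change the src attribute of the images.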
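
As a rough sketch of that second option, the function below walks the parsed document and rewrites only the src of img elements, so a link pointing at the same file is left alone. The function name and both host names are placeholders I made up for the example, not real MacStories URLs.

```php
<?php
// Hypothetical helper: point <img> tags at a CDN host by editing the DOM.
// $fromHost and $toHost are placeholder host names supplied by the caller.
function example_cdnize_images( DOMDocument $html, $fromHost, $toHost ) {
    foreach ( $html->getElementsByTagName( 'img' ) as $img ) {
        $src = $img->getAttribute( 'src' );

        // Only touch images served from the host we want to replace
        if ( parse_url( $src, PHP_URL_HOST ) !== $fromHost ) {
            continue;
        }

        // Swap scheme and host, keep everything after the host untouched
        $prefixLength = strpos( $src, $fromHost ) + strlen( $fromHost );
        $img->setAttribute( 'src', 'https://' . $toHost . substr( $src, $prefixLength ) );
    }
}

$html = new DOMDocument();
$html->loadHTML(
    '<p><img src="http://img.example.com/a.png"> ' .
    '<a href="http://img.example.com/a.png">link</a></p>'
);
example_cdnize_images( $html, 'img.example.com', 'cdn.example.com' );
$output = $html->saveHTML();
```

Only attributes change here, so iterating the live node list directly is safe. In a real setup this would be one of the functions that receive the shared document inside the the_content filter.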

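Adding IDs to headings

Giving every heading an id attribute lets you link to specific sections (for example, when you have a table of contents or want to share a link to a specific section).

If you write in HTML, you can add them by hand. But that's tedious. If you write in Markdown, you have to make sure your Markdown processor adds them (Jetpack doesn't). Either way, adding them by hand adds clutter to your content.

You can automate this with libxml in a the_content filter

Find all headings
Generate a slug
Set that slug as the id attribute

Here's the filter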

function css_tricks_add_id_to_headings( $html ) {

    // Store all headings of the post in an array
    $tagNames = array( 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
    $headings = array();
    $headingContents = array();
    foreach ( $tagNames as $tagName ) {
        $nodes = $html->getElementsByTagName( $tagName );
        foreach ( $nodes as $node ) {
            $headings[] = $node;
            $headingContents[ $node->textContent ] = 0;
        }
    }

    foreach ( $headings as $heading ) {

        $title = $heading->textContent;

        if ( $title === '' ) {
            continue;
        }

        $count = ++$headingContents[ $title ];

        $suffix = $count > 1 ? "-$count" : '';

        $slug = sanitize_title( $title );
        $heading->setAttribute( 'id', $slug . $suffix );
    }

}

This filter also prevents duplicate ids from being generated.
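
One gap worth noting: the filter counts duplicates by heading text, so two headings with different text that reduce to the same slug can still end up sharing an id. Here is a minimal sketch of a variant that counts the generated slugs instead. `example_slugify` is a simplified stand-in for `sanitize_title` so the block runs outside WordPress, and the other names are likewise made up.

```php
<?php
// Simplified stand-in for WordPress's sanitize_title()
function example_slugify( $text ) {
    $slug = strtolower( trim( $text ) );
    $slug = preg_replace( '/[^a-z0-9]+/', '-', $slug );

    return trim( $slug, '-' );
}

// Hypothetical variant: count generated slugs rather than heading texts
function example_unique_heading_ids( DOMDocument $html ) {
    $seen  = array();
    $xpath = new DOMXPath( $html );

    foreach ( $xpath->query( '//h1|//h2|//h3|//h4|//h5|//h6' ) as $heading ) {
        $slug = example_slugify( $heading->textContent );
        if ( $slug === '' ) {
            continue;
        }

        // The first use keeps the bare slug, later uses get -2, -3, ...
        $seen[ $slug ] = isset( $seen[ $slug ] ) ? $seen[ $slug ] + 1 : 1;
        $suffix = $seen[ $slug ] > 1 ? '-' . $seen[ $slug ] : '';
        $heading->setAttribute( 'id', $slug . $suffix );
    }
}

$html = new DOMDocument();
$html->loadHTML( '<h2>Setup</h2><p>text</p><h3>Setup!</h3><h2>Usage</h2>' );
example_unique_heading_ids( $html );
$output = $html->saveHTML();
```

With this input, "Setup" and "Setup!" both slug to `setup`, and the second one gets the `-2` suffix.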

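Removing wrapping paragraphs

If auto-embedding is my favorite WordPress feature, automatic paragraph wrapping is the one I hate most. The problem is well known.

Removing them with regular expressions works, but regular expressions aren't well suited to handling HTML tags. We can use libxml to remove the wrapping paragraph from images and other elements (such as picture, video, audio, and iframe).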

function css_tricks_content_remove_wrapping_p( $html ) {

    // Iterating a nodelist while manipulating it is not a good thing, because
    // the nodelist dynamically updates itself. Get all things that must be
    // unwrapped and put them in an array.
    $tagNames = array( 'img', 'picture', 'video', 'audio', 'iframe' );
    $mediaElements = array();
    foreach ( $tagNames as $tagName ) {
        $nodes = $html->getElementsByTagName( $tagName );
        foreach ( $nodes as $node ) {
            $mediaElements[] = $node;
        }
    }

    foreach ( $mediaElements as $element ) {

        // Get a reference to the parent paragraph that may have been added by
        // WordPress. It might be the direct parent node or the grandparent
        // (LOL) in case of links
        $paragraph = null;

        // Get a reference to the image itself or to the link containing the
        // image, so we can later remove the wrapping paragraph
        $theElement = null;

        if ( $element->parentNode->nodeName == 'p' ) {
            $paragraph = $element->parentNode;
            $theElement = $element;
        } else if ( $element->parentNode->nodeName == 'a' &&
                $element->parentNode->parentNode->nodeName == 'p' ) {
            $paragraph = $element->parentNode->parentNode;
            $theElement = $element->parentNode;
        }

        // Make sure the wrapping paragraph only contains this child
        if ( $paragraph && $paragraph->textContent == '' ) {
            $paragraph->parentNode->replaceChild( $theElement, $paragraph );
        }
    }

}

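Adding `rel=noopener`

Recently we became aware of a security issue related to opening links in new tabs.

Adding the rel=noopener attribute solves the problem, but it's not something editors should have to remember. It also doesn't play well with Markdown, because you'd have to write the link in plain HTML.

libxml can help us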

function css_tricks_rel_noopener( $html ) {

    $nodes = $html->getElementsByTagName( 'a' );
    foreach ( $nodes as $node ) {
        $node->setAttribute( 'rel', 'noopener' );
    }

}
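
The function above sets rel on every link and replaces any value that was already there. If that is too broad for your site, a narrower variant is possible. The sketch below is an assumption on my part about what such a variant could look like, with a made-up function name: it only touches links that have target="_blank" and keeps existing rel tokens.

```php
<?php
// Hypothetical variant: only touch links that open a new tab, and keep
// whatever rel values the link already has
function example_rel_noopener_for_blank_targets( DOMDocument $html ) {
    foreach ( $html->getElementsByTagName( 'a' ) as $link ) {
        if ( strtolower( $link->getAttribute( 'target' ) ) !== '_blank' ) {
            continue;
        }

        // rel is a space-separated list of tokens
        $tokens = preg_split( '/\s+/', $link->getAttribute( 'rel' ), -1, PREG_SPLIT_NO_EMPTY );
        if ( ! in_array( 'noopener', array_map( 'strtolower', $tokens ), true ) ) {
            $tokens[] = 'noopener';
        }
        $link->setAttribute( 'rel', implode( ' ', $tokens ) );
    }
}

$html = new DOMDocument();
$html->loadHTML(
    '<p><a href="/a">same tab</a> ' .
    '<a href="/b" target="_blank">new tab</a> ' .
    '<a href="/c" target="_blank" rel="nofollow">new tab, has rel</a></p>'
);
example_rel_noopener_for_blank_targets( $html );
$output = $html->saveHTML();
```

The first link is left untouched, the second gains rel="noopener", and the third ends up with rel="nofollow noopener".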

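Caveats

I've been using the techniques described above since the launch of MacStories 4 and haven't run into any major problems. Writers can focus on writing great content. All presentation-related transformations and generation live in code and can easily be ported to a new version or updated for a new design. That's a huge win. I no longer need to create a legacy-theme.css file to style or patch old (and bad) decisions.

With content filters, you can do almost anything you want. With shortcodes, you need to be careful not to create overly specific shortcodes that look just like the raw HTML you used to write. For example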

[bad-shortcode align="left" color="blue" font="georgia"]…[/bad-shortcode]

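In the future, some of those attributes may no longer make sense, so you need to decide which attributes seem appropriate and abstract enough to use forever. Still, even a bad shortcode is better than no content abstraction at all.

In the end: do what you think is right, and think twice before implementing it. Always ask yourself, "Will I still need this when the next design goes live?"

# July 5, 2016

I've never seen <<<HTML or HTML in PHP code before, and a long search online turned up nothing. Could you explain that code for us?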

Alessandro Vendruscolo

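# July 5, 2016

That's the Heredoc way of building strings. They behave the same as strings created with double quotes. It's covered in the official PHP documentation.

@tricksTroop: Maybe you haven't considered it, but a lot of the code wrapped in code boxes on the site produces a horizontal scrollbar at the bottom of the box. That's not a problem for short snippets, but when the box extends past the bottom of the screen, it makes the code example almost unusable. You have to scroll to the bottom to scroll right in order to read the code, which by then is completely off screen, so you have to scroll back up to read what you wanted to see, over and over.

Not fun. Could you come up with a way to pin the horizontal scrollbar to the bottom of the code box so it stays available as you scroll vertically? Alternatively, I've seen others solve it by simply letting the code overflow the box to the right when the cursor is inside it.

There has to be a better way (??)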

Chris Coyier

# July 5, 2016
I bet
```
pre { max-height: 100vh; } 
```
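would probably fix it.

I'm going to bury this thread, though, since it's unrelated to the content here. If you want to talk about this kind of thing, feel free to reach us through the contact page.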

Steve Lombardi

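# July 6, 2016

I've run into a situation where I need to show live, running JavaScript components (Angular JS, for example the Angular-ui accordion) as examples, followed by a code block.

In effect, the code appears on the page twice: the running version and the sample code (as on many sites), although my company wants all the code on our site rather than in embedded Pens.

I tried something similar to the approach in the article, though not as advanced. I can use a shortcode to get highlighted blocks like yours, but that seems to require working in Visual mode. The running code, however, needs to be inserted as raw HTML, and switching to Visual mode wipes out that HTML.

I can get one or the other working, but not both at the same time. Any ideas?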

Alessandro Vendruscolo

# July 7, 2016

I'd use a shortcode that outputs the code twice: once wrapped in `pre` and `code` tags so users can read it, and once wrapped in a `script` tag for the live example.

wefwef

# July 8, 2016

function example() {
    element.innerHTML = "code";
}