Reader View in Firefox and Safari for Multiple Articles on the Same Page

If you use Firefox or Safari, you may have noticed the Reader View presentation mode. When switching to Reader View, the article contents of the page are presented cleanly, minus the rest of the page, such as links, sidebars, header, comments, and so forth. Reader View is more obvious in Firefox, as an icon shows up in the browser bar when it is available for a page. In Safari it is buried in the View menu. In both cases it is inconsistently available. I'd go so far as to say that it is something of a mystery to the casual user as to why it is available for some pages and not for others, and that mystery survives deeper inspection until you delve all the way down into the nuts and bolts of the source code - which is available in the case of Firefox and the Chrome plugins based on it, but less so for Safari. There is very little documentation out there on how to structure pages so that they will be available in Reader View.

I looked into this in a very casual way for my sites following a reader query, and was quickly mystified. Reader View doesn't work for many CNN articles. It does work for most Washington Post articles. It doesn't seem to matter whether or not there is rich content on the page in the form of embedded video. Reader View won't work on saved web pages reloaded from file (which seems a strange design choice to me). It does work for almost all individual WordPress blog posts. Reader View is never presented as an option for news site home pages, which is fair enough since they don't tend to contain a lot of article text. However, it fairly reliably fails to show up for WordPress home pages that do contain the full text of multiple posts. Yet it does work for some such home pages. On the whole, the first impression of Reader View from an experimenter's point of view is that some form of complicated and flawed black magic is going on under the hood.

The Challenge of Reader View for Multiple Articles on the Same Page

I mention WordPress home pages that present multiple full text articles as that is the Reader View use case that isn't working for me. I have an example of a site that fails, my site, and an example of a site that works, a fortuitous discovery from the experimentation stage. The major difference between the two is that my site uses semantic HTML5 elements such as <article> and the working WordPress site does not. It didn't take a great deal of further experimentation to find out that the presence of multiple <article> elements was in fact the blocking issue. On replacing <article> with <div>, Reader View started working. Obviously this isn't a very satisfactory solution, as removing the <article> elements will probably produce a variety of other, unwanted effects in the ecosystem of tools, search engines, and readers. In theory we should all aim to be more semantic rather than less. Further, the likely situation here is that Reader View intentionally does not trigger if it can identify that the browser is rendering something other than a single article page - though I'd argue that it should in fact be available for pages that display the full text of multiple articles.

I should note that the point about multiple <article> elements above is true for my case, but by no means straightforward for all cases. Whether or not the presence of multiple <article> elements disrupts Reader View depends on their nesting, attributes, and contents. For example, if you use <article> to wrap individual post comments, add a class="comment" attribute, and put those comments inside a <section>, then Reader View is perfectly happy with that and will correctly render the main post that is also wrapped as an <article>. Some other potentially ambiguous situations are resolved in favor of one <article> of the many on the page. Yet other equally ambiguous situations fail, and Reader View is not enabled.
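Since nesting and attributes matter, it can help to see at a glance how the <article> elements on a saved page are arranged. The following is a rough diagnostic sketch, not a reproduction of anything Reader View does: listArticles is a hypothetical helper that scans the markup naively with a regular expression and reports each <article> start tag's depth inside other <article> elements along with its class attribute. It ignores every other kind of element and will be fooled by unusual markup.

```javascript
// Diagnostic sketch only: this is not how Reader View decides anything.
// It lists each <article> start tag with its nesting depth among other
// <article> elements and its class attribute, using a naive tag scan.
function listArticles(html) {
  const tagPattern = /<(\/?)article\b([^>]*)>/gi;
  const found = [];
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (match[1] === "/") {
      // Closing tag: step back out one level, never below zero.
      depth = Math.max(0, depth - 1);
    } else {
      // Opening tag: record it, then treat what follows as nested.
      const cls = match[2].match(/\sclass="([^"]*)"/i);
      found.push({ depth: depth, className: cls ? cls[1] : "" });
      depth += 1;
    }
  }
  return found;
}
```

A page with one post and its comments should show a single depth 0 entry followed by depth 1 entries with the comment class, while a home page with several full posts shows several depth 0 siblings.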

Web Developers Need More Control Over Reader View

The whole situation for this feature is problematic from the point of view of the web developers building the sites. We have no control over the activation of Reader View, and the only documentation is the code - and for Safari we don't even have that. The browser developers have taken it upon themselves to guess at all of the use cases, and thereby fail miserably for a good number of them. All of the ambiguity and mystery could be avoided via the use of some kind of optional flags in the HTML, such as a <meta property="reader-view-article-ids" content="article-id-1,..." /> element in the page head coupled with <article id="article-id-1"> to indicate which content should be displayed by the reader.
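To make the proposal concrete, here is a minimal sketch of how a reader implementation might consume such a hint. To be clear, no browser supports this: the reader-view-article-ids property is my own suggestion, and the readerViewArticleIds and escapeRegExp functions are hypothetical names invented for illustration. The sketch works on a raw HTML string with regular expressions purely to stay dependency-free; a real implementation would query the DOM.

```javascript
// Hypothetical: neither Firefox nor Safari reads a "reader-view-article-ids"
// meta element. This only illustrates how the proposed hint could be parsed.

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function readerViewArticleIds(html) {
  // Find the proposed meta element, wherever its attributes are ordered.
  const meta = html.match(/<meta\s[^>]*property="reader-view-article-ids"[^>]*>/i);
  if (!meta) return [];
  const content = meta[0].match(/\scontent="([^"]*)"/i);
  if (!content) return [];
  // Split the comma-separated list and drop empty entries.
  const ids = content[1]
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  // Keep only ids that match an <article id="..."> present in the markup.
  return ids.filter((id) =>
    new RegExp('<article\\s[^>]*id="' + escapeRegExp(id) + '"', "i").test(html)
  );
}
```

Ids listed in the hint but missing from the page are dropped, so a stale hint degrades to showing fewer articles rather than failing outright. With an empty result, a reader would presumably fall back to its existing guesswork.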

Lacking that Control, Can the Issue be Addressed?

Is it possible to find a decent way to keep semantic HTML5 and also force Reader View to display multiple articles on a page? A place to start looking is the _grabArticle method in Readability.js. That is where the page is broken down and various metrics are assessed against its contents to try to figure out which of the many options is the article to display - or whether to give up. Given a set of saved HTML pages to experiment with, it is fairly easy to get set up with Node.js, readability-node, and jsdom in order to tinker with the Readability library and see what might work. Note, though, that there are numerous different versions of this library distributed by various groups and individuals - be sure that you are using the right one. Alternatively, one can use readable-proxy to view pages and Readability output if that is more convenient than tinkering with code.

Interestingly, my testing here showed that the Readability library does in fact work the way I want it to for a blog home page with multiple sibling <article> elements: it displays them all in a suitable way, just the same as it does for a single blog post page. In fact the library is resilient and does a pretty good job at correctly guessing what it should show for a wide variety of the DOM structures that I'd like to see converted into a readable view. Yet in Firefox, a blog home page with multiple sibling <article> elements doesn't show the Reader View icon. That outcome must thus be driven by logic within the surrounding Firefox code that uses the Readability library, or possibly version differences between libraries. The places to start looking at that include toolkit/components/reader and browser/modules/ReaderParent.jsm, though bear in mind that the gecko-dev repository mirror is going to be somewhat ahead of the currently released version.

Firefox handles updates between components by passing around messages. Following the code for the Reader:UpdateReaderButton message that determines whether or not to show the Reader View icon in the toolbar leads to browser/base/content/tab-content.js:

    if (ReaderMode.isProbablyReaderable(content.document)) {
      sendAsyncMessage("Reader:UpdateReaderButton", { isArticle: true });
    } else if (forceNonArticle) {
      sendAsyncMessage("Reader:UpdateReaderButton", { isArticle: false });
    }

Then in browser/modules/ReaderParent.jsm:

      case "Reader:UpdateReaderButton": {
        let browser = message.target;
        if (message.data && message.data.isArticle !== undefined) {
          browser.isArticle = message.data.isArticle;
        }
        this.updateReaderButton(browser);
        break;
      }

Leaving out the impact of a few epicycles and configuration options, the Reader View icon is shown following this.updateReaderButton(browser) if browser.isArticle is true, and that is true if ReaderMode.isProbablyReaderable() returns true. My testing with the Readability library from the Node.js package shows that yes, isProbablyReaderable() does return true for the blog home page I care about, while in Firefox it clearly does not.
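The flow through the two excerpts above can be boiled down to a small standalone model. This is a sketch, not Firefox code: contentSideMessage, parentSideUpdate, simulate, and the readerButtonVisible property are all names made up for illustration, and the assumption that the icon is visible exactly when isArticle is true glosses over the other state and preferences that the real updateReaderButton consults.

```javascript
// Simplified model of the two Firefox excerpts above; not Firefox code.

// Mirrors the branch in tab-content.js: returns the payload that would be
// sent with Reader:UpdateReaderButton, or null when no message is sent.
function contentSideMessage(isProbablyReaderable, forceNonArticle) {
  if (isProbablyReaderable) return { isArticle: true };
  if (forceNonArticle) return { isArticle: false };
  return null;
}

// Mirrors the case in ReaderParent.jsm: the browser's isArticle flag only
// changes when the message carries a defined value.
function parentSideUpdate(browser, data) {
  if (data && data.isArticle !== undefined) {
    browser.isArticle = data.isArticle;
  }
  // Assumption: the icon is shown exactly when isArticle is true.
  browser.readerButtonVisible = browser.isArticle === true;
  return browser;
}

// Runs both halves; with no message, the parent side never runs at all.
function simulate(browser, isProbablyReaderable, forceNonArticle) {
  const data = contentSideMessage(isProbablyReaderable, forceNonArticle);
  if (data !== null) parentSideUpdate(browser, data);
  return browser;
}
```

In this model, the only way for the icon to appear is for the readability check to return true on the content side, which is why a difference in the bundled library version is enough to explain the difference in behavior.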

Next I compared the libraries from readability-node and gecko-dev, at the master branches of both. They are the same. I am working with the released Firefox 50.0.2 rather than the master branch, however. That is a recent release, only a few days old at the time of writing, but it won't contain the latest master branch code. Indeed, if I then compare the versions of the library in the gecko-dev master branch and the gecko-dev release branch, there are substantial differences. The release branch has an earlier version, and one that doesn't do anywhere near as well with pages that contain multiple full text sibling articles.

In Summary, I Should Do Nothing and Wait

Given all of this, it is clear that all I need to do is wait. The master branch works for my use case, and that will percolate through the Mozilla release process over the next few months. To the extent that Safari and the Chrome plugins are all using the same library, and it seems very likely that this is the case based upon the great similarity of their behavior, they also will start to work as their release cycles catch up. If I'd put off this investigation for another few months, I wouldn't have needed to spend any time on it at all. There is a certain value in putting off low-priority tasks when working in an ecosystem characterized by a high rate of change; there is a fair chance that they will become irrelevant one way or another. Still, I think my point about greater developer control over Reader View stands; if you have to inspect the code to figure out how exactly it works, it will remain fairly obscure to most developers. A switch to force it on or off, or to target it to specific elements on the page, would be very helpful.