Saving web pages as PDFs in 2019, a real challenge

Today's web technology doesn't make it easy to export a page as a PDF from a web browser. You'd almost take it for granted that this works seamlessly, but it doesn't.

I regularly save web pages that I find interesting to my machine, for archiving purposes. I want to have valuable content available offline, so I can still read it even if one day it disappears from the web, ends up behind a paywall, or whatever.
Saving pages as HTML is not ideal because a) you get an HTML file plus a folder, which is not very practical if you want to retrieve them later, and b) you never know how that page is going to render in future versions of your browser. So the easiest, pain-free option is to print the page and choose PDF as the output format. Except I came to realise that on a Mac, Firefox cannot always print pages properly and often screws up the content: for example, you may end up with a PDF document that has just the title and maybe some images, followed by a blank page. Chrome is somewhat better because at least it shows a preview, but it can also fail to print an entire article properly. After some very limited testing, Safari seems to generate the best PDF export, though not every time.

On top of that, overlays often mess up the page content. Take the obnoxious cookie banner pop-ups that are now proliferating and infesting web pages worldwide, making them unusable (*). If left open, the overlay stays on top of the exported content, hiding a good 2-3 lines of text.

I am not sure whether it's web developers, QAs, browser vendors or the W3C to blame for that, though I am inclined to say the last: creating custom CSS for print used to be a relatively simple task, and making a page printable was considered best practice. If this is disregarded even on major news websites, I suspect it's getting too difficult to implement. Whatever the reason, what I do want to observe is that after nearly 30 years of the World Wide Web, one of the most basic pieces of functionality in the most commonly used browsers is broken. Of course there are add-ons and hacks, but I am tired of resorting to third-party tools and wasting plenty of time to get the most basic things done.
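To make the print CSS point concrete, here is a minimal Python sketch of how a print-only stylesheet could hide overlays in a page you have already saved as HTML. The selectors are hypothetical placeholders, since every site names its banners differently, and the function names are made up for illustration. It is a sketch of the idea, not a tool that works on any page.

```python
import re

# Hypothetical selectors: real cookie banners use site-specific ids and classes.
OVERLAY_SELECTORS = ("#cookie-banner", ".cookie-consent", ".newsletter-popup")


def build_print_css(selectors):
    """Return a print-only style block that hides the given selectors."""
    rule = ", ".join(selectors) + " { display: none !important; }"
    return '<style media="print">' + rule + "</style>"


def inject_print_css(html, selectors=OVERLAY_SELECTORS):
    """Insert the print stylesheet just before </head>, or prepend it if no head is found."""
    style = build_print_css(selectors)
    # Case-insensitive search for the closing head tag.
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return style + html
    return html[: match.start()] + style + html[match.start():]


if __name__ == "__main__":
    page = (
        "<html><head><title>Demo</title></head><body>"
        "<div id='cookie-banner'>Accept?</div><p>Article</p>"
        "</body></html>"
    )
    print(inject_print_css(page))
```

Because the rule sits in a style block with media="print", the page looks unchanged on screen and the overlay is only dropped when the browser prints it or exports it to PDF.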

As an example for test purposes, here is the URL of a page from the MIT Technology Review website:  
https://www.technologyreview.com/s/614605/sorryorganic-farming-is-actually-worse-for-climate-change

It gets even worse with a Reddit page like this one: https://www.reddit.com/r/musictheory/comments/bklgdd/any_recommended_music_theory_books/

===


(*) Topic for another rant. EU citizens have to say thank you to the bureaucrats in the EU Commission who know nothing about technology, and are happy to have lawyers decide what web content should look like. The result is a mechanism that is appalling in terms of user experience, and addresses the problem it rightly wants to address in the worst possible manner. What's worse, that doesn't seem to have generated any outcry in the design community, and most UX experts seem to be totally fine with that, judging from the silence around this topic.

Practical solutions

Beyond the fact that this is an issue that should be addressed at the root, so that standards are observed more widely and supported by browser technology, here are some solutions that I've found thanks to some people contacting me privately and some comments I've received on Hacker News.

Reader mode (Firefox only, as of the end of 2019)

Many people, including me, might not have noticed that in Firefox you can turn a graphics-heavy page into a much tidier, content-focused version by clicking on the icon at the right edge of the address bar, next to the zoom icon and the bookmarks icon. It doesn't always appear, only if the page code allows it to work nicely. If it does, you can bring up the print dialogue, choose Open in Preview, and then from the File menu save the PDF to the folder you want. I wonder why Firefox doesn't have an easier way to get a preview of what you are printing. Chrome used to have a similar "Distill" feature but apparently it has not been available since the v75 update.

Not sure about Safari. I don't use Safari even though it's the most performant browser on a Mac. I find the user experience mediocre, and I cannot find all the add-ons that I need daily. I used Vivaldi and thought highly of it, but finally gave up for similar reasons.

Add-ons to export as PDF

I value the fact that PDF is a format that offers longevity in terms of support: once a file is generated, it looks the way it looks, and that's not going to change significantly in the future. The content is also searchable, and it is all contained within a simple file that can be easily previewed and sits nicely in the file manager as a single entity. There are add-ons that help make the PDF export tidier; the best ones I've tried are Print Friendly & PDF (Chrome and Firefox) and Printable-The print doctor (Firefox only).

Software to easily generate PDF documents

There's software that makes it easier to generate PDF documents. One of these is doPDF by Softland, a freeware product that I haven't tested myself, but here's the recommendation that was provided to me.

"It installs like a printer. Then you use the Menu > Print command in your browser to "print" the current web page to the doPDF "printer". The result is a PDF file, which should be a faithful representation of the appearance of any web page. doPDF even opens up the file for you automatically, as an option, in your favorite PDF reader (Foxit is a good reader, and can also print Web pages). doPDF has been available for many years, as has Foxit. Both are considerably better in many ways than the standard Acrobat "DC" reader, yet they are compatible".

IPFS distributed file system

IPFS stands for InterPlanetary File System, and it's a protocol to make web content decentralised, safer and faster to get. 2read is a browser add-on compatible with both Firefox and Chrome that lets you clean up the page, similarly to the add-ons mentioned above. The export is cached on a server, but you can also rely on the emerging IPFS technology to "pin" that content locally. I've read a bit about it, and it looks like a very promising technology to me, but it gets too technical for the problem I am trying to solve. For those interested, here are a couple of discussions on Hacker News about this:

Convert article in current tab to readable form and upload it to IPFS

IPFS, The Interplanetary File System, Simply Explained

Skimming through the comments, it seems like the technology is still in its infancy despite significant funding, so it may take a while before it becomes more usable and bug-free. Certainly one to keep an eye on, as what it promises is enticing. At the time of writing, it can only serve static pages. This may change in the future, should the usage of this technology ramp up.

Save as Web ARChive (WARC)

It gets me excited to learn that an open format exists for archiving high-fidelity, dynamic web content. WARC does just that. There is also a relatively simple piece of software for it called Webrecorder. I've tried it out, and while the technology seems quite powerful and effective, the user experience still needs lots of improvement. Beyond that, I just want an easy way to save content from a web page to my computer. This software is something I have to open and copy the URL into, and then it saves the archived content as a collection of obscurely labelled files that can only be managed using the same software, or the leaner version of it (called Webrecorder Player). Too impractical for me, and who knows what's going to happen in the future. On the positive side, you really do get a faithful, interactive copy of what you want to save for later retrieval, and "WARC is now recognised by most national library systems as the standard to follow for web archival."
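For the curious, the files are less obscure than they look. An uncompressed WARC file is a sequence of records, each made of a version line, a few "Name: value" header lines, a blank line, and a payload whose size is given by Content-Length. The sketch below parses a single record using only the Python standard library. The function name and the sample record are made up for illustration, and it deliberately ignores gzip compression and folded header lines, so treat it as a way to peek inside the format rather than a replacement for a proper WARC library.

```python
def parse_warc_record(data):
    """Parse the first WARC record in `data` (bytes).

    Returns (version, headers, body, rest), where `rest` holds any bytes
    that follow the record, for example the next record in the file.
    """
    head, sep, remainder = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header terminator found")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers["content-length"])
    body = remainder[:length]
    # A record is followed by two CRLF pairs before the next one starts.
    rest = remainder[length:].lstrip(b"\r\n")
    return version, headers, body, rest


if __name__ == "__main__":
    payload = b"<html><body>Hello</body></html>"
    # Hand-written sample record; the URI and date are placeholders.
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"WARC-Target-URI: https://example.com/\r\n"
        b"WARC-Date: 2019-11-01T00:00:00Z\r\n"
        b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
        b"\r\n" + payload + b"\r\n\r\n"
    )
    version, headers, body, rest = parse_warc_record(record)
    print(version, headers["warc-target-uri"], len(body))
```

Calling the function again on `rest` walks through the remaining records of an uncompressed file, which is enough to list what an archive contains without opening the original software.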

Polar app

Polar is a document manager that lets you, among other things, capture, annotate and highlight web pages, and save them locally so they are available offline. They seem to use a proprietary format, but according to what they say, contents can be exported to PDF. Check out the documentation page. I find it a very promising product, but unfortunately, after installing the desktop app and trying it out briefly, I had the impression that the product is barely usable and not mature yet. It works as a web application on both Firefox and Chrome. They also have an extension for the latter, but I did not manage to log in because there are some technical issues. Will check again a few months from now.

There's likely a bunch of other apps out there that you might want to look into. Evernote, for example, offers a web clipping extension. I've never used it despite making heavy use of Evernote for archiving purposes: I just don't want the whole content of the pages I am saving to get mixed up with my notes and clutter my search results.

Web bundles

There's a promising technology called Web bundles that lets you share websites as a single file over Bluetooth and run them offline. As interesting as it sounds, this technology has not been widely adopted yet.

Aggregated HTML documents

When exporting a page to PDF is not an option because the markup doesn't allow it, or I want to save it as it is (hoping it won't change too much in the future), I sometimes just save it as HTML. There are tools out there that encapsulate HTML content so it's stored as a single file, rather than an HTML file and an attached folder with all the resources. This is what MHTML does, but compared to the Web bundles mentioned above, it does not enable executable JS. Browser support is also patchy.

The SingleFile add-on (also available as a Chrome extension) also lets you save HTML content as a single file. I haven't tested it because I am not sure how the technology behind it works, and whether it is reliable for future retrieval.
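The general idea of folding a page and its resources into one file can be sketched in a few lines of Python: take the HTML file plus folder that the browser saved, and rewrite each local image reference as a base64 data URI. This is only an illustration of the principle under simple assumptions (double-quoted src attributes, images only, no URL-encoded paths). It is not how MHTML or SingleFile work internally, and the function names are made up.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of an <img> tag, assuming double quotes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path):
    """Encode a local file as a base64 data URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_images(html, base_dir):
    """Replace local <img> sources found under base_dir with data URIs."""
    base_dir = Path(base_dir)

    def replace(match):
        src = match.group(2)
        target = base_dir / src
        # Leave remote, already inlined, or missing resources untouched.
        if "://" in src or src.startswith("data:") or not target.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(target) + match.group(3)

    return IMG_SRC.sub(replace, html)


if __name__ == "__main__":
    # Hypothetical file name: point this at a page saved as "HTML plus folder".
    saved_page = Path("article.html")
    if saved_page.is_file():
        html = saved_page.read_text(encoding="utf-8")
        single = inline_images(html, saved_page.parent)
        Path("article.single.html").write_text(single, encoding="utf-8")
```

The result is larger than the original, since base64 adds roughly a third to each image, but it is one self-contained file. Stylesheets and scripts would need the same treatment, which is where dedicated tools earn their keep.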

Do you have a solution that has not been mentioned here?

Should you have comments or suggestions, please add them to the Hacker News discussion mentioned above, so they remain available for future reference.