Last week, we added functionality to one of our web apps to show just the main content of any webpage, without all the other stuff. You may think of this as creating a printable view of any webpage, with all images, videos, ads, etc. removed. Here is an example of an original webpage vs. the printable view we create:
Feel free to skip straight to our "Meat" Algorithm, as we've so endearingly named it, if you're not interested in the specifics of implementing it.
Thanks to Ruby and a Ruby gem called Nokogiri, it's far easier to create this printable view than you may think. If you haven't heard of it before, Nokogiri is a gem that reads and parses HTML and XML (with both DOM and SAX interfaces), and allows you to easily search and manipulate these documents using CSS selectors and XPath.
I should also note that this approach uses Open-URI to fetch the page, which is included in the Ruby standard library.
Nokogiri is really straightforward and easy to use. In fact, it's so easy, I'm going to show you how to open any webpage and create a print-formatted version of it in just a few lines of code!
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com/some-page')) do |config|
config.noent.noblanks.noerror
end
doc.search("//script","//img","//iframe","//object","//embed","//param","//form","//meta","//link","//title").remove
doc.search("//div","//p","//span","//a","//h1","//h2","//h3","//h4","//h5","//h6","//ul","//ol").attr('class','').attr('id','').attr('style','')
doc = doc.search("//p").collect{ |p| p.parent }.uniq
Ok, so now you may be asking, what exactly is this code doing? Let's take a look, piece-by-piece...
doc = Nokogiri::HTML(open('http://www.example.com/some-page')) do |config|
config.noent.noblanks.noerror
end
This config block for Nokogiri tells the parser to substitute entities with their values (`noent`), strip away blank nodes (`noblanks`), and suppress any errors generated by malformed HTML (`noerror`). You can read more about these configuration options on Nokogiri's site.
doc.search("//script","//img","//iframe","//object","//embed","//param","//form","//meta","//link","//title").remove
doc.search("//div","//p","//span","//a","//h1","//h2","//h3","//h4","//h5","//h6","//ul","//ol").attr('class','').attr('id','').attr('style','')
The first line simply removes all JavaScript, images, iframes, and embedded objects such as videos and Flash. You can modify it to remove pretty much any elements you want from the page.
UPDATE: Since writing this article, we've refined our "algorithm" to strip out more element types. We also added the second line, which overrides the HTML attributes for `class`, `id`, and `style` so that the resulting HTML does not cause unintentional styling conflicts.
doc = doc.search("//p").collect{ |p| p.parent }.uniq
This, believe it or not, is our entire algorithm for determining which part of the webpage to keep: the "meat" of the page.
All we are doing is searching the document for `<p>` paragraph tags. Then we collect the parent `<div>` blocks (or whatever the parent elements of the `<p>` tags happen to be) into an array. However, if there are multiple `<p>` tags in one block (and let's face it, there will be), this creates duplicate parent blocks in the array (one for each `<p>` tag). So, we simply call the `.uniq` method on the array to get rid of the duplicates.
Note the reason we grab the parent element of all paragraph elements (rather than just grabbing all of the paragraph elements themselves) is so that we make sure not to exclude any headings, unordered lists, ordered lists, or any other elements that may be in the body of the page, but not wrapped in paragraph tags.
This is not foolproof, as there could be a webpage that throws semantics and proper markup to the wind, and decides that plain text and `<br />` tags are better than `<p>` tags. This may also exclude other important information elsewhere in the webpage. However, in our limited experience so far, it works very well in at least 95% of scenarios.