Fun With Enumerators: pagination

Summary

Seamlessly enumerate over paged content (such as an API index route) using a simple .each call. Pages are fetched only as needed, and all the other chainable Enumerable methods work as expected.

For example:

data = PaginatedEnumerator.new{|page| JSON.load(`curl api.foo.com?page=#{page}`)}
data.each{|thing| puts thing}


Enumerators and Enumerables

Enumerators and the Enumerable module in Ruby are very richly featured and, in my opinion, among its most appealing design choices (although I do confess some love for Python’s iterators as well). Using these features effectively often accounts for a good deal of the low-hanging fruit when refactoring code for performance and readability (simultaneously!), so they are well worth some exploration. Let’s quickly start at the beginning.

Enumerable is simply a module providing a collection of methods such as map, select, and reduce, all built on top of an each method that the including class defines. Classes that include Enumerable are what you probably think of when you think of enumerating, such as Array and Hash.

Enumerators are the objects that sit behind these methods: call many of them without a block and an Enumerator is what you get back. Ruby abstracts away the details very neatly in the Enumerable method signatures, so you rarely deal with Enumerators directly. But Enumerators are what allow you to chain these methods, as in data.map.with_index{|obj, ii| "#{ii}: #{obj}"}, and they are how we get the awesomeness of lazy enumerators.
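To make that concrete, here is a small sketch of the three behaviors just mentioned: getting an Enumerator back, chaining with with_index, and lazy evaluation. The sample data is made up for illustration.

```ruby
data = %w[apple banana cherry]

# Calling map without a block returns an Enumerator instead of an Array.
enum = data.map
puts enum.class # Enumerator

# That Enumerator can be chained with with_index, which supplies the block.
labeled = data.map.with_index { |obj, ii| "#{ii}: #{obj}" }
p labeled # ["0: apple", "1: banana", "2: cherry"]

# A lazy enumerator only evaluates as many elements as the chain demands,
# so even an infinite range is safe here.
evens = (1..Float::INFINITY).lazy.select(&:even?).map { |n| n * n }.first(3)
p evens # [4, 16, 36]
```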

We’re going to get a little practice leveraging Enumerators by implementing an Enumerable class that abstracts away the details of fetching buffered or paginated content. The use case that inspired building this class, for example, was a routine to loop through and selectively pull media content from various hosts’ API calls that listed albums. For some of the APIs, but not others, these albums were returned in pages. We wanted to build a common structure where the logic for looping over the content itself did not have to be made aware of the pagination; it could just data.each, and the details of how to get the next item were up to the object passed to it.

PaginatedEnumerator

We’re going to build a class called PaginatedEnumerator. Despite some advice to the contrary, we’re going to include Enumerable here, because this is intended to be a very general-use class and I want the users of this class to be able to call select, map, reduce, find, and, yes, even minmax if they so choose.

The user will then be able to use it like so:

data = PaginatedEnumerator.new{...}
data.each{|item| do_something_with(item)}

And that’s it! The key is the block provided to new: this needs to tell our PaginatedEnumerator how to fetch the next page. With that, any calls to each, or minmax, or any other Enumerable method can ignore the details of pagination. So let’s flesh out what that block should look like.

Fetching the page

A good first spec for how the user can tell us how to fetch a page is with a block that takes a page number as an argument. So this might look like

data = PaginatedEnumerator.new{|page| fetch_page(page)}

where fetch_page(page) makes a web request with the appropriate parameters. A key detail here is that fetch_page will be called only when needed, so a loop won’t immediately try to make 100 API calls, or however many are needed to reach the end of the data.

Speaking of the end of the data, let’s add one more detail to our spec: if there is no data left, the block should return an empty Array. That way we know when we’re done. With Ruby’s Enumerators, the pattern we will follow is to raise StopIteration when we observe this condition.
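As a sketch of a block that satisfies this spec, here is a pager backed by an in-memory Array rather than a real API. RECORDS, PAGE_SIZE, and pager are hypothetical names chosen for the illustration; a real pager would make a web request instead.

```ruby
# Hypothetical stand-in for a remote API: 7 records served 3 per page.
RECORDS = (1..7).to_a
PAGE_SIZE = 3

# The pager takes a page number and returns that page's records.
# Array#slice returns nil when the start index is past the end, so `|| []`
# turns "no such page" into the empty Array the spec asks for.
pager = lambda do |page|
  RECORDS.slice(page * PAGE_SIZE, PAGE_SIZE) || []
end

p pager.call(0) # [1, 2, 3]
p pager.call(2) # [7]
p pager.call(3) # []
```

The last call is the important one: once we run past the data, the pager answers with [] rather than nil or an exception.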

Sketching out the class

In pseudocode, the class should look roughly like

class PaginatedEnumerator
  include Enumerable

  def initialize(&pager)
    # store the block, which specifies how to get the next page
    @pager = pager
    @buffer = []
  end

  def each
    loop do
      @buffer += fetch_page(current_page) if need_page?
      raise StopIteration if @buffer.empty?
      yield @buffer.shift
    end
  end
end

Actually, this isn’t too far from the final code. This is why I love Ruby.

In the initialize method, we store the block (called &pager), which should look like what we said above: it takes a page number as an argument, and returns the data at that page. We also initialize a @buffer, which stores the values we’ve already fetched from the API.

OK, but how do we implement fetch_page? This is quite easy with Ruby blocks and procs:

def fetch_page(page)
  @pager.call(page)
end

fetch_page needs to know which page to ask for, so let’s keep an index, @page, that we’ll initialize to 0 by default but allow the user to override.

  def initialize(page: 0, &pager)
    @pager = pager
    @page = page
    @buffer = []
  end

We’ve defined each, as Enumerables are wont to do. Given the Enumerable module, we now get, for free, all of our favorite Enumerable methods! Well, except for one detail: we need to return an Enumerator if there is no block given to each. This is as simple as

def each
  return enum_for(:each) unless block_given?
  ...
end

Returning the Enumerator provided by enum_for allows us to call this method without a block and chain enumerators, pass them around without evaluation, or create lazy enumerators.
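Here is a minimal sketch of that guard in isolation. Countdown is a hypothetical toy class, not part of PaginatedEnumerator; it exists only to show what the enum_for line buys us.

```ruby
# Countdown is a hypothetical toy class, used only to show the enum_for guard.
class Countdown
  include Enumerable

  def initialize(from)
    @from = from
  end

  def each
    # Without a block, hand back an Enumerator that will call each later.
    return enum_for(:each) unless block_given?

    @from.downto(1) { |n| yield n }
  end
end

countdown = Countdown.new(3)

# No block: nothing is evaluated yet, we just get an Enumerator back.
enum = countdown.each
p enum.class # Enumerator

# The Enumerator can be chained...
p countdown.each.with_index.to_a # [[3, 0], [2, 1], [1, 2]]

# ...or stepped through manually with next.
p enum.next # 3
p enum.next # 2
```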

Do the loop

The loop and yield pattern is the heart of enumeration in Ruby. Given a block, such as that in data.each{|x| puts x}, the yield is what executes the code in that block: yield(value) calls puts x with value in the place of x.

But we need a stop condition. As mentioned above, our requirement is that the block given to PaginatedEnumerator.new returns [] when there are no values left. It’s nice to formalize this requirement to isolate our class from the innumerable (see what I did there?) ways the inner code may behave when it’s out of data (nil, [], false, raise…). Thus we leave that knowledge up to the caller. We simply respond to this condition of no more data ([]) by raising StopIteration, which is an Enumerator’s signal to stop looping.
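The reason raising StopIteration works as a stop condition is that Kernel#loop rescues it and simply ends. A small sketch, independent of our class, with made-up variable names:

```ruby
collected = []

# loop rescues StopIteration, so raising it inside the block ends the loop
# quietly instead of propagating an error.
loop do
  collected << collected.size
  raise StopIteration, "No more data" if collected.size == 3
end

p collected # [0, 1, 2]

# The same mechanism ends a loop driven by Enumerator#next,
# which raises StopIteration once the enumerator is exhausted.
enum = %w[a b].each
seen = []
loop { seen << enum.next }
p seen # ["a", "b"]
```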

So that builds out our each method like so:

def each
  return enum_for(:each) unless block_given?

  loop do
    # Fetch the next page if we're out of buffered data
    if @buffer.empty?
      @buffer += fetch_page(@page)
      @page += 1
    end
    # An empty buffer after fetching means the pager returned []: we're done
    raise StopIteration, "No more data" if @buffer.empty?
    yield @buffer.shift
  end
end

Putting this all together, we have a nice, concise, and very useful class:

class PaginatedEnumerator
  include Enumerable

  def initialize(page: 0, &blk)
    raise ArgumentError, "Block required to define how to fetch new records" if blk.nil?
    raise ArgumentError, "Block needs to take exactly one argument (current page)" if blk.arity != 1

    @pager = blk
    @page = page
    @buffer = []
  end

  def each
    return enum_for(:each) unless block_given?

    loop do
      if @buffer.empty?
        @buffer += fetch_page(@page)
        raise StopIteration if @buffer.empty?
        @page += 1
      end
      yield @buffer.shift
    end
  end

  private

  def fetch_page(page)
    @pager.call(page)
  end
end
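One property worth knowing: as written, each keeps its position in @page and @buffer, so a second call to each on the same object continues from wherever the first one stopped rather than starting over. If you would rather be able to enumerate the same object repeatedly, one option is to keep that state in local variables. The following is only a sketch of that idea under that assumption; RestartablePaginatedEnumerator is a hypothetical name, and it omits the arity check for brevity.

```ruby
# Hypothetical variant: state lives in local variables inside each,
# so every call to each starts again from the configured first page.
class RestartablePaginatedEnumerator
  include Enumerable

  def initialize(page: 0, &blk)
    raise ArgumentError, "Block required to define how to fetch new records" if blk.nil?

    @pager = blk
    @first_page = page
  end

  def each
    return enum_for(:each) unless block_given?

    page = @first_page
    buffer = []
    loop do
      if buffer.empty?
        buffer = @pager.call(page).dup # dup so we never mutate the caller's Array
        raise StopIteration if buffer.empty?
        page += 1
      end
      yield buffer.shift
    end
  end
end

fetched = []
pages = RestartablePaginatedEnumerator.new do |page|
  fetched << page # record which pages were requested
  page < 2 ? ["#{page}:a", "#{page}:b"] : []
end

p pages.to_a     # ["0:a", "0:b", "1:a", "1:b"]
p pages.first(1) # ["0:a"]
p fetched        # [0, 1, 2, 0]
```

The trade-off is that every pass re-fetches pages from the start, which may or may not be what you want against a real API.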

In action

Let’s use it!

2.3.3 > ee = PaginatedEnumerator.new do |page|
2.3.3 >   puts ">>> FETCHING page #{page}"
2.3.3 >   if page > 3
2.3.3 >     puts ">>> Out of data!"
2.3.3 >     []
2.3.3 >   else
2.3.3 >     # Dummy content
2.3.3 >     ('a'..'c').map{|char| "#{page}:#{char}"}
2.3.3 >   end
2.3.3 > end
# => #<PaginatedEnumerator:0x007fa6b5a90f48 @pager=#<Proc:0x007fa6b5a90ef8@(irb):30>, @page=0, @buffer=[]>
2.3.3 > ee.each.with_index{|x, ii| puts "#{ii}: Next value: #{x}"}

### output
>>> FETCHING page 0
0: Next value: 0:a
1: Next value: 0:b
2: Next value: 0:c
>>> FETCHING page 1
3: Next value: 1:a
4: Next value: 1:b
5: Next value: 1:c
>>> FETCHING page 2
6: Next value: 2:a
7: Next value: 2:b
8: Next value: 2:c
>>> FETCHING page 3
9: Next value: 3:a
10: Next value: 3:b
11: Next value: 3:c
>>> FETCHING page 4
>>> Out of data!

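Notice how “FETCHING page n” is not printed until that page is reached; we are able to interact with each page’s data before fetching the next page. Starting again from a fresh ee (each advances the object’s internal @page and @buffer, so a used instance resumes where it left off), we can also easily use this with Ruby’s lazy enumerators: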

2.3.3 > ee.lazy.select{|x| x.end_with?("c")}.first(2)
>>> FETCHING page 0
>>> FETCHING page 1
 => ["0:c", "1:c"]

Note that using select on a non-lazy enumerator forces evaluation of the entire collection, and this behavior is no different with our enumerator (again starting from a fresh ee):

2.3.3 > ee.select{|x| x.end_with?("c")}.first(2)
>>> FETCHING page 0
>>> FETCHING page 1
>>> FETCHING page 2
>>> FETCHING page 3
>>> FETCHING page 4
>>> Out of data!
 => ["0:c", "1:c"]

In the lazy version, because of the magic of lazy, only the pages needed to find the first 2 results matching the select condition were ever fetched. Wonderful!