Fun with Python generators: group_by
A simple generator demo to build a group_by function
We’re going to develop our understanding of Python generators by using one to solve a simple and general problem.
The problem
Given an array or iterable object, loop over groups of consecutively matching elements.
For example, for [1, 1, 1, 2, 3, 3]
, we want to process [1, 1, 1]
, [2]
, and finally [3, 3]
If we had more 1
s at the end, they would be their own group:
[1, 1, 2, 1]
=> [1, 1]
, [2]
, [1]
.
This is already solved by itertools
:
import itertools
array = [1, 1, 1, 2, 2, 3, 1, 1]
[(val, list(items)) for val, items in itertools.groupby(array)]
#=> [(1, [1, 1, 1]), (2, [2, 2]), (3, [3]), (1, [1, 1])]
Let’s implement this ourselves using a Python generator!
The solution
We’re going to keep it simple and use a list as input and not bother with other iterators. The same logic will apply, it will just require more bookkeeping to deal with looking at next values and handling StopIteration
exceptions.
So our approach will be to keep a start and end index in our generator and return each group as a slice of the list according to these indices.
- Start with indices at 0.
- While we have elements left to look at, we’ll check the next value, and increment the end index until the value no longer matches (or we’ve run out of elements).
- We’ll then
yield
the array slices by the current start and end indices.yield
ing is the crux of generators, and each call toyield
defines the next value you’ll get out of your generator. - Finally, we’ll set the start index to the next (non-matching) element and repeat until finished.
def group_generator(array):
start_i = 0
end_i = 0
while start_i < len(array):
val = array[start_i]
while end_i < len(array) and array[end_i] == val:
end_i = end_i + 1
yield array[start_i:end_i]
start_i = end_i
Result:
array = [1, 1, 1, 2, 2, 3, 1, 1]
groups = group_generator(array)
list(groups)
# => [[1, 1, 1], [2, 2], [3], [1, 1]]
It’s easy enough to modify the return structure to match the itertools
function to have a tuple of (key, values)
, so I won’t do so here. And we’re return lists instead of the itertools._grouper
iterator as the values because we’re keeping it simple with lists.
Added feature: group_by key
Let’s finish by adding one useful feature to our generator: the ability to group by something other than the element value itself. This is useful if you have more complex data structure with a key, since as timestamp or id, that you want to use to group otherwise non-identical elements.
To do this, we’ll define an optional key
argument, and use it to compare values instead of our current equality check.
def group_generator(array, key=None):
def _equal(one, two):
if key == None:
return one == two
else:
return key(one) == key(two)
ii = 0
jj = 0
while ii < len(array):
val = array[ii]
while jj < len(array) and _equal(array[jj], val):
jj = jj + 1
yield array[ii:jj]
ii = jj
Let’s consider the data structure
data = [
{'k': 1, 'n': 0},
{'k': 1, 'n': 1},
{'k': 1, 'n': 2},
{'k': 2, 'n': 3},
{'k': 3, 'n': 4},
{'k': 3, 'n': 5},
{'k': 2, 'n': 6},
{'k': 2, 'n': 7},
]
Result: without using a key
list(group_generator(data))
#=> [ [{'k': 1, 'n': 0}],
# [{'k': 1, 'n': 1}],
# [{'k': 1, 'n': 2}],
# [{'k': 2, 'n': 3}],
# [{'k': 3, 'n': 4}],
# [{'k': 3, 'n': 5}],
# [{'k': 2, 'n': 6}],
# [{'k': 2, 'n': 7}]]
[len(group) for group in group_generator(data)]
#=> [1, 1, 1, 1, 1, 1, 1, 1]
Result: using a key
list(group_generator(data, key = lambda x: x['k']))
#=> [ [{'k': 1, 'n': 0}, {'k': 1, 'n': 1}, {'k': 1, 'n': 2}],
# [{'k': 2, 'n': 3}],
# [{'k': 3, 'n': 4}, {'k': 3, 'n': 5}],
# [{'k': 2, 'n': 6}, {'k': 2, 'n': 7}]]
[len(group) for group in group_generator(data, key = lambda x: x['k'])]
#=> [3, 1, 2, 2]
In Sum
There you have it, a simple method that uses Python generators to do a simple thing. Hopefully this helps you understand python generators and how to use them just a little bit better.