Code Design

Data Pipelines Part I: The Sequence Abstraction

Illustrated by Julia Hanke

Among the many amazing features of functional programming, or any language that supports lambdas, is the ability to express complex function calls with very concise and... human readable code. And what better use case than the very useful collection pipeline?

But to implement a collection pipeline, we need two main things: a common way for functions to access and manipulate collections and a library of said functions.

In this first part of our two-part series, we will be talking about the common interface that collections have in Clojure and how we can implement methods that take advantage of it. In part two, which is scheduled for the next issue, we'll be looking at the most used functions: Map, Filter, and Reduce.

Data Collections and the Sequence Abstraction

Depending on the way we want to retrieve data from a data collection, we might choose different internal representations of the data to optimize either the memory or the computation performance.

In Clojure, most data manipulation functions work on all the data collections, no matter what the type of the collection is. The same function works on a vector, a map, a list, or a range of numbers.

Clojure treats all data collections as sequences.

The Fundamental Idea of Data Manipulation

Think about all the numbers between 0 and 99.

What data structure are you going to choose to represent the numbers? Will you use an array, a vector, a linked list, a set, or another data structure?

The answer depends on the use case. Depending on your use case, you might choose:

  • a set if the order of the elements doesn't matter and you need to be able to determine quickly whether a number belongs to the data structure;
  • an array if the size is fixed and you need to access elements by their index;
  • a vector if the size is dynamic and you need to access elements by their index;
  • a linked list if you only need to access the elements in order.

You might even think of a more lazy data structure in which you don't store the elements themselves but only the first element and a function that generates the next element. (In our case, the first element would be 0 and the function that generates the next element would be increment.)
No matter what your choice is, let's give a name to your data structure. We will call the sequence S.
Now, how will you use sequence S to generate a sequence V that is made of the elements of S where each element is multiplied by 10? In other words, how will you generate the sequence made of 0, 10, 20, 30, and so on out of S?
Can you generate the vector V out of S with zero knowledge about the data structure chosen to represent the numbers between 0 and 99?
The answer is yes. The only thing you need is a way to traverse the elements of S in some order. A data structure whose elements can be traversed in some order is called-you guessed it-a sequence!

More precisely, we traverse the elements of a seqable collection using the four basic sequence functions:

  1. first retrieves the first element of the sequence.
  2. rest returns a sequence of the items after the first element.
  3. cons returns a new sequence where an element is prepended to the provided sequence.
  4. empty? returns true when the sequence is empty and false otherwise.

The fundamental idea of data manipulation in Clojure is that using only first, rest, cons and empty? we can write lots of data manipulation functions. For instance, we can write a function that receives a sequence s and returns a sequence that is made of the elements of s where each element is multiplied by 10. Here is the code for such a function:

(defn inc-all [s]
  (if (empty? s) 
    s
    (let [new-first (inc (first s))
          new-rest (inc-all (rest s))]
      (cons new-first new-rest))))

You don't need a deep understanding of the recursive algorithm1 implemented by this function. The important thing to notice is that the code doesn't access the internal structure of the sequence, but only uses the four sequence functions: first, rest, cons and empty?. As a consequence, mult-by-ten works with any data structure that implements the four basic sequence functions, including a vector, a list, a set, or any other data structure.
In Clojure, we have dozens of data manipulation functions that operate on every sequence, and this is made possible by the fact that the functions use only those four basic sequence functions. For Clojure developers, this is highly valuable as it allows them to write their own data manipulation functions by composing the functions provided by Clojure, and our functions also operate on every sequence with no special effort. This is the power of true polymorphism.

In this first part of our two-part series, we will explore the most common data manipulation functions to give you an idea of the wide variety of algorithms these simple four functions enable. In part two, we will look at the most significant one: map-reduce.

Range of Numbers

In Clojure, the simplest way to generate a range of numbers is through the range function.
Let's see range in action:

user=> (range 10)
(0 1 2 3 4 5 6 7 8 9)

The exact type of the sequence returned by range is not important at all. The only thing that you need to know is that it is a sequence. Therefore it is well handled by the four basic sequence functions: first, rest, cons and empty?. By default, sequences are represented with wrapping rounded parentheses.
Let's see the three basic sequence functions in action on a range of numbers.
The first element of (range 10) is 0:

user=> (first (range 10))
0

rest returns the sequence without the first element. Therefore when we call rest on (range 10), we get a sequence that starts with 1 and ends with 9:

user=> (rest (range 10))
(1 2 3 4 5 6 7 8 9)

Obviously, (range 10) is not empty:

user=> (empty? (range 10))
false

Maps as Sequences

maps are associative data collections in the sense that inside a map, each key is associated with a value. Let's look at a map that describes a person:

{:first-name "Kelly"
 :last-name "Kapowski"}

Let's compare this map to a nested vector made of pairs that describes the same person:

[[:first-name "Kelly"]
 [:last-name "Kapowski"]]

Although the information is represented differently by the map and the vector, from a purely logical standpoint, the map and the vector contain the same information. The information is:

  1. the key :first-name is associated with the value "Kelly"
  2. the key :last-name is associated to the value "Kapowski"

Conceptually, we could say that a map is a sequence of key-value pairs. In a sense, a map can be considered a sequence.
In Clojure, we cannot access a map as a sequence; we need to use the seq function to transform it into a sequence of key-value pairs:

user=> (seq {:first-name "Kelly"
    =>       :last-name  "Kapowski"})
([:first-name "Kelly"] [:last-name "Kapowski"])

Once a map is converted to a sequence, we can use on it all the sequence functions:

user=> (first (seq {:first-name "Kelly"
    =>              :last-name "Kapowski"}))
[:first-name "Kelly"]

user=> (rest (seq {:first-name "Kelly"
    =>             :last-name "Kapowski"}))
([:last-name "Kapowski"])

user=> (empty? (seq {:first-name "Kelly"
    =>               :last-name "Kapowski"}))
false

Now it's time to unveil the real power of sequences. Clojure provides a rich set of functions to manipulate sequences in many different ways: mapping, filtering, sorting, grouping, combining and more. When you get used to those functions, you can build data transformation pipelines with a few lines of code.

Overview about Mapping

Mapping is when we apply a transformation to each element of a sequence. In Clojure, the transformation is defined as a function that takes one argument. The transformation function operates only on the current element of the sequence. Don't confuse map as a noun, which is a data structure, with map as a verb, which is a function.
Clojure provides a rich set of functions from each family. This abundance of functions makes it fun for the Clojure developers to write code for many data pipeline tasks. Of course, it is fun only after you understand when and how to use those functions.
Let's focus for now on the most common mapping function available in Clojure: map. We will explore filtering and reducing in another article.
The most famous expression that involves the map function is:

(map inc [1 2 3])

It returns a sequence of the numbers 1, 2, and 3, incremented by 1:

user=> (map inc [1 2 3])
(2 3 4)

map receives two arguments: a function func and a sequence s. It returns a sequence consisting of the result of applying func to each element of s.

map breakdown
map breakdown

The expression (map inc [1 2 3]) returns a sequence where the inc function is applied to each element of [1 2 3]. The result is the sequence (2 3 4).
It might help to think about the result of (map inc [1 2 3]) as a sequence made up of (inc 1), followed by (inc 2), followed by (inc 3). It is important to pay attention to two properties of the sequence returned by map:

  1. The order of the elements in the sequence returned by map corresponds to the order of the elements in the original sequence.
  2. The length of the sequence returned by map is the same as the length of the original sequence.
map example
map example

Because of the power of the sequence abstraction, map is able to handle any type of sequence: vectors, lists, sets, and so on.

user=> (map inc (set [1 2 3]))
(2 3 4)
user=> (map inc (list 1 2 3))
(2 3 4)

The result is always a sequence!
It means that we are allowed to call map on the sequence returned by map. For example, we can increment again:

user=> (map inc (map inc [1 2 3]))
(3 4 5)

Cracking the Code of map

It's quite amazing that one can write the code for the map function and it will work with any sequence.
Let's do it---it's just a few lines of code:

(defn my-map [func s]
  (if (empty? s) 
    s
    (let [new-first (func (first s))
          new-rest (my-map func (rest s))]
      (cons new-first new-rest))))

Let's check ourselves:

user => (my-map inc [1 2 3])
(2 3 4)

Conclusion

In this brief overview of the concept of sequences, we have illustrated a principle that appears often in Clojure: with simplicity comes power.
The concept of sequence is so simple---it's just four basic functions, but the possibilities are endless.

Footnotes

1: There is a fun book that explores the many possibilities of these kinds of recursive algorithms. It's called "The Little Schemer" by Daniel P. Friedman and Matthias Felleisen.

Yehonathan Sharvit

author

Yehonathan has been in the Hi-Tech since 2000.

He has worked in various management positions from team leader to VP R&D, mostly in startups, combining strong technical knowledge with management skills. His management style is mostly influenced by Agile methodology and Lean Startup philosophy.

As a functional programming expert, Yehonathan is a regular speaker at Tech events around the world and he is the maintainer of Klipse https://github.com/viebel/klipse - a popular github open source project. He blogs about programming languages at http://blog.klipse.tech.

He has strong academic background: M.Sc. in Mathematics and B.Sc. in Electrical Engineering from Technion institute in Israel.

Julia Hanke

illustrator

Julia Hanke is an illustrator living in Warsaw, Poland. She worked in creative agencies, currently works as fulltime freelance Illustrator, mainly making Illustrations for animations and web design. Now she is shifting her focus on editorial and children's book Illustrations. You can follow her on instagram @julia_hanke.