Data Pipeline

Data Pipeline Part II: Map, Filter, Reduce

Illustrated by Bulma Juozaityte

Quite often in our programs, we manipulate data collections by applying a chain of transformation to our data until we extract from the data the result of a meaningful calculation. This chain of calculation could run on a single process or could be distributed to a cluster when we have to deal with big data.

We call this chain of calculation a data pipeline. The building blocks of data pipelines are the three fundamental operations: map, filter, and reduce.

Mapping is when we apply a transformation to each element of a sequence. In Clojure, the transformation is defined as a function that takes one argument. The transformation function operates only on the current element of the sequence. Don’t confuse map the noun, which is a data structure, and map the verb, which is a function.

Filtering is when we retrieve the elements of a sequence that fit some criteria. In Clojure, the criteria is defined as a function that takes one argument and returns a truthy value (usually true or false).

Reducing is when we go over a sequence and calculate a value that is aggregated over the sequence. This value could be a simple one, like a number, or a complex one, like a map.

In a previous article, we presented the sequence abstraction and the map operation. In this article, we are going to explore filter and reduce and build a mini data pipeline that combines map, filter, and reduce.

Filtering

Filtering is quite simple: we start from a sequence and return a sequence made only of the elements of the sequence that fit some criteria. The criteria is defined by a function that receives an element and returns either true or false. Such a function is called a predicate.

Let’s look at a simple example where we filter the odd numbers from a sequence of numbers:

user=> (filter odd? [1 2 3])
(1 3)

filter receives two arguments: a function pred and a sequence s. The sequence returned by filter is always a subsequence of the original sequence. The length of the returned sequence is always less than or equal to the length of the original sequence.

Cracking the Code of filter

Like we saw in the previous article about the map function, it’s quite amazing that one can write the code for the filter function and that it will work with any sequence.

Let’s create the filter—it’s just a few lines of code:

(defn my-filter [pred s]
  (if (empty? s) 
    s
    (if (predicate (first s))
      (cons (first s) 
            (my-filter predicate (rest s)))
      (my-filter predicate (rest s)))))

Let’s check ourselves:

user => (my-filter odd? [1 2 3])
(1 3)

A few words about the code of my-filter:

  1. When the sequence s is empty, we return it.
  2. When the predicate returns true on the first element of s, we prepend the first of element of s to the result of a recursive call to my-filter with the same predicate and the rest of s.
  3. When the predicate returns false on the first element of s, we return the result of a recursive call to my-filter with the same predicate and the rest of s (without prepending the first element of s).

Reducing

The reduce operation is a bit more complicated to explain than map and filter. The reduce operation is sometimes called aggregate or inject.

Basically, reducing a sequence means to go over a sequence and calculate a value that is aggregated over the sequence. The calculation to execute on each element is expressed by a reducing function that receives two arguments:

  1. The aggregated value so far
  2. The current element of the syntax

The order of the calculation is like this: On each element of the sequence, the reducing function is called with acc and elem, and the result becomes the aggregated value for subsequent iterations. The value returned by the last call to the reducing function becomes the result of the whole reduce operation.

Lots of pieces of code that you are used to writing with for loops could be translated into a reduce operation.

Let’s take a look, for instance, at a for loop in C-like code that calculates the maximal value of a sequence:

max = 0
for(i = 0; i < arr.length; i++) {
      max = arr[i] > max ? arr[i] : max;
} 

Here is the corresponding reduce piece of code:

(reduce (fn [acc elem]
          (if (> elem acc) elem acc))
        0
        arr)

The reducing function is an anonymous function that receives acc elem and returns either acc or elem.

The second parameter to reduce (in our example it’s 0) is the initial result, which is the value that is passed to the first call of the reducing function.

Let’s write explicitly the flow of the reduce operation, step by step, when we pass the sequence [1 9 3]:

(reducing-fn 0 1) ;; the result is 1
(reducing-fn 1 9) ;; the result is 9
(reducing-fn 9 3) ;; the result is 9

| Iteration | aggregated value | element | result |
+-----------+------------------+---------+--------+
| #1 | 0 (the initial result) | 1 | 1 |
| #2 | 1 | 9 | 9 |
| #3 | 9 | 3 | 9 |

Notice how the result of each iteration becomes the aggregated value of the subsequent iteration. For instance, the result of iteration #2 is 9, which is also the aggregated value for iteration #3.

Cracking the Code of Reduce

You might think that reduce is complicated to implement, but actually it’s quite straightforward. Like we saw in the previous section about the filter function, it’s quite amazing that one can write the code for the reduce function and it will work with any sequence.

Let’s write the reduce function—it’s just a few lines of code:

(defn my-reduce [reduce-fn res s]
  (if (empty? s)
      res
      (my-reduce reduce-fn
                 (reduce-fn res (first s))
                 (rest s))))

Let’s check ourselves:

user => (my-reduce (fn [acc elem]
               (if (> elem acc) elem acc))
            0
           [1 9 3])
9

A few words about the code of my-reduce:

  1. When the sequence s is empty, we return res, which is the result of the reducing calculation so far.
  2. Otherwise we make a recursive call to my-reduce with the following:
  • a. the same reducing function, reduce-fn
  • b. the result of calling reduce-fn with the result so far and the first element of s
  • c. the rest of s

A Mini Data Pipeline

Let’s conclude this article with a mini data pipeline—a short piece of code that that uses together map, filter, and reduce.

Our initial sequence is a sequence of fractions, and we need to return the greatest denominators of the fractions in the sequence whose denominator is odd.

We are going to implement this by a chain call to map, filter, and reduce based on the flow presented in the following diagram:

Let’s code our mini data pipeline, piece by piece!

Clojure provides a denominator function that returns the denominator of a fraction.

Let’s start with the sequence [1/3 1/2 3/5 5/7] and map each element of the sequence to its denominator:

user => (map denominator [1/3 1/2 3/5 5/7])
(3 2 5 7)

Now, we want to filter the odd elements of the sequence:

user => (filter odd? [3 2 5 7])
(3 5 7)

Finally, we calculate the greatest element of the sequence with reduce:

user=> (reduce (fn [acc elem]
                 (if (> elem acc) elem acc))
               0
              [3 5 7])

Let’s combine those three pieces of code in a map-filter-reduce chain:

user=> (reduce (fn [acc elem]
                  (if (> elem acc) elem acc))
               0
              (filter odd? 
                      (map denominator [1/3 1/2 3/5 5/7])))
7

This kind of code quickly becomes hard to read when we combine all the pieces in a single piece of code. Fortunately, Clojure provides syntactic sugar—namely a thread macro—that allows us to write the same code in a more readable manner, like this:

user=> (->> [1/3 1/2 3/5 5/7]
            (map denominator)
            (filter odd?)
            (reduce (fn [acc elem]
                        (if (> elem acc) elem acc))
                    0))
7

Threading macros is an interesting topic for another article.

Conclusion

In this two-part series about data pipelines, we have seen how the sequence abstraction allows us to handle, in a common way, different types of data collections. The basic data manipulation operations on sequences are map, filter, and reduce. Here is a summary of the meaning of each operation:

  • Map is applying a transformation to each element of a sequence.

  • Filter is creating a subsequence made only of the elements of the sequence that fit some predicate.

  • Reduce is going over a sequence and calculating a value that is aggregated over the sequence.

This article explained in detail the meaning of these three operations and how to use them in Clojure. The same ideas exist, with slight differences, in different programming languages.

Yehonathan Sharvit

author

Yehonathan has been in the Hi-Tech since 2000.

He has worked in various management positions from team leader to VP R&D, mostly in startups, combining strong technical knowledge with management skills. His management style is mostly influenced by Agile methodology and Lean Startup philosophy.

As a functional programming expert, Yehonathan is a regular speaker at Tech events around the world and he is the maintainer of Klipse https://github.com/viebel/klipse - a popular github open source project. He blogs about programming languages at http://blog.klipse.tech.

He has strong academic background: M.Sc. in Mathematics and B.Sc. in Electrical Engineering from Technion institute in Israel.

Bulma Juozaityte

illustrator

Bulma is multi-talented, freelance illustrator with a quirky, colourful style. Bulma loves animation and predominantly illustrates for motion graphics, using her creativity to help tell stories for brands. Originally from Lithuania, and having studied and worked in the UK, Bulma now enjoys island life on Malta. There, she learned to swim and dove even deeper into illustrating. Bulma is always looking for new challenges and opportunities.