DataWeave - The Reduce Function

An explanation of the reduce function and how to use it in data transformations using DataWeave

This post examines the reduce function in the DataWeave (DW) language. It first illustrates reduce at a high level using a simple example. Then it explains how reduce works, what it expects as arguments, and how those arguments need to be constructed. After that, we’ll dive into a more complex example, then a real-world example that illustrates how reduce can make existing map/filter code more efficient, and finally an example of how we can use reduce to separate the concerns involved in counting the number of times something occurs in an array. Along the way, this post also highlights a few functional programming concepts, like higher-order functions and immutable data structures. If you’re already familiar with these concepts, feel free to skip those sections; you won’t miss anything.

reduce is one of those functions in DW that doesn’t seem to get as much love as its companions, map and filter. Most people know about map and filter; use map to create a new array that’s the result of applying a transformation function to each element of the original array, and use filter to create a new array with elements from the original array removed according to a predicate function. When it comes to data transformation, it’s relatively easy to identify use cases for map and filter: mapping fields from one data source to another, or removing elements that shouldn’t go to a certain data source. They’re relatively specific functions (not to mention used all the time in Mule integrations and data sequences in general), so identifying when they should be used is straightforward once you understand what they do. Identifying where reduce comes into play is a little bit more difficult because it is incredibly general, and unfortunately most of the examples out there don’t really paint a great picture of what it’s capable of. Most of us who were curious about reduce in the past have already seen the addition example everywhere:

%dw 1.0
%output application/java

%var input = [ 1, 2, 3, 4 ]
---
input reduce ( ( curVal, acc = 0 ) -> 
  acc + curVal )

// Output: 10

If you’re not familiar, this code adds all the numbers in the array. Incredible!

The code is trivial, and chances are you’re never going to simply add an array of numbers together in real code. But this example illustrates something important that I overlooked for a long time, and maybe you did, too: reduce, like map and filter, takes an array and a function as input, but unlike map and filter, its primary job is to reduce all the elements in the array to a single element, where an element can be a number, string, object, or array (more on the last two later). In this case, we have an array of numbers that’s reduced into a single number.
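To make the “single element” idea a bit more concrete, here is a minimal sketch that reduces an array of strings into one string. The input array and the empty-string starting value are my own illustrative choices, not part of the original example, and the sketch assumes the ++ operator for joining strings.

```dataweave
%dw 1.0
%output application/json
---
// Illustrative input; the accumulator starts as an empty string
// and each step appends the current value to it.
[ "a", "b", "c" ] reduce ( ( curVal, acc = "" ) ->
  acc ++ curVal )

// Output: "abc"
```

The shape is the same as the addition example: only the starting value and the way the two pieces are combined have changed.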

Let’s unwrap the mechanics of reduce before moving on, to make sure we really understand how to use it. First things first: just like map and filter, reduce is a higher-order function. What’s a higher-order function, you ask? It is a function that takes another function as one of its inputs. reduce takes two parameters: on its left side it takes an array, and on its right side it takes a function that it will use to operate on each value of the array. The left side is trivial; the right side is where things can get confusing. The function passed to reduce needs to take two arguments: the first represents the current value of the iteration, and the second is an accumulator (which could be anything: a number, object, array, etc.). Just like it’s your responsibility to make sure the function passed to filter returns a boolean, it’s your responsibility to make sure the function passed to reduce returns the new value for the accumulator. Let’s look at how the accumulator changes with each step through the input array by using the log function on the example above (see the DataWeave documentation for more on how log works). If you’re unclear on how reduce works, log will be your best friend when debugging reduce functions. We will also log the current value of the iteration.

%dw 1.0
%output application/java

%var input = [ 1, 2, 3, 4 ]
---
input reduce ( ( curVal, acc = 0 ) -> 
  log( 'acc = ', acc ) + log( 'curVal = ', curVal ) )

Here’s what the log looks like (formatted for clarity):

acc    = 0
curVal = 1

acc    = 1
curVal = 2

acc    = 3
curVal = 3

acc    = 6
curVal = 4

Keep in mind that in the above code, we’re logging acc before it is replaced by the expression acc + curVal. Let’s take that log file and look at pieces of it to see what reduce is doing:

acc    = 0
curVal = 1

0 + 1 = 1. What’s the next value for acc? 1!

acc    = 1
curVal = 2

1 + 2 = 3. What’s the next value for acc? 3!

acc    = 3
curVal = 3

3 + 3 = 6. By now you see where this is going.

Let’s make this example a little bit more complicated to illustrate that we can use something more complex than a number for the accumulator. What if we wanted to add all the even numbers together, add all the odd numbers together, and return both? First, we already know we’re going to need a container to hold the two values. Let’s decide now that for this we will use an object with two keys, odd and even. We’ll also create a function, isEven, to help future developers understand our code.
We’ll slap on the log now so we can see how the accumulator changes with each iteration:

%dw 1.0
%output application/java

%var input = [ 1,  2,  3,  4,  5 ]

%function isEven( n ) ( n mod 2 ) == 0
---
input reduce ( ( curVal, acc = { odd: 0, even: 0 } ) -> log( 'acc = ', {
   odd:  ( acc.odd  + curVal unless isEven( curVal ) otherwise acc.odd  ),
   even: ( acc.even + curVal when   isEven( curVal ) otherwise acc.even )
} ) )

// Output: { odd: 9, even: 6 }

Here’s what the log file looks like:

acc = { odd: 1, even: 0 }
acc = { odd: 1, even: 2 }
acc = { odd: 4, even: 2 }
acc = { odd: 4, even: 6 }
acc = { odd: 9, even: 6 }

Since the array we passed to reduce alternates between odd and even numbers, the function we passed to reduce alternates between adding to the odd value and the even value as well. And notice that the function passed to reduce creates a new object to return as the accumulator every time. We’re not modifying the existing accumulator object. We couldn’t modify it even if we wanted to; data structures in DW are immutable by design. Avoiding modifying an existing object is an important functional programming concept; map and filter work the same way. This might seem confusing at first, but look at it this way: for reduce, the data that you return must be in the same shape as your accumulator. In the first example, our accumulator was a number, so we return a number. In this example, our accumulator was an object with two keys, odd and even, so we return an object with the keys odd and even.

The examples above are purely pedagogical, though. How might we use reduce in the workplace? A typical use case is to count the number of times something occurs (why I emphasize “something” will be revealed later). Say we receive an array of payment transactions from a data source, and we want to know how many of these transactions were over a certain threshold, say, $100.00. We also want a list of all the merchants that charged us over $100.00, with no duplicates. The requirements dictate that this must all be in a single object. Here’s how we might do that without reduce:

%dw 1.0
%output application/java

%var input = [
  {
    "merchant" : "Casita Azul",
    "amount"   : 51.70
  },
  {
    "merchant" : "High Wire Airlines",
    "amount"   : 378.80
  },
  {
    "merchant" : "Generic Fancy Hotel Chain",
    "amount"   : 555.33
  },
  {
    "merchant" : "High Wire Airlines",
    "amount"   : 288.88
  }
]

%var threshold = 100

%function overThreshold( n ) n > threshold

%var transactionsOverThreshold = input filter overThreshold( $.amount )

%var merchants = ( transactionsOverThreshold map $.merchant ) distinctBy $
---
{
  count: sizeOf transactionsOverThreshold,
  merchants: merchants
}
// Output: 
// {
//   count: 3,
//   merchants: [ 'High Wire Airlines', 'Generic Fancy Hotel Chain' ]
// }

This is nice, and does the job quite well for a small input payload. But notice that we need to loop through the input payload once to keep only the objects with amounts over the threshold, then loop through the resulting array to map the values into a list of merchant names, and then loop through that array again to remove duplicate merchants. This is expensive! Since this is a real-world example, what if there were 400K records instead of just 4? At this point you might be thinking to yourself, “I can just use Java instead, and I will only have to loop through the payload once with a for loop.” Not so fast! Don’t give up on DW just yet. What if we could use a single reduce instead of multiple map/filter combinations? Here’s what that would look like:

%dw 1.0
%output application/java

%var input = ... // Same as above except for 400K instead of 4

%var threshold = 100

%function overThreshold(n) n > threshold
---
input reduce ( ( curVal, acc = { count: 0, merchants: [] } ) -> ( {
  count: acc.count + 1,
  merchants: ( acc.merchants + curVal.merchant
                 unless acc.merchants contains curVal.merchant
                 otherwise acc.merchants )
} ) when overThreshold( curVal.amount ) otherwise acc )

// Output: 
// {
//   count: 3,
//   merchants: [ 'High Wire Airlines', 'Generic Fancy Hotel Chain' ]
// }

Much better. Now we can deal with everything we need to in one loop over the input payload. Keep this in mind when you’re combining map, filter, and other functions to create a solution: reduce can be used to simplify these multi-step operations and make them more efficient (thanks to Josh Pitzalis and his article ‘How JavaScript’s Reduce method works, when to use it, and some of the cool things it can do’, for this insight. Check out his article to see how you can create a pipeline of functions to operate on an element using reduce. It is very cool).

Notice that again we’re never mutating the accumulator, because data structures are immutable in DataWeave. We either pass on the existing accumulator (otherwise acc), or we create a new object and pass that on to the next step of the iteration. Also notice that we’ve reduced an array of elements into a single object, and built an array within the object in the process (because who says we can’t build while we reduce?).

Let’s simplify the problem above to illustrate another point. This time, we’ll only get the count of every transaction over $100.00. Counting the number of times something occurs in an array is a very common use case for reduce. It’s so common that we should separate the concern of how to count from the concern of when to increment the counter. Here goes nothing:

%dw 1.0
%output application/java

%var input = ... // Same as above

%var threshold = 100

%function countBy( arr, predicate )
  arr reduce ( ( curVal, acc = 0 ) -> 
    acc + 1 when predicate( curVal ) otherwise acc )
---
{
  count: countBy( input, ( ( obj ) -> obj.amount > threshold ) )
}

// Output:
// {
//   count: 3
// }

Now we have a higher-order function, countBy, that takes in an array, and a function that defines exactly under what conditions we should increment the counter. We use that function in another higher-order function, reduce, which deals with the actual iteration and reduction to a single element. How cool is that? Now, with tools like readUrl, we can define the countBy function in a library, throw it into a JAR, and reuse it across all our projects that need it. Very cool.
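To show why separating those concerns pays off, here is a small sketch that reuses the same countBy with two different predicates. The variable name transactions and the second predicate (counting by merchant name) are my own additions for illustration; the data is the four-record payload from earlier.

```dataweave
%dw 1.0
%output application/json

// Same four transactions as the earlier example
%var transactions = [
  { merchant: "Casita Azul",               amount: 51.70  },
  { merchant: "High Wire Airlines",        amount: 378.80 },
  { merchant: "Generic Fancy Hotel Chain", amount: 555.33 },
  { merchant: "High Wire Airlines",        amount: 288.88 }
]

// countBy only knows how to count; the predicate decides when to count
%function countBy( arr, predicate )
  arr reduce ( ( curVal, acc = 0 ) ->
    ( acc + 1 ) when predicate( curVal ) otherwise acc )
---
{
  overThreshold: countBy( transactions, ( ( t ) -> t.amount > 100 ) ),
  airline: countBy( transactions, ( ( t ) -> t.merchant == "High Wire Airlines" ) )
}

// Output:
// {
//   "overThreshold": 3,
//   "airline": 2
// }
```

Nothing about countBy had to change to answer the second question; only the function we handed it did.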

The examples shown above do not use the default arguments to reduce’s function, $ and $$. I believe it’s easier to teach how reduce works by explicitly defining the parameters to the input function, but in some situations, this won’t work, and you’ll need to rely on the defaults. For example, let’s implement the function maxBy using reduce, which will get us the maximum value in an array according to a predicate function that defines what makes one value larger than another.

%function maxBy( arr, fn )
  arr reduce ( ( curVal, acc = 0 ) -> 
    curVal when fn( curVal, acc ) otherwise acc )

Do you see the problem here? We initialize the accumulator with 0, an integer. If we pass in the array [-3, -2, -1], and the function ((curVal, max) -> curVal > max), we’d expect a function called maxBy to return -1, but this one will return 0, a value that’s not even in the array, because curVal > max will return false for every element in the array. Even worse, what if arr wasn’t an array of numbers? We might try to get around this by doing this instead:

%function maxBy(arr, fn)
  arr reduce ((curVal, acc=arr[0])->
    curVal when fn(curVal, acc) otherwise acc)

which will work just fine, but at this point we might as well avoid getting the value by index and take advantage of the default arguments: $, which is the current value of the iteration, and $$, which is the current value of the accumulator. By default, $$ will be initialized with the first value of the array passed in, and the iteration starts at the second value:

%function maxBy(arr, fn)
  arr reduce ($ when fn($, $$) otherwise $$)

The lambda ($ when fn($, $$) otherwise $$) can be explained as, “Set the accumulator ($$) to the current value ($) when the function fn returns true, otherwise, set the accumulator as the current accumulator.”
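Here is a hedged sketch of how this version of maxBy might be called, first with the array of negative numbers from above and then with an array of objects. The transactions variable and the amount-based comparison are illustrative choices of mine, not something the function requires.

```dataweave
%dw 1.0
%output application/json

// Illustrative data, reusing two transactions from the earlier example
%var transactions = [
  { merchant: "Casita Azul",        amount: 51.70  },
  { merchant: "High Wire Airlines", amount: 378.80 }
]

%function maxBy( arr, fn )
  arr reduce ( $ when fn( $, $$ ) otherwise $$ )
---
{
  // $$ starts as -3, so the result stays inside the array: -1
  largestNumber: maxBy( [ -3, -2, -1 ],
    ( ( curVal, max ) -> curVal > max ) ),

  // Works on objects too, because the caller defines what "larger" means;
  // this should pick the High Wire Airlines transaction
  largestTransaction: maxBy( transactions,
    ( ( curVal, max ) -> curVal.amount > max.amount ) )
}
```

Because the accumulator is seeded from the array itself, maxBy no longer needs to know anything about the type of the elements.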

Before wrapping up there are three things I’d like to point out. First, we’ve seen that we can replace map/filter combinations with reduce, so it follows that we can implement both map and filter in terms of reduce. Here’s filter if you need proof:

%function filter(arr, predicate)
  arr reduce ((curVal, acc=[]) -> 
    acc + curVal when predicate(curVal) otherwise acc)
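For completeness, map can be sketched the same way. I’ve called it mapWithReduce here, a made-up name chosen only to keep it apart from the built-in map; the doubling function is likewise just an example.

```dataweave
%dw 1.0
%output application/json

// mapWithReduce is a hypothetical name; it appends the transformed
// value to a new accumulator array on every step.
%function mapWithReduce( arr, fn )
  arr reduce ( ( curVal, acc = [] ) ->
    acc + fn( curVal ) )
---
mapWithReduce( [ 1, 2, 3 ], ( ( n ) -> n * 2 ) )

// Output: [ 2, 4, 6 ]
```

One caveat with this sketch: it assumes fn returns a single value to append, so it is an illustration of the idea rather than a drop-in replacement for map.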

This means that there are times when you may try to use reduce where map or filter would be the more specific and appropriate tool to get the job done. Try not to use reduce in these circumstances, and instead reach for the more specific function. The intent of your code will be more obvious, and future developers reading your code will thank you. I’ve found that a good rule of thumb is that if I’m going to use reduce to reduce to an array, chances are my intentions would be clearer using map or filter (but this is not always the case).

Second, these examples use the variables curVal and acc to denote the current value of the iteration, and the accumulator. I’ve used these names to help illustrate how reduce works. I do not recommend using these names when you write code. Use names that describe what you’re working with. For example, when trying to find the count of transactions over a threshold to generate a report like we did earlier, we might use trans and report instead of curVal and acc.
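As a rough sketch of what that renaming might look like, here is the earlier report-building reduce with trans and report in place of curVal and acc. The transactions variable name is my own stand-in for the input payload; the logic is unchanged from the earlier example.

```dataweave
%dw 1.0
%output application/json

// Stand-in for the input payload used earlier
%var transactions = [
  { merchant: "Casita Azul",               amount: 51.70  },
  { merchant: "High Wire Airlines",        amount: 378.80 },
  { merchant: "Generic Fancy Hotel Chain", amount: 555.33 },
  { merchant: "High Wire Airlines",        amount: 288.88 }
]

%var threshold = 100
---
// trans is the current transaction; report is the object being built up
transactions reduce ( ( trans, report = { count: 0, merchants: [] } ) -> ( {
  count: report.count + 1,
  merchants: ( report.merchants + trans.merchant
                 unless report.merchants contains trans.merchant
                 otherwise report.merchants )
} ) when ( trans.amount > threshold ) otherwise report )
```

With descriptive names, a reader can tell what is being accumulated without first having to work out what acc stands for.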

Third, and this is more general advice for consultants: reduce isn’t a concept that is easily understood by most programmers (I wrote this article with the purpose of better understanding how it works, myself), especially those that come from a Java/C++/C# background where mutable data structures and imperative looping constructs are the name of the game. Where I work, we have a multitude of clients, some heavily adopting Mule products across their organizations with years of internal Mule expertise, others having no internal expertise and needing just a few small integrations. As consultants, we need to leave these organizations with code that they can change, maintain, and expand on when we’re gone. Get a feel for the organization you’re working with. Do the developers there understand functional concepts? Are most of them long-time Java programmers who’ve never seen reduce in their lives (assuming they’re not hip with Java 8 Streams yet)? If you’re dealing with the former, using reduce to lessen the amount of code you need to write to accomplish a task is a good move; others might already understand your code, and if not, they have other people within their organization that can help. If you’re dealing with the latter, you’ll probably cause a fair share of headaches and fist-shakings at the clever code you left behind that the client is now having trouble understanding. Point being, reduce is not for everyone or every organization, and the client needs to come before the code.

In conclusion, we use reduce to break down an array of elements into something smaller, like a single object, a number, or a string. But it also does so much more. reduce is an important and highly flexible tool in functional programming, and therefore in DW as well. Like most highly flexible programming constructs, reduce is a double-edged sword and can easily be abused, but when you use it right, there’s nothing quite like it.