A generic merge function

When working with sorted lists you often come to the point where you want to combine two or more of them. This merge procedure forms the heart of merge sort it works something like:

merge [1,3,4,5] [2,3,4] = [1,2,3,3,4,4,5]

This merge function is not in the Haskell standard library, and even if there were, it might not be very useful.

The problem is that when you need merge you often need a slight variation. For example, you might want to remove duplicates,

merge_union [1,3,4,5] [2,3,4] = [1,2,3,4,5]

Or find the elements common to both lists,

merge_intersection [1,3,4,5] [2,3,4] = [3,4]

Or you want the difference, the symmetric difference, or...


The solution for all these problems is to make a more general merge function. To do that we take a note from the most generic function over a single list, foldr. The generic merge function is also a right fold, but over two lists. Behold the type signature:

mergeByR :: (a -> b -> Ordering)  -- ^ cmp: Comparison function
         -> (a -> b -> c -> c)    -- ^ fxy: Combine when a and b are equal
         -> (a -> c -> c)         -- ^ fx:  Combine when a is less
         -> (b -> c -> c)         -- ^ fy:  Combine when b is less
         -> c                     -- ^ z:   Base case
         -> [a] -> [b] -> c       -- ^ Argument lists and result list

Don't be scared by the size. The reason there are a lot of arguments is that for each case we use a different combining function: If the smallest element comes from the first list we use fx, if it comes from the second list we use fy, and when the two elements are equal, we combine them both with fxy. As in foldr these calls to fx/fy/fxy are then chained like fx x1 (fx x2 (.. z)).

The lists from the example above can be aligned as follows:

xs               =  [1,      3,   4,   5 ]
ys               =  [    2,  3,   4      ]
function to use  =  [fx, fy, fxy, fxy, fx]
mergeByR ....    =  fx 1 . fy 2 . fxy 3 3 . fxy 4 4 . fx 5 $ z

The function implementation is straightforward:

mergeByR cmp fxy fx fy z = go
    where go []     ys     = foldr fy z ys
          go xs     []     = foldr fx z xs
          go (x:xs) (y:ys) = case cmp x y of
              LT -> fx  x   (go xs (y:ys))
              EQ -> fxy x y (go xs ys)
              GT -> fy    y (go (x:xs) ys)


Now, let's look at some uses of this function. First of all, the usual merge sort merge function:

mergeBy cmp = mergeByR cmp (\a b c -> a:b:c) (:) (:) []
merge = mergeBy compare

Instead of adding both a and b to the resulting list when they are equal, we can instead add only one of them, or even the result of some function on them. This gives the set union operation:

unionByWith cmp f = mergeByR cmp (\a b c -> f a b:c) (:) (:) []
unionWith = unionByWith compare

If we ignore elements that occur in only one of the lists by setting fx and fy to const id, we get the intersection instead:

intersectionByWith cmp f = mergeByR cmp (\a b c -> f a b:c) (const id) (const id) []
intersectionWith = intersectionByWith compare


With these merge functions, implementing merge sort becomes simple. All that is left to do is split a list in two, and recursively sort and merge.

split :: [a] -> ([a],[a])
split (x:y:zs) = let (xs,ys) = split zs in (x:xs,y:ys)
split xs       = (xs,[])
sort [] = [] sort [x] = [x] sort xs = let (ys,zs) = split xs in merge (sort ys) (sort zs)

If we replace merge by unionWith we instead get a sort that combines duplicate elements.


Besides set operations, mergeByR can also be (ab)used for other things, such as

zipWith = intersectionByWith (const $ const EQ)

Or a variant of zipWith, that keeps the tail of the longer list:

zipWith' = unionByWith (const $ const EQ)

We can even implement concatenation:

(++) = mergeByR (const $ const LT) undefined (:) (:) []

Comments

Josef Svenningssonx

I like this generic merge function. It is interesting that you made it possible for the function to return results of other types than lists, something I probably wouldn't have done. But none of your examples use the fact that you can return some other type. Do you have such an example? It would be nice to see if the extra power is really motivated.

Duncanx

My generic merge looks like:

data MergeResult a b = OnlyInLeft a
                     | InBoth a b
                     | OnlyInRight b
mergeBy :: (a -> b -> Ordering) -> [a] -> [b] -> [MergeResult a b]

So that captures, once and for all, the merging aspect but it leaves you with the problem of what to do with the MergeResult. Most uses do a mergeBy followed by a fold of some kind.

For example we can select out subsets, like the intersection, or lefts paired with maybe rights or vice versa. We can do it lazily with a right fold or if we're checking that all the input falls into some category we can do a left fold, accumulating the good results and failing if we find any bad, or partitioning the good and bad and returning both.

I suspect these two merges are equivalent. I could probably implement mine in terms of yours like:

mergeBy cmp xs ys =
 mergeByR cmp InBoth OnlyInRight OnlyInLeft []

The other way round is probably a foldr where we do case analysis on the merge result and apply on of f_xy f_x f_y as appropriate.

Josef: The flexible return type allows you to combine mergeByR with other folds, in the same way as foldr/build list fusion. For instance, conting the number of common elements with (length . intersection) can be fused to (mergeR (\_ _ n -> n+1) id id 0).

Duncan: Yes, they are the same (or at most a foldr away). Case analysis on your MergeResult is captured by a deconstructor with the type:

deMergeResult :: (a -> c) -> (a -> b -> c) -> (b -> c) -> MergeResult a b -> c

Now foldr (deMergeResult fx fxy fy) z . mergeBy cmp is the same as mergeByR.

wren ng thorntonx

Would you mind if I kidnapped this for list-extras? I essentially already have it hidden away in Data.List.Extras.Pair, though not quite generalized from zipping to the merge case.

It can be generalized further to distinguish between the LT/GT cases and the cases where one list is longer than the other--- the particular use case being to detect length mismatch in a single traversal as is done by pairWithBy and friends.

wren ng thorntonx

Er, your blog seems to have eaten the link: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/list-extras

sdx

The genericity of this function is great, but there is no such thing as a merge/intersection. You either want to merge or to intersect list but the two concepts are mutually exclusive. Using the appropriate one will probably be faster and easier to understand. Remember the UNIX philosophy? It applies very well to functional programming. Do one thing, but do it well

Josef Svenningssonx

Twan, you don't need the extra general type to make your function work well with foldr/build fusion. The more specific alternative returning a list can be made to fuse just as well.

It's interesting that you should mention foldr/build fusion because your function is an example of where foldr/build fusion doesn't work that well. Suppose you have something like 'mergeByR cmp fxy fx fy z (build f) (build g)'. Now we would like to fuse both argument lists so that no intermediate lists remain. This seems impossible to do with the foldr/build framework. Switching to destroy/unfoldr solves the problem.

José Pedro MagalhãesDate: 2011-06-18T11:42Zx

Nice! Though to be really (datatype-)generic, I would expect it not to take lists, but just "any" f (probably a Functor f, or Traversable f, or so).

Twan van LaarhovenDate: 2011-06-20T01:34Zx

José: Most Functors can not be merged. For example, data Pair a = Pair a a is a Functor, but it doesn't even allow concatenation.

But there are some possibilities for further generalization. Note that Ordering is isomorphic to a pair (Bool,Bool), not both False, saying which items come first. The combining functions could be replaced by a single one with type Maybe a -> Maybe b -> c -> c, where it is never the case that both arguments are Nothing.

To merge three lists you could use triples (Bool,Bool,Bool), and a lot of combining functions. Maybe you can do something interesting with a type other than Bool.

Reply

(optional)
(optional, will not be revealed)
What greek letter is usually used for anonymous functions?
Use > code for code blocks, @code@ for inline code. Some html is also allowed.