August 08, 2010

What’s Who‘s cooking?

There are already some quite nice introductions on what EMF Compare is trying to solve and in which way. I really recommend Cédric’s "What's cooking" post, so check it out if you need some starting point to this subject.

Providing new concepts and ideas for one fundamental part of the whole EMF comparison process - the matching phase - is the main objective of my ongoing GSoC project. Diving into the internals of the matching algorithm with its "cook" - the generic match engine - is therefore necessary to "open" EMF Compare for further adoption.

In this post I'll describe my understanding of the EMF Compare matching algorithm and the suggested changes on the current implementation of the generic match engine. In the next post I'll show by means of a newly developed matching engine, how these refactorings would allow to plug in other matching algorithms to the whole comparison process.

The EMF comparison process is divided into two distinct phases, the matching and the differencing phase. I will focus on the matching phase - browsing the different model versions and figuring out which model element matches which one.

The generic match engine has several responsibilities. It provides the different versions of the model elements - handling initialization, loading and preprocessing, e.g. usage of XMI resp. Ecore IDs for comparison or reference to structural similarity. It splits the whole search space into smaller processable blocks (=search context) based on similar root model elements. Within each search context, all associated model elements of one version are compared with the model elements of the other version. Different kinds of similarity metrics are leveraged in this step to derive appropriate matchings - still unmatched model elements are reprocessed in a further run. Also note that this matching step is applied recursively. Some kind of bookmarking/caching is performed based on the computational effort for calculating the similarity scores. Finally it creates the match model based on the matched model element pairs.

Although the generic match engine provides several hooks for customization, its still tight coupling makes it rather difficult to extend. Splitting up responsibilities would largely help for adoption.

The match engine keeps the "cook", handling data and orchestrating the overall workflow, but delegating the matching process to a separate component, the matcher.

This matcher component encapsulates the logic for the actual matching algorithm. In the case of the current implementation, it relies on pairwise matchings of all provided model elements (actually these pairwise matchings are only applied to a subset of the provided model elements bounded by a search window with a fixed size for performance reasons). In the next post I'll show how another matching algorithm fits into this modified structure.

The match similarity component is heavily used by the matcher component to compute similarity scores for a pair of model elements. Making it stateless (i.e. outsourcing caching concerns) would make it much easier to plug in your own similarity metrics.
These changes are currently tracked by a patch included in my accompanying GSoC project page. So if you are interested, just check it out.

No comments:

Post a Comment