August 10, 2010

Cooking 2.0

In my last post I proposed some refactorings on the current implemention of the EMF Compare match engine and tried to explain (from my point of view) the reasoning behind it. In this post I'll show how another matching strategy - based on a neighborhood search instead of a full iteration with pairwise matchings - fits into this adopted structure.

The major bottleneck of the current matching algorithm is the computation of the "expensive" pairwise similarity scores. The implemented heuristic tries to overcome this problem by applying a search window which reduces the number of necessary pairwise similarity score calculations on the one hand, but imposes several restrictions and the need for further processing on the other hand (e.g. how to detect model changes like moving an element out of the search window).

Developing and implementing another strategy, based on an index for the faster retrieval of neighboring model elements, is the main objective of my GSoC project. In its current form, this indexing process leverages the following concept:
  1. transform the problem to a metric space where a notion of distance is defined and "cheaper" to compute → k-dimensional space with Euclidian distance
  2. provide an algorithm to map the model elements into this metric space while keeping domain-specific-similarity → FastMap
  3. use an appropriate data structure for the metric space enabling efficient nearest neighbor search → KDtree
Based on such indexed model elements - i.e. model elements represented by coordinates in a k-dimensional space - the matcher component can follow this path:
  • for each provided model element of version 1, find the immediate neighbors (all model elements with minimum Euclidian distance) of version 2
  • for each of these immediate neighbors of version 2, compute their immediate neighborhood and verify that the model element of version 1 is also an (the) immediate neighbor
  • if there are several "candidates" with the same minimum distance, delegate to the EMFSimilarity score for final retrieval of the "best matching" neighbor
...

Space space1 = new TreeBackedSpace();
  for (EObject object1 : objects1) {
   space1.add(object1, provideCoordinates(object1));
  }

  Space space2 = new TreeBackedSpace();
  for (EObject object2 : objects2) {
   space2.add(object2, provideCoordinates(object2));
  }

  NeighborhoodComputor neighborhoodComputor1 = space1.provideNeighborhoodComputor();
  NeighborhoodComputor neighborhoodComputor2 = space2.provideNeighborhoodComputor();

  for (EObject object1 : objects1) {
   double[] coordinates1 = space1.getCoordinates(object1);
   Neighborhood neighborhood2 = new ImmediateNeighborhood();
   neighborhoodComputor2.computeNeighborhood(neighborhood2, coordinates1, METRIC);

   if (neighborhood2.size() == 0)
    continue;

   EObject bestObject2 = null;
   double bestDistance = Double.POSITIVE_INFINITY;
   for (EObject object2 : neighborhood2) {
    double[] coordinates2 = space2.getCoordinates(object2);
    Neighborhood neighborhood1 = new ImmediateNeighborhood();
    neighborhoodComputor1.computeNeighborhood(neighborhood1, coordinates2, METRIC);

    if (!neighborhood1.contains(object1))
     continue;

    double distance = fastMap.distance(object1, object2);
    if (bestObject2 == null || Double.compare(distance, bestDistance) < 0) {
     bestObject2 = object2;
     bestDistance = distance;
    }
   }

   if (bestObject2 == null)
    continue;

   space1.remove(object1);
   space2.remove(bestObject2);
   handler.handleMatch(object1, bestObject2, bestDistance);
  }

  for (EObject object1 : space1) {
   handler.handleNoMatch1(object1);
  }

  for (EObject object2 : space2) {
   handler.handleNoMatch2(object2);
  }
 }
...
The full source code of the matcher component can be found in the implementation of the FastMapMatcher.

If you think that the code is the most understandable part of this blog post, then I know I haven't completely messed it up :-) Feedback welcome!

August 08, 2010

What’s Who‘s cooking?

There are already some quite nice introductions on what EMF Compare is trying to solve and in which way. I really recommend Cédric’s "What's cooking" post, so check it out if you need some starting point to this subject.

Providing new concepts and ideas for one fundamental part of the whole EMF comparison process - the matching phase - is the main objective of my ongoing GSoC project. Diving into the internals of the matching algorithm with its "cook" - the generic match engine - is therefore necessary to "open" EMF Compare for further adoption.

In this post I'll describe my understanding of the EMF Compare matching algorithm and the suggested changes on the current implementation of the generic match engine. In the next post I'll show by means of a newly developed matching engine, how these refactorings would allow to plug in other matching algorithms to the whole comparison process.

The EMF comparison process is divided into two distinct phases, the matching and the differencing phase. I will focus on the matching phase - browsing the different model versions and figuring out which model element matches which one.

The generic match engine has several responsibilities. It provides the different versions of the model elements - handling initialization, loading and preprocessing, e.g. usage of XMI resp. Ecore IDs for comparison or reference to structural similarity. It splits the whole search space into smaller processable blocks (=search context) based on similar root model elements. Within each search context, all associated model elements of one version are compared with the model elements of the other version. Different kinds of similarity metrics are leveraged in this step to derive appropriate matchings - still unmatched model elements are reprocessed in a further run. Also note that this matching step is applied recursively. Some kind of bookmarking/caching is performed based on the computational effort for calculating the similarity scores. Finally it creates the match model based on the matched model element pairs.

Although the generic match engine provides several hooks for customization, its still tight coupling makes it rather difficult to extend. Splitting up responsibilities would largely help for adoption.

The match engine keeps the "cook", handling data and orchestrating the overall workflow, but delegating the matching process to a separate component, the matcher.

This matcher component encapsulates the logic for the actual matching algorithm. In the case of the current implementation, it relies on pairwise matchings of all provided model elements (actually these pairwise matchings are only applied to a subset of the provided model elements bounded by a search window with a fixed size for performance reasons). In the next post I'll show how another matching algorithm fits into this modified structure.

The match similarity component is heavily used by the matcher component to compute similarity scores for a pair of model elements. Making it stateless (i.e. outsourcing caching concerns) would make it much easier to plug in your own similarity metrics.
These changes are currently tracked by a patch included in my accompanying GSoC project page. So if you are interested, just check it out.