Posted 26 November 2019
by DeltaXignia


Matching Records that are in No Particular Order

By default, XML Data Compare treats elements as ordered when it aligns them for comparison. However, some data, such as the entries in an electronic address book, can be stored in any order. When XML is used for data rather than documents, the order of elements is often unimportant during capture or storage. Thankfully, there's a quick way to obtain matches for such data when using XML Data Compare.

Ordered Data

I looked at the public data from the USA: Biodiversity by County – Distribution of Animals, Plants and Natural Communities (see here). There was just one file published, so I split out the records for two counties, Albany and Yates, and decided to compare them. There is a container called rows with multiple row elements, each looking like this:


Animal
Amphibians
Frogs and Toads
Lithobates sylvaticus
Wood Frog
1990-1999
Game with open season
not listed
S5
G5
Recently Confirmed

A comparison with an empty config file was not very helpful, as the records were not in the same order in the two files.
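To see why an ordered comparison struggles here, consider a minimal Python sketch. It is not how XML Data Compare works internally. It only contrasts pairing records by position with treating them as an unordered collection. The species lists are made up for illustration (only the Wood Frog appears in the sample record above).

```python
# Hypothetical record lists: the same three species, stored in a different order.
albany = ["Wood Frog", "Spotted Salamander", "Bald Eagle"]
yates = ["Bald Eagle", "Wood Frog", "Spotted Salamander"]

# Ordered alignment pairs records by position, so every pair looks like a change.
positional_mismatches = [(a, y) for a, y in zip(albany, yates) if a != y]

# Order-independent comparison treats each list as a set of records.
only_in_albany = set(albany) - set(yates)
only_in_yates = set(yates) - set(albany)

print(len(positional_mismatches))     # 3
print(only_in_albany, only_in_yates)  # set() set()
```

Position by position, all three records appear to differ, yet the two lists contain exactly the same records.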

Specifying Orderless Data

By using the config file to tell the comparison to ignore the order, the heuristic algorithm will be used for matching. The location of the elements to be considered orderless is the container, the rows element. This is specified in the config file using a location element:




This gives a much more helpful comparison, with row elements that match looking like this:

[Image: Text-rows-highlighting-found-changes.png]
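The idea of heuristic matching can be sketched in a few lines of Python. This is an illustration only and makes no claim about the algorithm XML Data Compare actually uses: it greedily pairs each row with the most similar unused row from the other file. The function names, the 0.5 threshold and the sample values are assumptions (only the Wood Frog names and the 1990-1999 range come from the sample record).

```python
def field_similarity(row_a, row_b):
    """Share of positions at which two rows hold the same value (0.0 to 1.0)."""
    same = sum(1 for a, b in zip(row_a, row_b) if a == b)
    return same / max(len(row_a), len(row_b))


def match_orderless(rows_a, rows_b, threshold=0.5):
    """Greedily pair each row in rows_a with its most similar unused row in rows_b."""
    unused = list(range(len(rows_b)))
    pairs, only_a = [], []
    for row_a in rows_a:
        # Pick the best remaining candidate; None when rows_b is exhausted.
        best = max(unused, key=lambda j: field_similarity(row_a, rows_b[j]), default=None)
        if best is not None and field_similarity(row_a, rows_b[best]) >= threshold:
            pairs.append((row_a, rows_b[best]))
            unused.remove(best)
        else:
            only_a.append(row_a)
    only_b = [rows_b[j] for j in unused]
    return pairs, only_a, only_b


# Hypothetical rows: (scientific name, common name, year last documented).
albany = [
    ("Lithobates sylvaticus", "Wood Frog", "1990-1999"),
    ("Haliaeetus leucocephalus", "Bald Eagle", "2000-2009"),
]
yates = [
    ("Haliaeetus leucocephalus", "Bald Eagle", "2000-2009"),
    ("Lithobates sylvaticus", "Wood Frog", "1980-1989"),
    ("Ambystoma maculatum", "Spotted Salamander", "1990-1999"),
]

pairs, only_albany, only_yates = match_orderless(albany, yates)
print(len(pairs), only_albany, only_yates)
```

Here the two Wood Frog rows are paired even though one field differs, and the salamander row is reported as present only in the second list. A greedy pass like this can pick poor pairs on ambiguous data, which is one reason a real tool needs a more careful heuristic.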

Ignoring Attributes

However, there is still some tidying up to do. We are not interested in the attributes so they can be ignored by adding this to the config file:












So now the row elements that match exactly are collapsed.  Rows that only exist in Albany or in Yates are shown clearly:

[Image: row-match-using-keyignore-attributes-1.png]
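The effect of ignoring attributes can be illustrated outside the product with Python's standard library. In this sketch the `_id` attribute and the `common_name` element are hypothetical names chosen for the example, and stripping attributes before comparing is simply one way to mimic the behaviour, not the product's mechanism.

```python
import xml.etree.ElementTree as ET


def strip_attributes(root):
    """Remove every attribute from root and all of its descendants, in place."""
    for element in root.iter():
        element.attrib.clear()
    return root


# Hypothetical rows that differ only in an attribute value.
albany = ET.fromstring('<rows><row _id="1"><common_name>Wood Frog</common_name></row></rows>')
yates = ET.fromstring('<rows><row _id="57"><common_name>Wood Frog</common_name></row></rows>')

equal_before = ET.tostring(albany) == ET.tostring(yates)
equal_after = ET.tostring(strip_attributes(albany)) == ET.tostring(strip_attributes(yates))
print(equal_before, equal_after)  # False True
```

With the attributes out of the way, the two rows serialize identically, which is why such rows can be collapsed as exact matches.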

Where an animal or plant has a record in both counties but the records vary, the difference is shown clearly, as in this example where the year last documented is different:

[Image: response-animals-with-small-changes.png]

Using a Keyed Comparison

The heuristic algorithm has to do a lot of work to decide how rows align, and it can't take account of the meaning of the data. If you know which fields uniquely identify each row, you can add a key that tells the comparison what to match and speeds it up.

In this case the scientific names can be used as a unique key. The extra line in the config file specifies a key as on the third line below:





The location is the container, the rows element. Within the container, the elements to match are the row elements, and the key to use is the scientific name.

In this case, the matches are exactly the same as when using only the built-in heuristic algorithm, and the comparison ran a few percent faster. Specifying a key might give a more accurate match in some cases but is usually not necessary.
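A keyed comparison is easy to picture as a dictionary lookup. The following Python sketch is a rough illustration under assumed field names (`scientific_name`, `year_last_documented`) and made-up values. It is not the product's implementation, but it shows why a unique key removes the guesswork from alignment.

```python
def compare_by_key(rows_a, rows_b, key):
    """Align two lists of dict records on a unique key field and report differences."""
    index_a = {row[key]: row for row in rows_a}
    index_b = {row[key]: row for row in rows_b}
    only_a = sorted(index_a.keys() - index_b.keys())
    only_b = sorted(index_b.keys() - index_a.keys())
    changed = {}
    for k in index_a.keys() & index_b.keys():
        fields = index_a[k].keys() | index_b[k].keys()
        diffs = {
            f: (index_a[k].get(f), index_b[k].get(f))
            for f in fields
            if index_a[k].get(f) != index_b[k].get(f)
        }
        if diffs:
            changed[k] = diffs
    return only_a, only_b, changed


# Hypothetical records keyed by scientific name.
albany = [
    {"scientific_name": "Lithobates sylvaticus", "year_last_documented": "1990-1999"},
    {"scientific_name": "Haliaeetus leucocephalus", "year_last_documented": "2000-2009"},
]
yates = [
    {"scientific_name": "Ambystoma maculatum", "year_last_documented": "1990-1999"},
    {"scientific_name": "Lithobates sylvaticus", "year_last_documented": "1980-1989"},
]

only_albany, only_yates, changed = compare_by_key(albany, yates, "scientific_name")
print(only_albany)  # ['Haliaeetus leucocephalus']
print(only_yates)   # ['Ambystoma maculatum']
print(changed)      # {'Lithobates sylvaticus': {'year_last_documented': ('1990-1999', '1980-1989')}}
```

Each row is found with a single lookup rather than by scoring it against every candidate, which is consistent with the speed-up described above. The approach only works if the key really is unique within each file.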

More Details

For more information on how to compare orderless data, see our Orderless Comparison guide.

There is a range of samples available on Bitbucket.

© 2000-2025 DeltaXML Ltd. registered in England and Wales (Company No. 2528681), trading as DeltaXignia. All rights reserved