Unique ID

Deduplication of the data acquired by the distributed data pipeline is accomplished using a common unique id: all records that relate to the same physical location need to end up with the same unique id.

Starting out, there are no unique ids yet; we only have data sources that cannot contribute data to any existing record. Looking back at the source A example from the previous post, we start with only:

{
   "_source": "A",
   "id": "optimile-a82s12",
   "name": "Optimile",
   "street": "Munthofstraat",
   "city": "1060 Saint-Gilles"
}

We have two options for creating the unique id: a manual or an automated process. Both have pros and cons; I will only touch on the main differences.

Manually creating the records results in higher quality, because humans are good at cross-referencing multiple data points to validate a location. For example, one can use Google Maps, company websites and other public sources to confirm its validity, or even call the business if a phone number is available.

Automated processes usually lead to more duplicates and lower-quality results, especially when multiple companies have owned a location over time or when positioning the actual location is difficult, e.g. when map data differs from postal data. Using multiple sources and a scoring-based algorithm, the creation of the unique id can be largely automated; the manual process then only corrects errors.
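As a rough illustration, here is a minimal scoring-based matcher in Python. The fields, weights and the 0.85 threshold are illustrative assumptions, not values from the actual pipeline.

from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(candidate, existing):
    """Weighted score for how likely two records describe the same location."""
    weights = {"name": 0.4, "street": 0.4, "city": 0.2}
    score = 0.0
    for field, weight in weights.items():
        if field in candidate and field in existing:
            score += weight * similarity(candidate[field], existing[field])
    return score

def find_matching_id(candidate, existing_records, threshold=0.85):
    """Return the unique id of the best match above the threshold, or None."""
    best = max(existing_records, key=lambda r: match_score(candidate, r), default=None)
    if best and match_score(candidate, best) >= threshold:
        return best["_id"]
    return None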

Once the record is clearly identified as either belonging to an existing location or being a new one, its unique id (the existing UUID or a newly generated one) is postfixed with the name of the source, e.g. "d39fed96-58ef-446d-9039-d1c56e7be592/A".
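Generating such an id for a new record is trivial; a small Python sketch, assuming UUID version 4:

import uuid

def new_unique_id(source):
    """Generate a fresh UUID and postfix it with the source name."""
    return f"{uuid.uuid4()}/{source}"

# new_unique_id("A")  ->  e.g. "d39fed96-58ef-446d-9039-d1c56e7be592/A"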

To avoid creating problems in the main dataset, records are stored in a separate database until they have been assigned an id.

So we end up with the following set of databases:

| Name   | Unique _id               | Description                                |
|--------|--------------------------|--------------------------------------------|
| new    | no                       | containing all records not yet identified  |
| source | yes, with source postfix | containing all the source records          |
| final  | yes, only UUID           | containing all merged records              |
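As one possible concrete setup (an assumption, not prescribed by the pipeline), the three databases could be modelled as collections inside a single MongoDB database accessed through pymongo; connection string and names are placeholders. The later sketches reuse these collection handles.

from pymongo import MongoClient

# Placeholder connection string and names.
client = MongoClient("mongodb://localhost:27017")
db = client["locations"]

new_records = db["new"]        # records without a unique id yet
source_records = db["source"]  # per-source records, _id = "<uuid>/<source>"
final_records = db["final"]    # merged records, _id = "<uuid>"

# Incoming records are looked up by source and source-specific id,
# so a unique index on that pair keeps the upserts cheap and safe.
new_records.create_index([("_source", 1), ("id", 1)], unique=True)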

A record goes “through” the databases according to the following pseudo code:

record = // ... source-dependent mechanism to get the record data
srcRecord = db.source.find({ "_source": "A", "id": record.id })
if srcRecord {
  db.source.update(srcRecord.updateFields(record))
} else {
  newRecord = db.new.find({ "_source": "A", "id": record.id })
  if newRecord {
    db.new.update(newRecord.updateFields(record))
  } else {
    db.new.create(record)
  }
}
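Translated into Python with pymongo (reusing the collection handles from the setup sketch above), the ingestion step could look like the sketch below; update_one with upsert=True covers both the update and the create branch of the "new" database.

def ingest(record, source):
    """Upsert an incoming source record, mirroring the pseudo code above."""
    query = {"_source": source, "id": record["id"]}

    # Already identified: keep the source record up to date.
    if source_records.find_one(query) is not None:
        source_records.update_one(query, {"$set": record})
        return

    # Not identified yet: upsert into the "new" database instead.
    new_records.update_one(query, {"$set": record}, upsert=True)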

We assume there is an automated process that creates new final records like this:

watcher = db.source.watch()
for event = watcher.next() {
  // collect all source records sharing the same UUID prefix and merge them
  srcRecords = db.source.findAllWithPrefix({ "_id": event.IdPrefix() })
  finalRecord = merge(srcRecords)

  curRecord = db.final.find({ "_id": finalRecord._id })
  if curRecord {
    db.final.update(curRecord.updateFields(finalRecord))
  } else {
    db.final.create(finalRecord)
  }
}
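A possible Python equivalent, again reusing the collection handles from the setup sketch. The merge strategy shown is only a placeholder (later records simply overwrite earlier fields); the real merge logic would go there. It relies on MongoDB change streams, which is what watch() exposes.

def merge(records):
    """Placeholder merge: later records simply overwrite earlier fields."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["_id"]):
        merged.update({k: v for k, v in rec.items() if k not in ("_id", "_source", "id")})
    return merged

def watch_source():
    """Rebuild the merged final record whenever a source record changes."""
    with source_records.watch() as stream:
        for event in stream:
            source_id = event["documentKey"]["_id"]   # "<uuid>/<source>"
            prefix = source_id.split("/", 1)[0]       # the shared UUID

            related = list(source_records.find({"_id": {"$regex": f"^{prefix}/"}}))
            if not related:
                continue
            final = merge(related)
            final["_id"] = prefix

            # Upsert the merged record into the final database.
            final_records.replace_one({"_id": prefix}, final, upsert=True)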

Assuming a manual assignment flow, we only need a UI that allows setting the unique id of the record, deleting it from the new database after creating it in the source database:

newRecord = // ... selected in the frontend to be a new record
// in case the record is assigned to an existing id, use the existing id instead
newRecord._id = UUID.new + "/" + newRecord._source

srcRecord = db.source.find({ "_id": newRecord._id })
if srcRecord {
  db.source.update(srcRecord.updateFields(newRecord))
} else {
  db.source.create(newRecord)
}
// the record is now identified, so remove it from the new database
db.new.delete({ "_source": newRecord._source, "id": newRecord.id })
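The same flow as a Python sketch; assign_unique_id is a hypothetical handler called by the UI, and it reuses the collection handles from the setup sketch above.

import uuid

def assign_unique_id(new_record, existing_id=None):
    """Move a record from the "new" database into the "source" database.

    existing_id is the UUID chosen in the UI when the record belongs to an
    already known location; None means a brand new location.
    """
    uid = existing_id or str(uuid.uuid4())
    source_id = f"{uid}/{new_record['_source']}"

    doc = {k: v for k, v in new_record.items() if k != "_id"}
    doc["_id"] = source_id

    # Upsert into the source database, then remove the now-identified record.
    source_records.replace_one({"_id": source_id}, doc, upsert=True)
    new_records.delete_one({"_source": new_record["_source"], "id": new_record["id"]})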

In the next post we will look at ways to reduce the database load.