Reducing the Database Load

In the previous post we discussed the importance of a unique ID for every record. Still, we will update the records multiple times a day even if nothing changes, because we scrape the data. Assuming 1 million records with 10 different sources each and a scrape interval of 5 minutes, we easily reach a database load of 1M records * 10 sources every 5 minutes, which equals 120M writes per hour, or roughly 33K writes per second. That has the potential to overload the database, depending on the technology.
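
To make the numbers concrete, a quick back-of-the-envelope calculation (the figures are the assumptions from above, not measurements):

package main

import "fmt"

func main() {
	records := 1_000_000      // assumed number of records
	sources := 10             // assumed number of sources per record
	intervalSeconds := 5 * 60 // scrape interval of 5 minutes

	writesPerInterval := records * sources
	writesPerHour := writesPerInterval * (3600 / intervalSeconds)
	writesPerSecond := writesPerInterval / intervalSeconds

	fmt.Printf("%d writes every 5 minutes, %d per hour, ~%d per second\n",
		writesPerInterval, writesPerHour, writesPerSecond)
}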

Here we need to check our requirements closely. Some values are helpful to know: we want to know when the price was last updated and when it was last checked. The problem is that if we check for the latest price and it didn't change, we would not update the record, and it then looks as if the record is stale.

For example, at 9:55 the price is 1.55 and before that the price was different, so we update the value. At 10:00 we check the price again and it is still 1.55. We want to indicate that the price is still the latest one, but avoid the load on the database.

To solve the above problem we can record the date of the last change and the interval at which we check the value. We then do not update the record when nothing has changed, but can still display the following information to the customer:

Price: 1.55 (last changed 9:55, last checked 10:00)

Example code for the calculation:

package main

import (
	"fmt"
	"time"
)

func main() {
	// example input data
	nowStr := "2009-11-10T10:12:32Z"
	lastChangedStr := "2009-11-10T09:55:00Z"

	now, _ := time.Parse(time.RFC3339, nowStr)
	lastChanged, _ := time.Parse(time.RFC3339, lastChangedStr)

	// the check interval is known, so the time of the last check can be
	// derived from the last change instead of being written on every check
	interval := time.Minute * 5
	duration := now.Sub(lastChanged)
	lastCheck := lastChanged.Add(duration.Truncate(interval))

	fmt.Printf("Price: 1.55 (last changed %s, last checked %s)\n",
		lastChanged.Format("15:04"), lastCheck.Format("15:04"))
}
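
With the example input this prints Price: 1.55 (last changed 09:55, last checked 10:10): the last check time is derived from the last change and the known check interval, so no extra database write is needed to keep it current.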

With the above out of the way we can now focus on avoiding updates to the database.

Every record has many different parts that are relevant for our change detection. Generally, using hashes to avoid unnecessary updates is one of the best options: compute a hash over the relevant fields, compare it against the stored hash, and only write when it differs (a sketch follows the pitfalls below). Given the following example record:

{
   "_id": "d39fed96-58ef-446d-9039-d1c56e7be592/A",
   "_source": "A",
   "id": "optimile-a82s12",
   "name": "Optimile",
   "street": "Munthofstraat",
   "city": "1060 Saint-Gilles",
   "psi": "ksad78778123",
   "prices": [
     { "name": "product A", "price": 1.55 },
     { "name": "product B", "price": 1.32 }
   ]
}

Common pitfalls

  • Arrays might not be ordered consistently and can confuse the calculation of the hash
  • Key/value maps, depending on the programming language, are not ordered either
  • There are values (in this example psi) that are random on every check; they need to be excluded from the hash, better even excluded from being stored in the database at all
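
A minimal sketch of such a hash calculation over the example record above (struct and function names are illustrative): the field order is fixed via a struct, the prices array is sorted, and psi is left out entirely.

package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"sort"
)

type Price struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

type Record struct {
	ID     string  `json:"id"`
	Name   string  `json:"name"`
	Street string  `json:"street"`
	City   string  `json:"city"`
	Prices []Price `json:"prices"`
	// psi is deliberately not part of the struct: it is random on every check
}

// contentHash builds a canonical representation of the relevant fields
// (fixed field order via the struct, prices sorted by name) and hashes it.
func contentHash(r Record) string {
	sorted := make([]Price, len(r.Prices))
	copy(sorted, r.Prices)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	r.Prices = sorted

	b, _ := json.Marshal(r)
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	r := Record{
		ID:     "optimile-a82s12",
		Name:   "Optimile",
		Street: "Munthofstraat",
		City:   "1060 Saint-Gilles",
		Prices: []Price{
			{Name: "product B", Price: 1.32},
			{Name: "product A", Price: 1.55},
		},
	}
	fmt.Println(contentHash(r))
	// an update is only written when this hash differs from the stored one
}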

In case the database load can't be reduced because the data is changing constantly, we can usually make use of one property of the data: the records are related to geo-locations. Sharding the data by geo-location is a good way to distribute the load over databases, but it also makes the code that has to find duplicates more complicated, as one has to search around the edges of the database shards.
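
A minimal sketch of a geo-based shard key, assuming each record carries a latitude/longitude; the grid size, shard count and function name are illustrative:

package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// shardFor maps a location to one of n shards by snapping it onto a coarse
// grid cell and hashing the cell. Records in the same cell land in the same
// shard; duplicates near cell borders may not, which is why the edges of the
// shards still have to be searched separately.
func shardFor(lat, lon float64, n uint32) uint32 {
	const cellSize = 0.5 // grid size in degrees, illustrative
	cell := fmt.Sprintf("%d:%d",
		int(math.Floor(lat/cellSize)),
		int(math.Floor(lon/cellSize)))
	h := fnv.New32a()
	h.Write([]byte(cell))
	return h.Sum32() % n
}

func main() {
	fmt.Println(shardFor(50.83, 4.34, 8)) // Saint-Gilles area, 8 shards assumed
}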

Simply sharding the data by country is not recommended. The size distribution of small and big countries doesn't map well onto evenly sized shards. Another disadvantage is that the change rate of one country can differ significantly from another due to legislation and similar circumstances.

One easy and recommendable way of sharding is to put the source data and the production data into two separate databases. This also allows tuning the parameters of each database separately: the production database is mostly read from (besides the newly updated records), while the source database is mostly written to.
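
A minimal sketch of how the two databases could work together with the hash check from above; the store interfaces and method names are hypothetical placeholders, not a specific database API:

package ingest

// SourceStore is a placeholder for the source database client (hypothetical).
type SourceStore interface {
	SaveRaw(id string, payload []byte) error
}

// ProductionStore is a placeholder for the production database client (hypothetical).
type ProductionStore interface {
	StoredHash(id string) (string, error)
	UpsertRecord(id, hash string, payload []byte) error
}

// Ingest writes every scraped payload to the source database, but only
// touches the production database when the content hash actually changed.
func Ingest(source SourceStore, production ProductionStore, id, hash string, payload []byte) error {
	if err := source.SaveRaw(id, payload); err != nil {
		return err
	}
	current, err := production.StoredHash(id)
	if err != nil {
		return err
	}
	if current == hash {
		return nil // unchanged: no write to the production database
	}
	return production.UpsertRecord(id, hash, payload)
}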

Summary

With a few techniques the database load can be greatly reduced to a reasonable level that reflects the actual change rate. Sharding the database is the last resort.