
Reducing the database load

In the previous post we discussed the importance of a unique ID for every record. Still, we will update the records multiple times a day even if nothing has changed, because we scrape the data on a fixed schedule. Assuming 1 million records, 10 different sources, and a scrape interval of 5 minutes, we produce a write load of 1M records × 10 sources every 5 minutes. That is 120M records per hour, or roughly 33K requests per second, which has the potential to overload the database, depending on the technology.
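The full post presumably explains the mitigation in detail; as a minimal sketch of one common approach (assuming a key-value style store; `db`, `upsert`, and all other names here are hypothetical), the pipeline can hash each scraped record and skip the database write when the content has not changed since the last scrape:

```python
import hashlib
import json

# Last fingerprint written per (source, record_id) pair. A real pipeline
# would keep this in Redis or next to the record itself; an in-memory
# dict is enough to illustrate the idea.
last_seen: dict[tuple[str, str], str] = {}

def record_fingerprint(record: dict) -> str:
    """Stable content hash of a record; key order must not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def upsert_if_changed(source: str, record_id: str, record: dict, db) -> bool:
    """Write the record only when its content actually changed.

    `db` is a hypothetical client exposing an upsert(id, record) call.
    Returns True if a database write was issued.
    """
    fingerprint = record_fingerprint(record)
    key = (source, record_id)
    if last_seen.get(key) == fingerprint:
        return False  # unchanged since the last scrape: skip the write
    db.upsert(record_id, record)
    last_seen[key] = fingerprint
    return True
```

If only a small fraction of records actually changes between scrapes, most of the 33K requests per second become cheap in-memory comparisons instead of database writes.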

2023-03-18

Unique ID

Deduplication of the data acquired in the distributed data pipeline is accomplished by using a common ID: all records that relate to the same physical location need to end up with the same unique ID. Starting out, there are no unique IDs yet; we only have data sources that cannot contribute data to any existing record. Looking back to the source A example from the previous post, we will start with only:
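The excerpt breaks off before the concrete example, so the post's actual matching logic is not shown here. As a hedged sketch of the general idea only (not the post's method; all names hypothetical), one way to converge on a common ID is to match incoming records to known stations by geographic proximity and mint a new ID when nothing is close enough:

```python
import math
import uuid

def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate ground distance in metres (equirectangular projection,
    accurate enough at station-to-station scale)."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    return 6371000 * math.hypot(dlat, dlon)

# Known station locations: unique_id -> (lat, lon)
known: dict[str, tuple[float, float]] = {}

def resolve_id(lat: float, lon: float, radius_m: float = 25.0) -> str:
    """Return the unique ID of the station at this location,
    minting a new one if no known station is within radius_m."""
    for uid, (klat, klon) in known.items():
        if distance_m(lat, lon, klat, klon) <= radius_m:
            return uid
    uid = uuid.uuid4().hex
    known[uid] = (lat, lon)
    return uid
```

The linear scan keeps the sketch short; at a million records, a spatial index such as geohash buckets or an R-tree would be used for the neighbour lookup.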

2023-02-19

Distributed Multi Source Continuous Data Pipeline

In this series I will detail a solution to a common problem in distributed data aggregation. We want to build a web application that displays current price and location data for EV charging stations on a map. The data is scraped from websites, provided by government agencies, or drawn from similar sources. The solution described here has been in production for years, so it is known to solve the problem stated above.

2023-01-21