Building a solution that is robust, efficient, and sustainable first requires a strong foundation. For any venture relying on data, understanding that data is the foundation that will drive the project to success. I've designed, built, and rescued dozens of enterprise data stores in kdb+ over the last 15 years, and it's eye-watering how much time and effort is wasted through that fundamental lack of understanding.
So, the first task of any kdb+ architect is to understand the data and how it will be used. Interpreting the data and planning for the volume is more than just knowing the number of messages and the size of a typical message. That is why we've provided some insights on how you can better understand your data and how kdb+ should be used to ensure you get the maximum benefit for your business.
Know your schema
It can be tempting to try to normalize data as much as possible. For example, if you're capturing trades for multiple asset classes (such as equities/FX/options) within the application, it might be possible to normalize them into a single trade table. There may be plenty of common fields that make it easy to store them all together, but what about the columns that only apply to certain trade types? They will be left null, many times over, which carries a sizeable memory overhead for in-memory kdb+ tables. There are options to minimize this impact, such as persisting data to disk intraday or using compression, but these typically make the data less accessible. The architect therefore needs to understand how the data is going to be used in order to weigh the trade-off between memory overhead and ease of access. In my experience, it is usually better to have separate tables per asset class but, unfortunately, in many circumstances the client only realizes this once the original developer has moved on.
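As a minimal sketch of that trade-off (the column names here are illustrative, not from any real feed), compare one normalized table, where option-only fields sit null for every equity or FX row, with separate per-asset-class schemas:

```q
/ one normalized table: strike/expiry/ccyPair are null for most rows
trade:([] time:`timestamp$(); sym:`$(); price:`float$(); size:`long$();
  ccyPair:`$(); strike:`float$(); expiry:`date$())

/ separate tables per asset class avoid the null overhead entirely
equityTrade:([] time:`timestamp$(); sym:`$(); price:`float$(); size:`long$())
optionTrade:([] time:`timestamp$(); sym:`$(); price:`float$(); size:`long$();
  strike:`float$(); expiry:`date$())
fxTrade:([] time:`timestamp$(); ccyPair:`$(); price:`float$(); size:`long$())
```

With the separate tables, every column is populated on every row, and each user group queries only the table relevant to its asset class.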
When finalizing the schema of each table, get to grips with the different datatypes available in kdb+ and when it's suitable to use them. This will not only deliver optimal storage volumes but also have a positive impact on performance. Using the kdb+ symbol datatype for repeating strings will create an enumerated vector and ensure optimal retrieval performance. In contrast, using the same datatype for non-repeating strings will result in poor performance because the enumerated vector will become 'bloated'. Where necessary, kdb+ offers alternatives for storing non-repeating strings that still allow reasonable retrieval speed.
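A small sketch of the distinction (the column names are illustrative): tickers repeat constantly and belong in a symbol column, while something like a unique order identifier should stay as a string (char vector) so it never enters the enumeration.

```q
/ repeating values: a symbol column enumerates to a small domain
t:([] sym:`AAPL`MSFT`AAPL`IBM; px:4?100f)

/ non-repeating values (e.g. unique order ids): keep as strings so
/ the sym enumeration does not grow without bound
o:([] orderId:("ORD001";"ORD002";"ORD003";"ORD004"); px:4?100f)

meta t  / sym shows as type s (symbol); orderId in o is type C (string)
```

Every distinct symbol ever seen is interned for the life of the process, which is exactly why unbounded values make a poor fit for the symbol type.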
It is important to look at how the data will be consumed. Are users going to subscribe to updates based on events published into the system, or will they request the data as required? For data subscribers, in addition to the latency/throughput considerations discussed below, consider how the subscriber will identify the data. Ask yourself:
- What columns or combination of columns will they want to filter on?
- Do we have the necessary attributes set up to allow this filtering to be done efficiently?
- Do they need aggregated/enriched data?
- What update rate can they handle?
- Do we need to implement chained publishers to handle the potential impact of slow client consumers?
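For the filtering question in particular, a minimal sketch of the standard approach is to put the grouped (`` `g# ``) attribute on the column subscribers most commonly filter by, typically the instrument:

```q
/ grouped attribute on sym gives constant-time lookup by instrument
quote:([] time:`timestamp$(); sym:`g#`$(); bid:`float$(); ask:`float$())

/ a subscriber filter on sym now uses the group index rather than a scan
select from quote where sym=`AAPL
```

The attribute is maintained as rows are appended, so it suits an in-memory real-time table; the cost is extra memory for the index, which is part of the same memory/access trade-off discussed above.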
In all my years architecting kdb+ solutions I have yet to come across one that didn’t need to cater for a spectrum of customizations – and trying to retrofit these after the initial system has been deployed is a real pain and source of dissatisfaction for end user groups.
Latency vs Throughput
Once you have analyzed the schema and understand the users, it is time to ask yourself: how time-critical is the use case? kdb+ excels at processing batches of data, so it's typical for the processes publishing data into it to push batches on a short timer. This significantly improves throughput, at the cost of a slight latency (usually in the 10-100 ms range).
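A minimal sketch of timer-based batching (here `h` is assumed to be an already-open handle to a downstream subscriber, and `upd` the hypothetical callback receiving raw ticks):

```q
buf:()                                   / in-memory batch buffer
upd:{[t;x] buf,:enlist(t;x)}             / collect each update as it arrives
.z.ts:{ if[count buf;                    / on each timer tick...
  neg[h](`upd;`trade;buf);               / ...publish the batch async
  buf::()] }                             / ...and reset the buffer
\t 100                                   / fire the timer every 100 ms
```

Because the batch size and the `\t` interval are both tunable at runtime, the same mechanism provides the spike protection described next.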
Spikes in financial data are expected, either due to typical operating situations such as market open and auction periods, or because of world events, such as Brexit or the impact of the Covid-19 pandemic. Helpfully, the ability to change the batch size and/or publish frequency provides additional resilience against such incidents. Consuming data on a timer, while still satisfying client requirements, has saved my bacon on at least two monumental spikes – when Lehman went under in 2008 and during the 2010 flash crash – and there have been plenty of other examples since.
Data storage & persistence
The majority of data within a kdb+ application will be persisted on disk and served by a process known as the HDB (historical database). Because these databases can grow to many terabytes or even petabytes, the correct storage architecture must be implemented.
When considering how to define the storage, identify how the data will be accessed. Older data which may be accessed less frequently could be stored on cheaper storage and newer, more critical data, stored on the faster storage. Ask yourself, is network storage appropriate (for all or part of the data) to increase reliability in case of machine/process failures?
In addition, the schema used for storage is important. Disk performance is usually the bottleneck; therefore, the aim is to minimize the amount of data read from disk by understanding how the data will be accessed. If the most common use case is expected to be a subset of instruments over a period of days, it makes sense to use the map-reduce features of kdb+ and partition the data into date folders across multiple disks. This allows you to search the specified dates for the subset of instruments requested and map the data in as necessary from each disk concurrently to maximize I/O. If users are likely to request data for all instruments for a single date, it makes sense to split data by instrument across multiple disks, so you get concurrent read access for a single date.
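A minimal sketch of the date-partitioned layout (the `/db` path and partition dates are hypothetical); `.Q.dpft` writes one date partition, enumerated and parted on `sym`, and a `par.txt` file can then spread the date folders across disks:

```q
/ write today's in-memory trade table as one date partition,
/ enumerated against /db/sym and parted on the sym column
.Q.dpft[`:/db;2024.01.02;`sym;`trade]

/ /db/par.txt listing e.g. /disk1/db and /disk2/db spreads
/ alternate date partitions across physical disks

/ a query constrained on date and sym maps in only the needed
/ partitions, and only the needed columns within them
select from trade where date within 2024.01.02 2024.01.03, sym in `AAPL`MSFT
```

The virtual `date` column comes from the folder names, so the date constraint prunes whole partitions before any column data is touched.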
Using 100GB of typical NYSE trade data as a benchmark data set, the graph below profiles the performance of a query selecting the same subset of data as the volume of data grows over time. With non-optimized partitions, the query performance will get steadily worse as the volume of data grows even though the query is only requesting the same data. With optimized partitions, there’s no performance degradation as the data volumes grow.
Also, consider the query benefit of pre-calculating some common stats (such as daily OHLC/volumes/vwaps) vs storage costs. If there are common enrichments which are frequently accessed and low in volume, it can be beneficial to store these.
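A sketch of such a pre-calculation, assuming a date-partitioned `trade` table with `price` and `size` columns: the daily summary is tiny relative to the tick data, so common dashboard queries never need to touch the raw trades.

```q
/ precompute daily OHLC, volume and vwap per instrument
daily:select open:first price, high:max price, low:min price,
  close:last price, volume:sum size, vwap:size wavg price
  by date, sym from trade
```

Running this once at end-of-day and persisting `daily` alongside the tick data trades a small amount of storage for a large saving in repeated query cost.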
Additionally, to ensure the data is using the storage effectively, you should leverage the attribute and threading functionality within kdb+ for the common use cases. Using the same dataset as before, the graph below shows the impact a simple attribute has on the same query as above. Adding an attribute to one column results in a massive improvement, even when the data is stored in non-optimized partitions.
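As a sketch of applying such an attribute on disk (the `/db` path is hypothetical): the parted (`` `p# ``) attribute on the `sym` column lets the HDB jump straight to each instrument's contiguous block instead of scanning the partition. `.Q.dpft` applies it automatically at write time; it can also be set retrospectively, provided the column is already sorted by `sym`.

```q
/ set the parted attribute on the on-disk sym column of one partition
/ (the data must already be sorted by sym within the partition)
@[`:/db/2024.01.02/trade;`sym;`p#]
```

This is the kind of one-line change behind the improvement shown in the graph.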
In conclusion, to make the most of kdb+ and its unparalleled speed and flexibility, it is critical to spend time during the discovery and design phase understanding your data and how it is going to be used.
- Get the schema right.
- Learn how and what the data is going to be used for.
- Understand the latency and throughput requirements.
- Ensure the storage strategy is optimal.
If you're being told that you don't need to worry about this stuff at the start of the project, be very wary – you might be talking to someone who intends to maximize the cost of the engagement.