Top 6 things about daily market data that even institutional traders get wrong
At Databento, we serve a large number of institutional trading firms. Over time, we've seen recurring mistakes and misconceptions about market data among our customers, even among very sophisticated quant researchers and traders.
Surprisingly, this includes the simplest form of market data: plain daily frequency data.
This article details 6 common pitfalls.
1. Official is not always better
An exchange's official daily statistics (like high, low, settlement, open interest, and volume) don't necessarily match the numbers you could derive yourself from its electronic trading session.
2. Why official data doesn’t match
Many trading venues have semi-discretionary steps involved in computing official statistics. Or they may use a methodology that's hard to replicate in a backtest: Do they double count? Do they include the legs of related spreads? Do they include OTC or block trades?
3. Consistency
Most data vendors only give you the official daily volumes, with no transparency into how they're derived. In most cases, if you take the same vendor's tick history and sum up the volume, it won't match.
You can see how this creates a problem. For example, it's common to build execution algos whose signals are sampled at intervals of some percentage of ADV, but ADV can differ wildly depending on whether it's computed from intraday data from the electronic session or taken from the official statistics. This is why Databento separately provides official exchange volume and volume derived from tick data, as the following snippet shows:
import databento as db

client = db.Historical()

# Official daily settlement prices
# (stat_type 3; see https://databento.com/docs/knowledge-base/new-users/fields-by-schema/statistics-statistics?historical=python&live=python)
STAT_TYPE_SETTLEMENT_PRICE = 3

df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
print(df[df['stat_type'] == STAT_TYPE_SETTLEMENT_PRICE])

# Open, high, low, close, and volume derived from intraday data
df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='ohlcv-1d',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
print(df['close'])
4. Data integrity
There’s value in computing your own summary statistics from intraday data and comparing them to the official numbers as an integrity check.
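As a minimal sketch of such a check, building on the snippet above: sum the per-trade sizes from the trades schema and compare the total against the exchange's cleared volume statistic. The stat_type value of 6 for cleared volume is an assumption to verify against the statistics schema docs linked above; column names follow Databento's documented schemas.

import databento as db

client = db.Historical()

# Exchange-published cleared volume
# (stat_type 6 per the statistics schema docs; verify against the link above)
STAT_TYPE_CLEARED_VOLUME = 6

stats = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
official_volume = stats.loc[
    stats['stat_type'] == STAT_TYPE_CLEARED_VOLUME, 'quantity'
].iloc[-1]

# Volume summed from individual trades in the electronic session
trades = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='trades',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
derived_volume = trades['size'].sum()

# A persistent gap between these two numbers is worth investigating
# before trusting either one.
print(f'official: {official_volume}, derived: {derived_volume}')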
5. Point-in-time
Timestamping precision matters even for daily data. Even statistics that are computed automatically may be published with significant variation in timing.
For example, closing or settlement prices aren't guaranteed to be published soon after the close. We've seen settlement prices published anywhere from a minute to more than 30 minutes after the time the platform's documentation says they should be.
Nick Macholl at Databento made a useful visualization of how CME statistics can get published with significant time variation, using the nanosecond timestamps included in Databento's daily settlement data.
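You can measure this variation yourself from the statistics schema, since each record carries both a reference time (ts_ref) and a receive timestamp. A minimal sketch, assuming to_df() converts ts_ref to a timestamp and indexes the frame by the receive timestamp:

import databento as db

client = db.Historical()

STAT_TYPE_SETTLEMENT_PRICE = 3

df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-01',
    end='2024-04-06',
).to_df()

settles = df[df['stat_type'] == STAT_TYPE_SETTLEMENT_PRICE]

# The index is the receive timestamp; ts_ref is the reference time the
# statistic applies to. The gap between them shows how long after the
# reference time each settlement actually arrived.
print(settles.index - settles['ts_ref'])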
6. Lookahead effects
If you're backtesting on daily settlement prices, insufficient timestamping precision on your daily data can create unexpected lookahead effects, e.g. assuming that you'll have the settlement price in hand immediately at the close.
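One way to guard against this, sketched below with hypothetical timestamps and prices: treat the receive timestamp of each settlement statistic, rather than the official close, as the moment the price becomes usable, and align it to your decision times with an as-of join.

import pandas as pd

# Hypothetical inputs: one settlement per session, keyed by the time the
# statistic was actually received (e.g. ts_recv from the statistics query above)
settles = pd.DataFrame(
    {
        'available_at': pd.to_datetime(
            ['2024-04-01 21:05:13', '2024-04-02 21:31:02'], utc=True
        ),
        'settlement': [5283.00, 5247.25],  # hypothetical prices
    }
)
# Times at which the backtest wants to act
decisions = pd.DataFrame(
    {
        'ts': pd.to_datetime(
            ['2024-04-01 21:00:00', '2024-04-02 21:10:00', '2024-04-02 22:00:00'],
            utc=True,
        ),
    }
)

# merge_asof picks the latest settlement received at or before each decision
# time, so a price published 31 minutes after the close can't leak into a
# decision made at the close.
aligned = pd.merge_asof(
    decisions.sort_values('ts'),
    settles.sort_values('available_at'),
    left_on='ts',
    right_on='available_at',
)
print(aligned)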