Top 6 things about daily market data that even institutional traders get wrong
At Databento, we serve a large number of institutional trading firms. Over time, we've seen recurring mistakes and misconceptions about market data among our customers, even among very sophisticated quant researchers and traders.
Surprisingly, this includes the simplest form of market data: plain daily frequency data.
This article details 6 common pitfalls.
1. Official is not always better
An exchange's official daily statistics (like high, low, settlement, open interest, and volume) don't necessarily match the numbers you could derive yourself from its electronic trading session.
2. Why official data doesn’t match
Many trading venues have semi-discretionary steps involved in computing official statistics. Or they may use a methodology that's hard to replicate in a backtest: Do they double count? Do they include the legs of related spreads? Do they include OTC or block trades?
3. Consistency
Most data vendors only give you the official daily volumes, with no transparency into how they're derived. In most cases, if you take the same vendor's tick history and sum up the volume, it won't match.
You can see how this creates a problem. For example, it's common to build execution algos whose signals are sampled at intervals of some percentage of ADV, but ADV can differ wildly depending on whether it's computed from intraday data from the electronic session or taken from the official statistics. This is why Databento separately provides official exchange volume and volume derived from tick data, as the following snippet shows:
import databento as db

client = db.Historical()

# Official daily settlement prices
# (stat_type 3; see https://databento.com/docs/knowledge-base/new-users/fields-by-schema/statistics-statistics?historical=python&live=python)
STAT_TYPE_SETTLEMENT_PRICE = 3

df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
print(df[df['stat_type'] == STAT_TYPE_SETTLEMENT_PRICE])

# Open, high, low, close, and volume derived from intraday data
df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='ohlcv-1d',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
print(df['close'])
4. Data integrity
There’s value in computing your own summary statistics from intraday data and comparing them to the official numbers as an integrity check.
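As a minimal sketch of such a check, building on the snippet above: sum the per-trade sizes from the trades schema and compare the total against the exchange's cleared volume statistic. The stat_type value of 6 for cleared volume is an assumption to verify against the statistics schema docs linked above; column names follow Databento's documented schemas.

import databento as db

client = db.Historical()

# Exchange-published cleared volume
# (stat_type 6 per the statistics schema docs; verify against the link above)
STAT_TYPE_CLEARED_VOLUME = 6

stats = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
official_volume = stats.loc[
    stats['stat_type'] == STAT_TYPE_CLEARED_VOLUME, 'quantity'
].iloc[-1]

# Volume summed from individual trades in the electronic session
trades = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='trades',
    symbols=['ESM4'],
    start='2024-04-02',
    end='2024-04-03',
).to_df()
derived_volume = trades['size'].sum()

# A persistent gap between these two numbers is worth investigating
# before trusting either one.
print(f'official: {official_volume}, derived: {derived_volume}')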
5. Point-in-time
Timestamping precision matters even for daily data. Even statistics that are computed automatically may be published with significant variation in timing.
For example, closing or settlement prices aren't guaranteed to be published soon after the close. We've seen settlement prices published anywhere from a minute to more than 30 minutes after the time the platform's documentation says they should be.
Nick Macholl at Databento made a useful visualization of how CME statistics can get published with significant time variation, using the nanosecond timestamps included in Databento's daily settlement data.
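You can measure this variation yourself from the statistics schema, since each record carries both a reference time (ts_ref) and a receive timestamp. A minimal sketch, assuming to_df() converts ts_ref to a timestamp and indexes the frame by the receive timestamp:

import databento as db

client = db.Historical()

STAT_TYPE_SETTLEMENT_PRICE = 3

df = client.timeseries.get_range(
    dataset='GLBX.MDP3',
    schema='statistics',
    symbols=['ESM4'],
    start='2024-04-01',
    end='2024-04-06',
).to_df()

settles = df[df['stat_type'] == STAT_TYPE_SETTLEMENT_PRICE]

# The index is the receive timestamp; ts_ref is the reference time the
# statistic applies to. The gap between them shows how long after the
# reference time each settlement actually arrived.
print(settles.index - settles['ts_ref'])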
6. Lookahead effects
If you're backtesting on daily settlement prices, insufficient timestamping precision on your daily data can create unexpected lookahead effects, e.g. assuming that you'll have the settlement price in hand immediately at the close.
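One way to guard against this, sketched below with hypothetical timestamps and prices: treat the receive timestamp of each settlement statistic, rather than the official close, as the moment the price becomes usable, and align it to your decision times with an as-of join.

import pandas as pd

# Hypothetical inputs: one settlement per session, keyed by the time the
# statistic was actually received (e.g. ts_recv from the statistics query above)
settles = pd.DataFrame(
    {
        'available_at': pd.to_datetime(
            ['2024-04-01 21:05:13', '2024-04-02 21:31:02'], utc=True
        ),
        'settlement': [5283.00, 5247.25],  # hypothetical prices
    }
)
# Times at which the backtest wants to act
decisions = pd.DataFrame(
    {
        'ts': pd.to_datetime(
            ['2024-04-01 21:00:00', '2024-04-02 21:10:00', '2024-04-02 22:00:00'],
            utc=True,
        ),
    }
)

# merge_asof picks the latest settlement received at or before each decision
# time, so a price published 31 minutes after the close can't leak into a
# decision made at the close.
aligned = pd.merge_asof(
    decisions.sort_values('ts'),
    settles.sort_values('available_at'),
    left_on='ts',
    right_on='available_at',
)
print(aligned)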