Difficulties of Implementing a Data Downloader
HTTP-based APIs (like Tushare and AKShare) are very convenient, but for stock research, backtesting, or auto-trading they have some fatal drawbacks:
- Network latency: the latency accumulates to an unacceptable level when a computation involves many queries.
- API limits: most public data API endpoints impose tight limits on how much data can be fetched and how often.
To overcome these difficulties, one solution is to cache the data in your own private database.
Nature of Web-based APIs
All web-based APIs share the same characteristics:
1. Long RTT (round-trip time)
2. Low allowed request frequency
3. Limited number of records returned per request

- To overcome 1, the solution is parallelization and caching.
- To overcome 2, the solution is batching and caching.
- To overcome 3, we batch with a bounded batch size (see the sketch below).
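Here is a minimal sketch of the caching, batching, and throttling parts (parallelization is omitted), assuming a hypothetical `fetch_batch(symbols)` provider call that returns a `{symbol: data}` mapping and an in-memory dict as the cache; a real downloader would persist the cache in a database:

    import time

    BATCH_SIZE = 100      # stay under the provider's per-request record limit
    MIN_INTERVAL = 0.2    # at most 5 queries per second; tune to the provider's limit
    cache = {}            # symbol -> data fetched earlier (the "private database")

    def fetch_all(symbols, fetch_batch):
        """Fetch `symbols` via the hypothetical `fetch_batch` call, skipping
        cached symbols, batching requests, and throttling request frequency."""
        missing = [s for s in symbols if s not in cache]
        last_call = 0.0
        for i in range(0, len(missing), BATCH_SIZE):
            batch = missing[i:i + BATCH_SIZE]
            wait = MIN_INTERVAL - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)                   # crude request-frequency throttle
            last_call = time.monotonic()
            cache.update(fetch_batch(batch))       # fetch_batch returns {symbol: data}
        return {s: cache[s] for s in symbols if s in cache}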
Nature of Stock Data APIs
Most stock data APIs can be categorized as follows:
API params \ Data indexed by | symbol & timestamp |
---|---|
symbol or timestamp | depending on the size of the data |
symbol(s) | by symbol |
timestamp / timerange | by timerange |
So a stock data API can be defined as:

    from datetime import timedelta

    # by_time / by_symbol are the provider's fetch callables (None if unsupported)
    API = {
        "biz_key": "tushare_daily",       # identifier of the API endpoint
        "fetch_methods": {
            "by_time": by_time,           # None | Callable(start, end): fetch every symbol in [start, end]
            "by_symbol": by_symbol,       # None | Callable(symbols, start, end): fetch the given symbols
        },
        "limit": {
            "qps": 10,                    # queries per second allowed
            "rpq": 6000,                  # records returned per query
        },
        "frequency": timedelta(days=1),   # time resolution of the data (daily here)
    }
Variety of data schemas
For stock data there are many endpoints with many different schemas, and it is not realistic to replicate that many APIs one by one. Instead, we can store each data record directly as a JSON object, which decouples the fetching stage from the preprocessing stage.
Most stock data types are indexed by two dimensions: time and symbol. The symbol can be a stock, a fund, an index, or something else.
So each record in the table would be:
`(biz_key, symbol, timestamp, data)` or `(biz_key, symbol, timerange, data)`
A `timestamp` can be represented by a `timerange` as long as we state that the left endpoint of the timerange is the timestamp being represented, so each record in the table becomes:

`record = (biz_key, symbol, timerange, data)`

where `biz_key` is the identifier of the API.
With this schema, we can easily adapt to many different APIs from many different data providers.
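A sketch of this storage layout using SQLite (the table and column names here are illustrative, not prescribed by anything above); the timerange is stored as its two endpoints, with `time_start` playing the role of the timestamp:

    import json
    import sqlite3

    conn = sqlite3.connect("raw_data.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_data (
            biz_key    TEXT NOT NULL,   -- identifier of the API
            symbol     TEXT NOT NULL,   -- stock / fund / index code
            time_start TEXT NOT NULL,   -- left endpoint of the timerange (the "timestamp")
            time_end   TEXT NOT NULL,   -- right endpoint of the timerange
            data       TEXT NOT NULL,   -- the raw record, stored as JSON
            PRIMARY KEY (biz_key, symbol, time_start)
        )
    """)

    def insert_record(biz_key, symbol, time_start, time_end, record):
        """Store one fetched record without caring about its schema."""
        conn.execute(
            "INSERT OR REPLACE INTO raw_data VALUES (?, ?, ?, ?, ?)",
            (biz_key, symbol, time_start, time_end, json.dumps(record)),
        )
        conn.commit()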
API abuse: meaningless requests
A data request usually looks like this:
`request = (biz_key, symbol_set, timerange)`
But when most of the data in the request has already been fetched, fetching everything the request specifies again would be wasteful. What's more, the underlying API might be rate-limited or limited in how many records it returns per query, which further complicates the problem.
Solution 1: brute force
Assume that the data provider offers both fetch-by-time and fetch-by-symbol methods, and that the whole dataset is small enough that both storing it and fetching it take an acceptable amount of time. Then we just need to:

- assume that every symbol already present is up to date as of the last update time,
- fetch the full history of the missing symbols up to the last update time,
- and finally fetch all data from the last update time to now, by time (see the sketch below).
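A sketch of this brute-force sync, assuming the API descriptor from above and three hypothetical helpers around the local store: `known_symbols`, `last_update_time`, and `save` (these are placeholders, not anything defined earlier):

    def brute_force_sync(api, all_symbols, now):
        """Bring the local store up to date using only two kinds of fetches."""
        last_update = last_update_time(api["biz_key"])   # hypothetical: when we last synced
        missing = set(all_symbols) - set(known_symbols(api["biz_key"]))

        # 1. symbols never seen before: fetch their full history up to last_update
        #    (None as the start is taken to mean "from the beginning")
        if missing:
            save(api["fetch_methods"]["by_symbol"](sorted(missing), None, last_update))

        # 2. everything already present is assumed up to date as of last_update,
        #    so a single fetch-by-time pass covers the remaining gap for all symbols
        save(api["fetch_methods"]["by_time"](last_update, now))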
Solution 2: fine-grained control
Remember our purpose: we want to avoid abusing the API. The target is not to eliminate every meaningless request, but to cut out as many meaningless requests as possible. The way to do that is to merge the underlying data requests. But before we can merge them, we first have to find out which data is actually missing.
Tracking the missing data
One solution is, for each `(biz_key, symbol)`, to maintain a table `raw_data_coverage` recording the time ranges already fetched. Every time we bulk-write the `raw_data` table, we also bulk-update `raw_data_coverage`. And every time we update the coverage table, we spend O(S) time calculating the indexes of the missing records, where S is the size of the given symbol set.
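A sketch of that calculation, with the coverage table modelled as a plain dict mapping `(biz_key, symbol)` to a list of covered `(start, end)` ranges (in practice this would be a query against `raw_data_coverage`):

    def missing_ranges(coverage, biz_key, symbols, start, end):
        """For each requested symbol, return the sub-ranges of [start, end]
        that are not covered yet; one linear scan per symbol, so roughly O(S)
        when each symbol's coverage list is short."""
        result = {}
        for symbol in symbols:
            covered = sorted(coverage.get((biz_key, symbol), []))
            gaps, cursor = [], start
            for c_start, c_end in covered:
                if c_start > cursor:
                    gaps.append((cursor, min(c_start, end)))
                cursor = max(cursor, c_end)
                if cursor >= end:
                    break
            if cursor < end:
                gaps.append((cursor, end))
            if gaps:
                result[symbol] = gaps
        return result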
There are some corner cases. Some data is missing by its nature: the market is closed on that day, the company is delisted, not yet listed, suspended, and so on. So the calculated set is a superset of the truly missing data. The analysis and a solution can be found below.
The merging direction
Different APIs suit different merging directions. If the API is a crawler that crawls the detail page of one specific symbol, we should merge along the time direction; if the API can fetch several symbols in one batch, merging along the symbol direction is also a good choice. As long as the stride stays under the limit specified by the API, merge all requests.
But which direction is better? My solution is: compute both and compare. The complexity is just O(T) + O(S), so the computing cost is much lower than the O(TS) fetching cost.
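One way this comparison could look, with the remaining holes flattened into `(symbol, timestamp)` cells (the counting is illustrative; actual request construction and stride-limit handling are left out):

    from collections import defaultdict

    def choose_merge_direction(missing_cells, max_batch):
        """missing_cells: iterable of (symbol, timestamp) pairs still to fetch.
        Estimate how many API calls each merging direction needs and pick
        the cheaper one; max_batch is the per-request record limit."""
        by_symbol = defaultdict(set)   # symbol -> timestamps missing for it
        by_time = defaultdict(set)     # timestamp -> symbols missing at it
        for symbol, ts in missing_cells:
            by_symbol[symbol].add(ts)
            by_time[ts].add(symbol)

        # merge along the time direction: one or more requests per symbol
        cost_time_dir = sum(-(-len(v) // max_batch) for v in by_symbol.values())
        # merge along the symbol direction: one or more requests per timestamp
        cost_symbol_dir = sum(-(-len(v) // max_batch) for v in by_time.values())
        return "time" if cost_time_dir <= cost_symbol_dir else "symbol"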
One corner case is missing data that is distributed evenly across symbols and time, which would trigger a re-fetch of data we already have. But since requests are usually local in both the symbol and the time dimension, this case should be rare.
Known missing data
Some data simply does not exist in certain time ranges. If we ignore this nature of stock data, we will end up with a fragmented `raw_data_coverage` table. Until now, the semantics of the coverage table have been: the time ranges the data covers. But to avoid fragmentation, and to make the table better serve its purpose of avoiding duplicate API calls, let us shift the semantics to: the time ranges we should not fetch again.
But how and when do we declare these time ranges? There are two approaches: a strict one and a lax one. With the strict approach, we need a trustworthy source that tells us which ranges have no data. With the lax approach, we simply black out every range we have already tried that returned empty data.
The latter is more practical: if no data is returned, just assume the data does not exist. Historical data logically will not be updated in the future, so missing data stays missing.
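A sketch of the lax approach, reusing the `coverage` mapping from the earlier sketch and a hypothetical `fetch` callable; an empty response still marks the range as covered, so it is never requested again:

    def fetch_and_record(coverage, fetch, biz_key, symbol, start, end):
        """Fetch one (symbol, timerange) request and, whether or not any data
        comes back, record the range as "do not fetch again"."""
        records = fetch([symbol], start, end)      # hypothetical provider call
        # even an empty result blacks out the range: data missing from the past
        # is assumed to stay missing
        coverage.setdefault((biz_key, symbol), []).append((start, end))
        return records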