February 9, 2011
Thank you for the opportunity to comment further on the SEC's Consolidated Audit Trail (CAT) proposal. This note is less directed towards the actual proposal put forth, and more towards reasonable next steps. I appreciate that the proposal you presented is really just phase 1 of a system to get a real-time sense of the market on days when unique and potentially catastrophic events are transpiring rapidly, and that you plan to address these analytic issues in the future.
I think one of the first things to be done is to create a sample set of prototypical queries and sample responses: assuming you have the data as put forth in the proposal, what sort of things do you want to glean from it? In designing the repository for CAT (real-time trade and quote) data, you really need to know the types of informational requests that this repository might reasonably expect to see. A dozen (or fewer) sample queries would be a good place to start. In addition, you should define what "real-time" means in the context of a query response. (If the query returns, say, 200,000 records -- which might be every quote for an active stock on an exchange on a given day -- should this take 1 second or 1 hour?)
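To make this concrete, here is one way a prototypical query might be expressed, sketched in Python against a hypothetical record layout (the field names `symbol`, `exchange`, `ts`, `bid`, and `ask` are my own illustration, not anything from the proposal):

```python
from dataclasses import dataclass

@dataclass
class Quote:
    # Hypothetical CAT quote record; the field names are illustrative only.
    symbol: str
    exchange: str
    ts: float      # seconds since midnight
    bid: float
    ask: float

def quotes_for(records, symbol, exchange, start_ts, end_ts):
    """Prototypical query: every quote for one stock on one exchange
    within a time window -- exactly the kind of request whose
    response-time target ("real-time"?) should be pinned down up front."""
    return [q for q in records
            if q.symbol == symbol and q.exchange == exchange
            and start_ts <= q.ts < end_ts]

# Tiny illustration with made-up data:
sample = [
    Quote("XYZ", "NYSE", 34200.0, 10.00, 10.02),
    Quote("XYZ", "NSDQ", 34200.5, 10.01, 10.03),
    Quote("XYZ", "NYSE", 34201.0, 10.01, 10.02),
]
morning = quotes_for(sample, "XYZ", "NYSE", 34200.0, 34260.0)
```

On an active stock this same query might return hundreds of thousands of records, which is why the latency target matters as much as the query itself.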
(The CAT proposal indicates that you have been doing some good thinking about the types of answers the system will need to be able to come up with; please just make these explicit.)
Once you have these sample queries in hand, you can then begin to design the layout of the data in the repository: the repository architecture should reflect the query workload. And I believe you should own this repository, and not outsource it to some third party. When systemically important events occur, you need to have the people who designed the query engine work closely with those who maintain the repository. You need to be capable of fighting the "next war," and so you might be required to query the data in ways that were not imagined before. If this occurs, you need immediate access to all the data without any intermediary, to allow you to make "battlefield" enhancements to the database, perhaps to optimize query response.
BTW, "owning the data" does _not_ necessarily mean building a huge datastore. I estimate CAT data will take up on the order of 10TB/day, or 2.5 PB/year. You can store this in an off-the-shelf cloud-based storage system for well under $10M/year, which is less than 1% of the $2B-4B cost that you believe the industry might spend to collect the data. And if you do decide to build a petabyte storage facility yourself, you could do it for roughly a tenth of that (and this would be a one-time cost, not per year).
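The arithmetic behind those figures, with my assumptions spelled out (roughly 250 trading days per year, and an illustrative cloud price of $0.02 per GB-month; the actual rate will vary by vendor and storage tier):

```python
TB = 1_000                       # GB per TB
PB = 1_000_000                   # GB per PB

daily_gb = 10 * TB               # ~10 TB/day of CAT data (my estimate)
trading_days = 250               # assumed trading days per year
yearly_gb = daily_gb * trading_days
assert yearly_gb == 2.5 * PB     # 2.5 PB accumulated per year

price_per_gb_month = 0.02        # assumed cloud price, USD per GB-month
# Upper bound: hold the entire year's data for all 12 months.
yearly_cost = yearly_gb * price_per_gb_month * 12
# Even this upper bound comes in well under the $10M/year figure.
```

Even if the per-GB price assumption is off by several multiples, the conclusion holds: storage is a rounding error next to the industry's collection costs.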
Also, I estimate that an integrated analysis system combining bespoke software for first-cut filtering of data from the repository, along with commercial off-the-shelf (COTS) software for detailed analysis, could be developed for less than $10M. (I assume the queries supported would be relatively simple for the first version; a simple record store to support this query system should be very responsive.) And while you're in there, you could develop an NBBO (national best bid and offer) database for relatively little marginal expense.
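As a sketch of what that NBBO piece involves (the tuple layout below is my own illustration, not the proposal's schema): given a snapshot of top-of-book quotes across exchanges, the NBBO is simply the highest bid and the lowest offer among them.

```python
def nbbo(quotes):
    """Compute the national best bid and offer from a snapshot of
    per-exchange quotes. `quotes` is a list of (exchange, bid, ask)
    tuples -- an illustrative layout only. Best bid is the highest
    bid across venues; best offer is the lowest ask."""
    best_bid = max(q[1] for q in quotes)
    best_ask = min(q[2] for q in quotes)
    return best_bid, best_ask

# Made-up snapshot across three venues:
snapshot = [("NYSE", 10.00, 10.03), ("NSDQ", 10.01, 10.04), ("BATS", 9.99, 10.02)]
bid, ask = nbbo(snapshot)  # (10.01, 10.02)
```

The computation itself is trivial; the expense is in ingesting and time-aligning the per-exchange quote feeds, which the CAT repository would already be doing.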
Anyway, best of luck with your endeavors.