
Policy Challenges and Research Opportunities in the Era of Big Data

Big Data and High-Performance Computing for Financial Economics, National Bureau of Economic Research, Cambridge, MA

July 13, 2019

I am pleased to have the opportunity to speak at this National Bureau of Economic Research (NBER) conference on big data and high-performance computing. Before I begin my remarks, I need to mention that the views that I express today are my own and do not necessarily reflect the views of the Commission or its staff.[1]

1. Introductory Story

The term "big data" is new, but the underlying phenomenon is anything but new, and it is certainly not unique to financial economics. Consider, for example, the U.S. census, which is taken every ten years as required by the U.S. Constitution. It is a seemingly simple task to count people and report demographic information such as marital status and family size. Yet by 1870, the quickly expanding U.S. population hampered the ability of the Census Office to tabulate results effectively. In fact, the 1880 census, which was hand-counted, took nearly ten years to complete. In other words, the 1880 census involved big data. Herman Hollerith saw the opportunity and left the Census Office before the 1890 census to develop a machine that could count and tabulate the results. His machine was tested in 1887, and it was quickly leased by the Census Office for the 1890 census. His success in 1890 led to contracts with foreign governments and private companies. Hollerith machines were used in 1891 for censuses of Canada, Norway, and Austria, and railroad companies used them to calculate fare information. In other words, Hollerith machines efficiently solved many important big-data problems of the day.[2]

Today, 150 years later, where do we stand? We stand on mountains of data that are inconceivably larger. By some estimates, the world generates more data every two days than all of humanity generated from the dawn of time to the year 2003.[3] How much data is generated by or for the SEC? One easy answer is that the SEC's Electronic Data Gathering, Analysis, and Retrieval system (or EDGAR) receives and processes about 2 million filings a year. But those filings are themselves complex documents, many of which contain scores of pages, numerous attachments, and many thousands of pieces of information.

What is big data? I think it is kind of like old age: anyone older than me is old, and any data set bigger than my computer system can process is big. What does "big" mean to the SEC? The SEC processes and maintains several big data sets. One example is the Options Price Reporting Authority data, or OPRA data. One day's worth of OPRA data is roughly two terabytes.[4]
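To put two terabytes a day in perspective, the back-of-the-envelope sketch below converts that figure into an average data rate and an annual total. Only the two-terabyte figure comes from the text. The decimal terabyte, the 6.5-hour trading session, and the 252 trading days per year are illustrative assumptions, not SEC figures.

```python
# Back-of-the-envelope scale of OPRA data. Only the two-terabytes-per-day
# figure comes from the text; every other constant is an assumption.
BYTES_PER_TERABYTE = 10**12      # assumption: decimal terabyte
OPRA_TERABYTES_PER_DAY = 2       # from the text (approximate)
SESSION_HOURS = 6.5              # assumption: regular trading session length
TRADING_DAYS_PER_YEAR = 252      # assumption: typical U.S. trading calendar


def average_bytes_per_second(terabytes_per_day: float, session_hours: float) -> float:
    """Average data rate if a day's data arrived evenly over the session."""
    return terabytes_per_day * BYTES_PER_TERABYTE / (session_hours * 3600)


def terabytes_per_year(terabytes_per_day: float, trading_days: int) -> float:
    """Total volume accumulated over one year of trading days."""
    return terabytes_per_day * trading_days


rate = average_bytes_per_second(OPRA_TERABYTES_PER_DAY, SESSION_HOURS)
annual = terabytes_per_year(OPRA_TERABYTES_PER_DAY, TRADING_DAYS_PER_YEAR)
print(f"Average rate: about {rate / 10**6:.0f} megabytes per second")
print(f"Annual volume: about {annual:.0f} terabytes")
```

Under these assumptions, a single data feed accumulates hundreds of terabytes a year, which is well beyond what a typical desktop system can store or scan.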

Big data are often characterized by the so-called "three v's," which are volume, velocity, and variety.

Volume is the quantity of the data.

Velocity is the speed at which the data are created and stored.

Variety is the heterogeneity of the data in terms of data type and data format.

To this list of three, some would add a fourth "v," veracity.

Veracity is the quality and accuracy of the data.

2. Policy Challenges

Like the Census Office 150 years ago, the SEC faces a big-data problem today. This leads me to the first question that I want to highlight in this talk: What are the policy challenges that stem from big data at the SEC?

Let me begin by reminding you that the mission of the SEC is to (1) protect investors; (2) maintain fair, orderly, and efficient markets; and (3) facilitate capital formation. I see several big-data policy challenges in light of the SEC's three-fold mission.

Security

Let me begin with security, which is a primary concern of the SEC. The volume, velocity, and variety of big data make security particularly challenging for several reasons. First, big data are harder to store and maintain. For example, it is harder to ensure that only the right people at only the right time have access to only the right data. Second, big data are bigger targets for bad actors. For example, portfolio holdings data for all investment advisors are more valuable than portfolio holdings data for one investment advisor, and weekly portfolio holdings data are more valuable than annual portfolio holdings data. These challenges get harder as certain data sets start to include more personally identifiable information (PII) or identifiers that link investors and institutions within and across data sets.

The SEC must be mindful of the data it collects and its sensitive nature, and the SEC must be a principled, responsible user of that data. Naturally, data collection is not an end unto itself—the SEC must not be in the business of ill-defined and indefinite data warehousing. For these reasons, the SEC continues to look into whether it can reduce the data it collects or reduce its sensitivity. One example of this is the SEC's approach to Form N-PORT, which is a new form for reporting both public and non-public fund portfolio holdings to the SEC. The Commission recently modified the submission deadlines for this information in order to reduce the volume of sensitive information held by the SEC. This simple change reduced the SEC's cyber risk profile without affecting the timing or quantity of information that is made available to the public.[5]

Technology

Another policy challenge is technology. First, the potential trading gains from having computer systems and other technologies that are even just a little faster and smarter than the competition are enormous. Thus, there is a technology arms race between trading firms that are striving to get the best technology and the best personnel. The media regularly reports about institutions that are increasing their use of AI, machine learning, and related tools.[6] However, there may be fixed costs to the deployment of these technologies that exclude small, fragmented, or less well-resourced investors.

Second, there are cultural differences between organizations that affect not just the choice of which technology to deploy but also the timing of deployment. For example, hedge funds might be able to adopt new technologies such as cloud computing more quickly than pension funds are able to do so.

Third, some technologies are inherently challenging for the SEC to monitor. To mention just one example, consider artificially intelligent algorithmic trading (AI algo trading), in which algorithms can trade over time in unpredictable ways. Suppose an AI algo eventually starts spoofing without the knowledge of the algo creator. (Spoofing is a prohibited activity that involves placing and then cancelling a large number of orders in an attempt to convey false information about market demand.) How should the SEC respond to that?

And speaking of fast-moving technology, how does the SEC develop or attract a workforce that not only sees and understands the current state of the art but can also envision and prepare for the future? The SEC has prioritized and supported the development of a workforce with big data skills and experience. Over the last 10 years, the headcount of the SEC's Division of Economic and Risk Analysis (DERA) has grown from a little over 30 people to nearly 150 people today.

Communication

Another big-data policy challenge is communication because the SEC has diverse stakeholders. The SEC focuses on "Main Street" investors, meaning individual, retail investors who typically invest through their 401(k)-style plans. But our stakeholders also include pension funds, municipal bond issuers, brokerage firms, hedge funds, and Congress. The issues surrounding big data are complex and increasingly require specialized training to understand. So it is challenging to communicate the essential parts of these markets to each group of stakeholders. Indeed, one size does NOT fit all.

While I am talking about communication, I would like to mention an important detail about the Herman Hollerith story. A key insight into the census data problem was the realization that the variety of the data could be dramatically reduced by requiring the data to be transcribed onto what we would now call punch cards. With all of the data in one standardized form, it was relatively easy to build a machine that could tabulate the information. This principle still holds true today. For example, the SEC has required filers to tag some data using methods such as XML, FIX, FpML, XBRL, and, more recently, Inline XBRL. By dramatically reducing the variety of the data, tagging turns an electronic document that is only human readable into one that is also machine readable. A perennial challenge for the SEC is to find cost-effective ways to reduce the variety of financial data without loss of substantive information.

An additional feature of data tagging is network effects. It is well known that data in tagged 10-Ks can be linked to data from other forms and other firms. Perhaps it is less appreciated that data in tagged documents could be linked across regulatory boundaries and even national boundaries, provided that the regulatory community required similar data tagging. For the SEC, a key benefit of cross-regulator consistency in tagged data is the ability to understand better the nature of the risks in the financial markets. The markets today do not stop at national borders, so looking only at intra-national data provides only a partial picture of the system's risk.

3. Research Opportunities

The second key question of my talk is about research opportunities in the era of big data.

I see many research opportunities for DERA's financial economists, for academics, for industry, and for anyone who values financial data. Broadly, I see opportunities based on large databases that are available now or that might be available in the near future. I also see even more opportunities based on changes that are being made to existing data sources. In addition to a myriad of academic questions, big data will continue to help the SEC and other market regulators identify and shut down bad actors.

In addition to the OPRA database that I have already mentioned, I would like to highlight an additional database. Afterwards, I will highlight two other areas that will open doors for new research opportunities.

The Consolidated Audit Trail

On July 11, 2012, the Commission voted to adopt Rule 613 under Regulation NMS. This was a significant mile marker along the path to create and implement the consolidated audit trail (CAT). When completed by the self-regulatory organizations (SROs), the CAT will provide a single, comprehensive database enabling regulators to track more efficiently and thoroughly all trading activity in equities and options throughout the U.S. markets. This will transform the market surveillance and enforcement functions of regulators. For example, regulators will be able to track the activity of a single individual trading in multiple markets across multiple broker-dealers. The CAT will not be available to academics or industry for research, marketing, or other purposes.[7]

Standardized Structured Languages

The three v's of big data are volume, velocity, and variety. It is hard to imagine that future finance data sets will have less volume or less velocity than they do today. So perhaps the best way to make future data sets more manageable is to mimic Herman Hollerith's census solution by attacking variety.

Since the mid-1990s, most SEC documents have been submitted to EDGAR. Although the submissions are electronic and can be easily read by a human on any computer, they are not machine readable because they are essentially unstructured, electronic paper. Not only did the content change across filers and across time, but so, too, did the format: plain text, HTML, PDF, and others. Subsequent initiatives by the SEC have made it easier for people, machines, and regulators to read and understand the disclosures on EDGAR.

An important milestone was reached on May 7, 2003, when the SEC adopted its initial requirement to file Forms 3, 4, and 5 using the eXtensible Markup Language (or XML).[8] I believe that the structuring of these forms in XML lowered access costs and analytical costs, making this information more valuable to the market. Since 2003, many more forms have come to be submitted in XML, FIX, FpML, XBRL, and, most recently, Inline XBRL.

Structuring disclosures so that they are machine readable facilitates easier access and faster analyses that can improve investor decision-making and reduce the ability of filers to hide fraud. Structured information can also assist in automating regulatory filings and business information processing. In particular, by tagging the numeric and narrative-based disclosure elements of financial statements and risk/return summaries in XBRL, those disclosure items are standardized and can be immediately processed by software for analyses. This standardization allows for aggregation, comparison, and large-scale statistical analyses that are less costly and more timely for data users than if the information were reported in an unstructured format.[9] Structured data will likely drive future research in corporate finance and macroeconomics.
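The benefit of tagging can be sketched in a few lines of code. In the minimal example below, two disclosures are tagged in a simple XML layout, and software extracts and aggregates the facts without any human reading. The element names, attribute names, filer names, and values are all invented for illustration. They do not follow any actual SEC schema or XBRL taxonomy.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tagged disclosures. The element names and values are invented
# for illustration and do not follow any actual SEC schema or XBRL taxonomy.
FILINGS = [
    """<filing><filer>Alpha Corp</filer>
         <fact name="Revenue" unit="USD">1200</fact>
         <fact name="NetIncome" unit="USD">150</fact></filing>""",
    """<filing><filer>Beta Inc</filer>
         <fact name="Revenue" unit="USD">800</fact>
         <fact name="NetIncome" unit="USD">-40</fact></filing>""",
]


def extract_facts(xml_text: str) -> tuple[str, dict[str, float]]:
    """Return the filer name and a mapping of tagged fact names to values."""
    root = ET.fromstring(xml_text)
    filer = root.findtext("filer")
    facts = {fact.get("name"): float(fact.text) for fact in root.iter("fact")}
    return filer, facts


def aggregate(filings: list[str]) -> dict[str, float]:
    """Sum each tagged fact across all filers."""
    totals: dict[str, float] = defaultdict(float)
    for xml_text in filings:
        _, facts = extract_facts(xml_text)
        for name, value in facts.items():
            totals[name] += value
    return dict(totals)


totals = aggregate(FILINGS)
print(totals)
```

Because every filer uses the same tags, adding a third or a ten-thousandth filing requires no change to the code. The same comparison across unstructured text, HTML, and PDF documents would need a separate extraction effort for each format.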

Standardized Identity—the LEI

Another common big-data problem is connecting disparate big data sets accurately and in a timely manner for analyses. This problem is exacerbated by the broad range of identifiers used by federal agencies: the IRS has the Employer Identification Number (EIN); the Federal Reserve has the Research, Statistics, Supervision, and Discount identifier (RSSD ID); FINRA has the Central Registration Depository (CRD) number; and the SEC has the Central Index Key (CIK). A recent report identified 36 federal agencies using up to 50 distinct, incompatible entity identification systems. In my opinion, these differences raise costs and burdens for both federal agencies and their regulated entities.

The Global Legal Entity Identifier (LEI) is a 20-character alpha-numeric code that provides a single unique international identifier enabling accurate identification of legal entities. As such, it offers a single international connector for disparate big data sets while also reducing the current regulatory burden associated with each agency's unique identification system. The LEI includes "level 1" data that serve as corporate business cards. (It answers the question "who is who?") The LEI also includes "level 2" data that show the relationships between different entities. (It answers the question "who owns whom?") The LEI serves as a Rosetta Stone to identify clearly and uniquely firms and entities participating in the global financial markets.
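One practical consequence of a standardized identifier is that software can reject a mistyped code before it contaminates a data set. Under the ISO 17442 standard that defines the LEI, the last two of the 20 characters are check digits computed with the ISO 7064 MOD 97-10 scheme. The sketch below implements that check. The function names and the sample prefix are made up for illustration, and the sample does not correspond to a real entity.

```python
import string

# Map each allowed character to its numeric value: digits keep their value,
# letters become 10 (A) through 35 (Z), as in the MOD 97-10 scheme.
_CHAR_VALUES = {c: str(i) for i, c in enumerate(string.digits + string.ascii_uppercase)}


def _to_number(code: str) -> int:
    """Convert an alphanumeric code to the integer used by MOD 97-10."""
    return int("".join(_CHAR_VALUES[c] for c in code))


def compute_check_digits(base18: str) -> str:
    """Return the two check digits for an 18-character LEI prefix."""
    if len(base18) != 18 or any(c not in _CHAR_VALUES for c in base18):
        raise ValueError("expected 18 characters from 0-9 and A-Z")
    return f"{98 - _to_number(base18 + '00') % 97:02d}"


def is_valid_lei(lei: str) -> bool:
    """Check the length, the alphabet, and the MOD 97-10 remainder."""
    if len(lei) != 20 or any(c not in _CHAR_VALUES for c in lei):
        return False
    if not lei[18:].isdigit():
        return False
    return _to_number(lei) % 97 == 1


# A made-up prefix, used only to exercise the functions; not a real entity.
prefix = "0000EXAMPLEPREFIX1"
lei = prefix + compute_check_digits(prefix)
print(lei, is_valid_lei(lei))
```

A check of this kind catches typing errors only. Confirming that a well-formed code belongs to a registered entity still requires a lookup against the LEI reference data.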

Recently, the Commission released rules that mandate the use of the LEI in connection with security-based swap transactions. The LEI is now a component of mandatory swaps transaction reporting in the U.S., Europe, and Canada. Europe has mandated future LEI usage widely, including in payment and settlement activities as well as structured finance.[10]

I believe that the full benefits of LEI have yet to be realized. As some companies may have hundreds or thousands of subsidiaries or affiliates operating around the world, more benefits lie ahead as the LEI becomes more widely and comprehensively used. The LEI allows more transparency regarding hierarchies and relationship mapping. This will support better analyses of risks as they aggregate and potentially become systemic.[11]
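The kind of risk aggregation that relationship data makes possible can be sketched as follows. Given "who owns whom" links, exposures reported by individual legal entities are rolled up to the ultimate parent of each corporate group. The entity labels, the ownership links, and the exposure amounts are all hypothetical. Short labels stand in for 20-character LEIs to keep the example readable.

```python
from collections import defaultdict

# Hypothetical "level 2" data: each child entity maps to its direct parent.
# Short labels stand in for 20-character LEIs; all values are invented.
DIRECT_PARENT = {
    "SUB_A1": "HOLDCO_A",
    "SUB_A2": "SUB_A1",
    "SUB_B1": "HOLDCO_B",
}

# Hypothetical exposures reported at the level of individual legal entities.
EXPOSURES = {
    "HOLDCO_A": 10.0,
    "SUB_A1": 25.0,
    "SUB_A2": 5.0,
    "SUB_B1": 40.0,
    "STANDALONE_C": 7.5,
}


def ultimate_parent(entity: str, parents: dict[str, str]) -> str:
    """Follow parent links upward until an entity with no parent is reached."""
    seen = {entity}
    while entity in parents:
        entity = parents[entity]
        if entity in seen:
            raise ValueError(f"ownership cycle detected at {entity}")
        seen.add(entity)
    return entity


def roll_up(exposures: dict[str, float], parents: dict[str, str]) -> dict[str, float]:
    """Sum entity-level exposures by ultimate parent."""
    totals: dict[str, float] = defaultdict(float)
    for entity, amount in exposures.items():
        totals[ultimate_parent(entity, parents)] += amount
    return dict(totals)


group_totals = roll_up(EXPOSURES, DIRECT_PARENT)
print(group_totals)
```

In this toy data, no single entity under HOLDCO_A reports more than 25.0, yet the group as a whole carries 40.0. Concentrations of this kind stay hidden until entity-level data can be mapped to a common hierarchy.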

4. Conclusion

I am really looking forward to today's robust discussion of Policy Challenges and Research Opportunities in the Era of Big Data. You are helping expand the future of finance in important ways that will surely have positive effects on markets, investors, and businesses. Thank you.

[1] The Securities and Exchange Commission disclaims responsibility for any private publication or statement of any SEC employee or Commissioner. This speech expresses the author's views and does not necessarily reflect those of the Commission, the Commissioners, or other members of the staff.

[3] International Data Corp (IDC).

[4] See Scott W. Bauguess, The Role of Big Data, Machine Learning, and AI in Assessing Risks: a Regulatory Perspective (June 21, 2017), available at:

[5] See Chairman Jay Clayton, Keynote Remarks at the Mid-Atlantic Regional Conference (June 4, 2019), available at:

[7] See Chairman Jay Clayton, Keynote Remarks at the Mid-Atlantic Regional Conference (June 4, 2019), available at:

[9] See Commissioner Michael S. Piwowar, Remarks at the 2018 RegTech Data Summit—Old Fields, New Corn: Innovation in Technology and Law, available at:

[10] See Commissioner Kara M. Stein, Quality Data and the Power of Prevention: Remarks at Meet the Market, North America, available at:

[11] Id.
