The Role of Machine Readability in an AI World
Scott W. Bauguess
Deputy Chief Economist and Deputy Director, Division of Economic and Risk Analysis
May 3, 2018
SEC Keynote Address:
Financial Information Management (FIMA) Conference 2018
Thank you, Dan [Knight], for the kind introduction.
It is a pleasure to speak today at the 2018 Financial Information Management Conference. For over a decade this forum has been used to advance the usefulness – and use – of data in the financial services industry. And each year the challenges and opportunities of doing so grow. Technological advances are responsible for many recent changes in market methods and practices. Chief among them is the rise of machines in the automation of rote tasks, and increasingly of complicated tasks as well. Because human direction is not explicitly required, the analytical methods underlying the technology have given rise to the concept of machine learning. This has also fueled the notion that artificial intelligence has finally arrived.
This morning, I want to share with you some thoughts in this area, particularly as they relate to the role of regulatory data. But before I do, I must remind you that the views that I express today are my own and do not necessarily reflect the views of the Commission or its staff.
I first spoke publicly about the Commission’s use of machine learning more than 3 years ago. At that time I could not fully envision what it would do for both regulators and market participants. Since then, two new fields of practice have emerged: “RegTech” and “SupTech,” short for Regulatory- and Supervisory-Technology. Each uses machine learning methods to lessen the burden of either complying with or supervising a wide range of regulatory requirements in financial markets. And while neither field has reached maturity, both offer significant promise by way of improved market functioning and increased operational efficiencies.
At the Commission we are currently applying machine learning methods to detect potential market misconduct. Many of the methods are open source and easy to implement for those trained in data science. There is no need to rely on proprietary solutions, captive vendors, or complicated third-party support for data analytic success. This freedom has fueled the rapid innovation at the SEC, and I suspect also among your organizations.
But we all still face significant challenges in adopting these emerging methods. Identifying the appropriate computing environment is one of them. I’m sure the question “should I move to the cloud or keep my analytics on premises?” is not foreign to anyone in this room. Developing the right human capital is another. Everyone knows that they need a good data scientist, even if it is not entirely clear how to define what one is, let alone how to find one.
But there is another challenge that I think will be more enduring. The success of today’s new technology depends on the machine readability of decision-relevant information. And I don’t mean just for numerical data, but for all types of information. This includes narrative disclosures and analyses found in the written word. It also includes contextual information about the information, or data about the data, often referred to as “metadata.” Today’s advanced machine learning methods are able to draw incredibly valuable insights from these types of information, but only when they are made available in formats that allow for large-scale ingestion in a timely and efficient manner.
Humans versus Machines
SEC staff, particularly staff in the Division of Economic and Risk Analysis (aka “DERA”), have long recognized how essential it is to have usable and high-quality data.
The amount of decision-relevant data from SEC registrant disclosures is vast. The EDGAR filing system contains financial information covering more than $82 trillion of assets under management by registered investment advisors. It hosts financial statements by publicly-traded companies with an aggregate market cap of approximately $30 trillion. And since its inception, there have been more than 11 million filings by over 600,000 reporting entities using 478 unique form types. During the calendar year 2016 alone, there were more than 1.5 billion unique requests for this information through the SEC.gov website.
But not all of the data contained in SEC filings is easily accessible from a data analytic perspective. By design, many of the required forms and filings are narrative-based and intended for human readability. In many cases the numerical-based information is unstructured, requiring manual procedures to extract and use it. The same is true for the text-based disclosures.
These features reflect a reporting system designed well before the emergence of machine learning methods. Their arrival has since complicated the use of the EDGAR filing system. On any given day, as much as 85 percent of EDGAR document requests come from internet bots. But that does not mean that the Commission has not been preparing for this day. The first rule mandating a machine-readable disclosure dates back to 2003. And more than a dozen other rules requiring structured disclosure have been proposed or adopted since then.
The unique requirements of machine processing of financial disclosures are regularly addressed through the notice-and-comment rulemaking process, each time a change in registrant disclosure requirements is considered. On occasion, humans and machines may have competing needs. But on all occasions, the Commission has sought to preserve the ability of an investor to easily open up a prospectus, annual report, or other registrant filing to evaluate the merit of the required disclosures.
The key innovation of our developing disclosure technology is making machine accessibility invisible to the rendering of a document for human readability. This is illustrated well by a recently proposed rule that would require SEC reporting companies to file their periodic reports in Inline XBRL. Currently, filers separately report a human-readable html version of a periodic report and a machine-readable version in an eXtensible Business Reporting Language (XBRL) format. This proposed rule, if adopted, would combine the two requirements and create a single document designed to be read equally well by humans and machines.
From a machine-readability perspective, the financial statement data, footnotes, and other key information contained in an Inline-XBRL filing can be easily and automatically extracted, processed, and combined with similar data from other 10-K filings. This aggregation is possible because each of the extractable data elements or sections of textual information is tagged using definitions from a common taxonomy of reporting elements.
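To make the mechanics concrete, the fragment below is a minimal sketch of pulling tagged facts out of an Inline XBRL document. The embedded sample, element values, and two-concept "filing" are invented for illustration; real filings use the full taxonomy with much richer context, unit, and scaling structures.

```python
# Minimal sketch: extract tagged facts from a simplified Inline XBRL fragment.
# The sample document and its values are illustrative, not from a real filing.
import xml.etree.ElementTree as ET

SAMPLE = """<html xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Revenues were $<ix:nonFraction name="us-gaap:Revenues"
       contextRef="FY2017" unitRef="usd" decimals="-6">1,250</ix:nonFraction> million.</p>
    <p>Net income was $<ix:nonFraction name="us-gaap:NetIncomeLoss"
       contextRef="FY2017" unitRef="usd" decimals="-6">210</ix:nonFraction> million.</p>
  </body>
</html>"""

IX = "{http://www.xbrl.org/2013/inlineXBRL}"

def extract_facts(document: str) -> dict:
    """Map each tagged concept name to its numeric value."""
    root = ET.fromstring(document)
    facts = {}
    for el in root.iter(IX + "nonFraction"):
        # The human-readable text ("1,250") doubles as the machine-readable value.
        facts[el.get("name")] = float(el.text.replace(",", ""))
    return facts

print(extract_facts(SAMPLE))
# → {'us-gaap:Revenues': 1250.0, 'us-gaap:NetIncomeLoss': 210.0}
```

Because every filer tags the same concept with the same taxonomy name, the dictionaries produced from many filings can be stacked into one dataset without any manual re-keying.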
From a machine learning perspective, this standardized data can be combined with other relevant financial information and market participant actions to establish patterns that may warrant further inquiry. And that can ultimately lead to predictions about potential future registrant behavior. These are precisely the types of algorithms that staff in DERA are currently developing.
From a human perspective, you can see it for yourself. More than 100 companies are already voluntarily filing with the SEC using this technology. On SEC.gov these filers have an “iXBRL” label next to the html version of their 10-K filing. Click on one to see how a periodic report functions with interactive features not otherwise available from an html filing.
From an overall perspective, this is a good place to pause and remind everyone that the SEC is fundamentally committed to ensuring that all investors and market participants can access the information necessary to make informed financial decisions. But another aspect of the agency’s commitment to investor protection involves the use of sophisticated data analytics to ensure that we have insight into the market, particularly as we seek potential market misconduct.
Some Myths To Dispel About Machine-Readable Reporting Standards
Over the years we’ve encountered a great many learning opportunities in pursuit of making the information in SEC disclosures more accessible to quantitative uses. In many instances, there are common perceptions about data and information access that are misguided, or even wrong. I would like to share a few of them with you here. In some ways it may seem that I am treading old ground, raising issues that we faced from the earliest days of machine readability. But I believe that these misconceptions—these myths—persist no matter the novelty of the technology harnessing the data. And we ignore them at the peril of further innovation.
Myth #1: Electronic access is equivalent to machine readability.
It is often assumed that if a document is electronically accessible then it must also be machine readable. This is not true. The misconception results from confusion over the term “electronic access,” which many take to mean “digitally” accessible. When EDGAR was first launched in the mid-1990s, investors expected to download physical documents electronically over the internet. This marked a major innovation over visits to library reading rooms and microfiche. Real-time access to information revolutionized information processing in financial markets.
But just because a document can be downloaded over the internet does not mean that it can be ingested by a computer algorithm. A document stored in an electronic format, and available for download over the internet, can be impenetrable to machine processing, particularly if it is scanned, stored in a proprietary format, or beset by security settings. And if there is no reporting format telling the machine what it is reading, it may be impossible to make heads or tails of the information ingested.
To be sure, electronic access is a necessary condition of machine readability. But it is not a sufficient one. For advanced machine learning algorithms to generate unique insights, there must be structure to the information being read.
Myth #2: The Commission alone develops the reporting standards incorporated in its rules.
This leads us to another myth among some market observers—that reporting formats are ad hoc and nonstandard. To the contrary, considerable thought is given to reporting formats during the notice-and-comment rulemaking process. And under the National Technology Transfer and Advancement Act, also known as the roll-off-your-tongue “NTTAA,” Federal agencies are required to use technical standards developed by voluntary consensus standards bodies. That is, we borrow from standards developed and/or endorsed by external groups, whenever possible.
This is what the Commission did with the adoption of XBRL for financial statement reporting in 2009, which is an open standard format that is widely available to the public royalty-free at no cost. The standard originated from an AICPA (American Institute of Certified Public Accountants) initiative and was ultimately given its own organizational standing—XBRL International—that now has more than 600 members. And XBRL is now in use in more than 60 countries.
XBRL is not the only externally developed reporting standard that the Commission has considered. In 2015, the Commission proposed rules to require swap data repositories to make security-based swaps data available according to schemas that were published on the Commission’s website. The first international industry standard referenced was “FpML” (Financial products Markup Language), originally developed under the auspices of the International Swaps and Derivatives Association (“ISDA”). The second was “FIXML” (Financial Information eXchange Markup Language), which is owned and maintained by the FIX trading community.
One of the innovations of this proposal, and one that I find personally satisfying, is that the Commission proposed to accommodate both industry standards. While they are not interoperable, the Commission sought to maximize compliance flexibility by developing a common data model that uses as a basis the existing overlap of each standard’s current coverages of security-based swap data. Thus, security-based swap transaction records structured according to either the FpML or FIXML schema could be immediately aggregated, compared, and analyzed by the Commission.
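The idea behind such a common data model can be sketched in a few lines: records arriving in either of two schemas are mapped onto shared field names before analysis. The field names below are hypothetical simplifications for illustration, not the actual FpML or FIXML tags.

```python
# Illustrative sketch of a common data model: normalize transaction records
# from two different schemas into one comparable shape. Field names are
# hypothetical simplifications, not real FpML/FIXML element names.

def normalize(record: dict, schema: str) -> dict:
    """Map a schema-specific record onto shared field names."""
    if schema == "fpml":
        return {"trade_date": record["tradeDate"],
                "notional": record["notionalAmount"],
                "currency": record["currency"]}
    if schema == "fixml":
        return {"trade_date": record["TrdDt"],
                "notional": record["Amt"],
                "currency": record["Ccy"]}
    raise ValueError(f"unknown schema: {schema}")

a = normalize({"tradeDate": "2018-05-03", "notionalAmount": 1_000_000,
               "currency": "USD"}, "fpml")
b = normalize({"TrdDt": "2018-05-03", "Amt": 1_000_000, "Ccy": "USD"}, "fixml")
assert a == b  # records from either schema can now be aggregated and compared
```

The design choice is the same one described above: rather than forcing the industry onto a single standard, the regulator maps the overlap of both standards into one internal model.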
Myth #3: Retail investors don’t need machine-readable data.
Myth number three—retail investors don’t need machine-readable data. It is an unfortunate but common refrain among some market observers that the average retail investor does not benefit from structured data disclosures, such as those made using XBRL. This is translated more broadly to mean that machine-readable data is unnecessary for most investors. They arrive at this conclusion because processing the files can require specialized software and aggregating the information into usable datasets for analysis requires specialized skills. And as a result, only sophisticated (and resource-rich) investors benefit.
What this assertion ignores is that structured disclosures enable third-party vendors to make this information available to retail investors at low or even no cost. Machine-readable disclosures fuel many online financial tools popular with investors. Look no further than Yahoo or Google finance. They report easily accessible financial statement information from public companies. And if investors want this data organized across filers for comparison and analysis, they can access it directly from SEC.gov. The Commission staff regularly lowers the burden of accessing and analyzing the data from forms and filings by making condensed data sets available on the SEC website. These data are even used by large data aggregators. The Google Cloud Platform recently included the SEC’s Public Dataset in its cloud platform.
So while it may be true that many investors do not directly use structured data, the fact is that they do consume the data downstream. Such access would be impossible without structured data. This is particularly true for smaller SEC reporting companies. Their financials would escape coverage by data vendors, and thus market analysts, if the data had to be manually extracted from filings. This was the case prior to the commencement of XBRL reporting in 2009. At that time, only 70% of SEC reporting companies received coverage by major data vendors. Those without coverage were predominantly smaller companies: companies of insufficient investment scale to merit the cost of manually collecting the information.
Myth #4: Requiring machine-readable reporting standards ensures high-quality data.
On to myth number four—machine-readable reporting standards ensure high-quality data. Not true. Despite claims to the contrary, computer algorithms can’t fix poorly reported data; they can only maximize its usefulness. Unless reporting entities comply with both the letter and the spirit of promulgated reporting requirements, a well-designed standard may still be insufficient for today’s advanced analytics to generate unique insights about market behaviors.
To give an example of what I mean, consider a date field in a structured disclosure format. A filer can comply with the reporting standard by entering a valid date, but if the date doesn’t match to the event or action being reported, then a machine learning algorithm will be assessing incorrect information. No amount of data format validation can fix a reporting error.
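The limits of format validation can be shown in a few lines. The sketch below checks only whether a value parses as an ISO-8601 date; a syntactically valid but factually wrong date sails through, which is precisely the point.

```python
# Sketch of the limits of format validation: a date can be syntactically
# valid and still be the wrong date for the event being reported.
from datetime import date

def is_valid_iso_date(value: str) -> bool:
    """Format validation only: does the value parse as an ISO-8601 date?"""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

assert is_valid_iso_date("2018-05-03")      # well-formed and correct
assert is_valid_iso_date("2017-05-03")      # well-formed, but if the event
                                            # occurred in 2018, no format
                                            # validator can catch the mistake
assert not is_valid_iso_date("2018-13-40")  # only malformed values are caught
```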
A more subtle example is the use of extensions to standard taxonomies. In particular, no reporting language can reasonably account for every registrant action. So provisions are made, for example with XBRL reporting, to allow filers to extend the standard taxonomy to reflect non-standard items. But if discretion is used to create an extension when a standard reporting element should reasonably be used, then comparability across filers is unnecessarily diminished.
Today’s advanced machine learning algorithms have done much to extract usable information from less than perfectly reported data. Some of these innovations are remarkable. To give an example, at the SEC, if a ticker symbol is misreported, we have algorithms that can suggest the correct symbol in light of other information reported. And the suggestion comes with a score on the likelihood that the alternative ticker is correct.
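A heavily simplified sketch of this kind of correction is shown below. It is not the SEC's actual algorithm (which also weighs other reported information); it merely fuzzy-matches a misreported symbol against a small, invented list of known tickers and attaches a similarity score to the suggestion.

```python
# Simplified, hypothetical sketch: suggest a correction for a misreported
# ticker symbol, with a similarity score. Not the SEC's production algorithm.
import difflib

KNOWN_TICKERS = ["AAPL", "MSFT", "GOOG", "AMZN", "JPM", "XOM"]

def suggest_ticker(reported: str):
    """Return (best_match, score) for a possibly misreported symbol."""
    scored = [(t, difflib.SequenceMatcher(None, reported.upper(), t).ratio())
              for t in KNOWN_TICKERS]
    # The highest-scoring candidate is the suggested correction.
    return max(scored, key=lambda pair: pair[1])

symbol, score = suggest_ticker("APPL")  # transposed letters
print(symbol, round(score, 2))
```

A production system would, as noted above, condition the score on the rest of the filing (issuer name, CUSIP, exchange) rather than on string similarity alone.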
But from a market supervision and investor perspective, there is no substitute for reporting accurate data in the first instance. To this end, Commission staff report observations about the quality of information being reported to help facilitate compliance with both the letter and spirit of the rules.
Myth #5: We don’t need the public’s views any more.
The final myth—we don’t need to hear from you. Those who know data the best often just assume that we know their views and will do the “right” thing when it comes to implementing new reporting requirements. I mentioned earlier how the Commission considers a range of technological and data structuring options when considering new or amended financial disclosures. The agency typically does so in the form of notice-and-comment rulemaking. It is vital that we hear from the consumers of data, from the experts who know best how the data could be used. Because while we have considerable in-house expertise, there is no substitute for hearing directly from the public.
This need persists regardless of the “subject” of the disclosure. Whether the agency is addressing disclosures from public companies, broker-dealers, investment advisors, clearing agencies, or credit rating agencies, the fundamental issues that make data useful and usable remain. We hear a lot from market participants on the value of the information being disclosed. We hear far less frequently about the manner in which it should or could be disclosed. As experts in the data management field, I urge all of you to take the time to let your thoughts be known. And I look forward to hearing what they are.
I started these remarks by acknowledging that what has fueled the machine learning revolution is data. And not just any data, but data designed to answer questions that market participants ask. Sophisticated algorithms depend on this data being of high quality and being machine readable. When applied to the emerging fields of SupTech and RegTech, there is tremendous potential for enhanced regulatory compliance. The enhancements can come at a lower cost to registrants. Along with all of you, I look forward to the future benefits to regulators, investors, and market analysts.
 The Securities and Exchange Commission disclaims responsibility for any private publication or statement of any SEC employee or Commissioner. This speech expresses the author’s views and does not necessarily reflect those of the Commission, the individual Commissioners, or members of the staff. A special thanks to Vanessa Countryman, Hermine Wong, Michael Lim, Mike Willis, and Pam Urban for their valuable feedback and contributions.
 Scott W. Bauguess, The Hope and Limitations of Machine Learning in Market Risk Assessment (Mar. 6, 2015), http://cfe.columbia.edu/files/seasieor/center-financial-engineering/presentations/MachineLearningSECRiskAssessment030615public.pdf.
 Scott W. Bauguess, The Role of Big Data, Machine Learning, and AI in Assessing Risks: a Regulatory Perspective, OpRisk North America (June 2017).
 Internal DERA analysis using EDGAR master index files (https://www.sec.gov/Archives/edgar/full-index/) through April 2017. Estimated filings based on unique document accession numbers during this period; estimated reporting entities based on unique SEC central index keys associated with the filings; and the estimated number of document types excludes amended filings.
 Internal DERA analysis of SEC.gov web traffic.
 Internal DERA analysis of SEC.gov web traffic.
 Inline XBRL Filing of Tagged Data, Release No. 33-10323 (Mar. 1, 2017) [82 FR 14282], https://www.sec.gov/rules/proposed/2017/33-10323.pdf.
 See Sample Form 10-Q Filing at: https://www.sec.gov/ixviewer/samples/bst/out/bst-20160930.htm
 National Technology Transfer and Advancement Act (NTTAA) 15 USC 3701 (1996) (“…Federal agencies and departments shall use technical standards that are developed or adopted by voluntary consensus standards bodies, using such technical standards as a means to carry out policy objectives or activities determined by the agencies and departments.”)
 Interactive Data to Improve Financial Reporting, Release No. 33-9002 (Jan. 30, 2009), p. 149 [74 FR 6776] (“We also note that XBRL is an ‘open standard’ format and its technological specifications are widely available to the public royalty-free at no cost.”), https://www.sec.gov/rules/final/2009/33-9002.pdf.
 History of XBRL: Karen Kernan, XBRL: The Story of Our New Language, https://www.aicpa.org/content/dam/aicpa/interestareas/frc/accountingfinancialreporting/xbrl/downloadabledocuments/xbrl-09-web-final.pdf; information about XBRL consortium: https://www.xbrl.org/the-consortium/about/membership-list/.
 Establishing the Form and Manner with which Security-Based Swap Data Repositories Must Make Security-Based Swap Data Available to the Commission, Release No. 34-76624 (Dec. 11, 2015) [80 FR 79757], https://www.sec.gov/rules/proposed/2015/34-76624.pdf.
 Interactive Data to Improve Financial Reporting, Release No. 33-9002 (Jan. 30, 2009), pp. 125-126.
 See, e.g., “Staff Observations, Guidance, and Trends” https://www.sec.gov/structureddata/osdstaffobsandguide.