Is Bigger Data Always Better?
Next in our blog series taken from our Don’t Believe the Hype whitepaper, we discuss how best to model data and why too much data is not necessarily a good thing.
Modelling the Data
Even when all the available data has been collected and logged in the dataset, the work has barely begun. Raw data, especially when it relates to real-world activities, can often paint a confusing, fragmented and contradictory picture. This makes it difficult to know how the data relates to what is really going on.
For example, suppose three different pension funds each publish performance data stating the returns they have seen from the same fund. The figures are all slightly different, yet the fund cannot be performing at three different levels at once. The differences lie in the specific commitments the pension funds made and the details of the LPAs they signed, but that information is unavailable. How can these disparate figures be reconciled to produce the most accurate picture?
Every data provider, then, has to apply some curation and modelling to the raw dataset in order to present actionable intelligence rather than a confusing sprawl of data. By its nature, this requires making some assumptions, performing some calculations and extrapolating from known precedents. But this is a double-edged sword: introducing assumptions and processing also risks introducing bias, or obscuring details that matter to the end-user.
To return to the performance problem, which route is best? To take the performance from the largest investor? The one considered most reputable? To take an average of the figures? And if so, should it be weighted by commitment size? None seem like desirable options, and all can result in providing information that does not benefit someone looking into the performance of the fund in question.
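To make one of these options concrete, here is a minimal sketch of the commitment-weighted approach mentioned above. The investor names, commitment sizes and return figures are entirely hypothetical, and this is an illustration of the idea rather than any provider's actual methodology:

```python
# Hypothetical reports of the same fund's net IRR from three investors.
reports = [
    {"investor": "Pension A", "commitment_mn": 50.0, "net_irr_pct": 12.1},
    {"investor": "Pension B", "commitment_mn": 20.0, "net_irr_pct": 11.4},
    {"investor": "Pension C", "commitment_mn": 10.0, "net_irr_pct": 13.0},
]

def commitment_weighted_irr(reports):
    """Average the reported returns, weighting each by its commitment size."""
    total = sum(r["commitment_mn"] for r in reports)
    return sum(r["net_irr_pct"] * r["commitment_mn"] / total for r in reports)

print(round(commitment_weighted_irr(reports), 2))  # 12.04
```

Note how the weighted figure sits closest to the largest investor's report, which is exactly the kind of embedded assumption the text warns about: the calculation quietly privileges one source over the others.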
There are two key ways to mitigate this. The first is to make intelligent models that are cross-checked and corroborated to be consistent. Preqin does this by cross-referencing data points with each other. We then query inconsistent data with fund managers or investors themselves, asking them to provide further evidence of the figures in question. This ensures that we present as cohesive a dataset as possible – ultimately, if our researchers are not satisfied with the validity of a data point, we will not include it in our models.
This goes beyond human quality controls and into the aggregating calculations that data providers make. For instance, Preqin has a cash flow modelling tool which helps investors predict when they might expect capital calls and distributions from a fund. This is based on the historical cash flow data of more than 4,300 funds, which enables us to benchmark expected cash flow timings. Having a reliable and robust benchmark means that outliers are more quickly identified and, if needed, flagged for further review.
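The outlier-flagging idea can be sketched as a simple comparison of a fund's observed capital-call pace against a benchmark of expected timings. The benchmark medians, tolerance and observed figures below are illustrative assumptions, not Preqin's actual model:

```python
# Hypothetical benchmark: median cumulative capital called (% of
# commitments) by year since inception, derived from comparable funds.
benchmark_called_pct = {1: 25.0, 2: 50.0, 3: 70.0, 4: 85.0}

def flag_outlier_years(observed_pct, benchmark=benchmark_called_pct, tolerance=15.0):
    """Return the years where a fund's call pace strays from the benchmark
    by more than the tolerance, flagging them for further review."""
    return [
        year for year, expected in benchmark.items()
        if year in observed_pct and abs(observed_pct[year] - expected) > tolerance
    ]

observed = {1: 8.0, 2: 48.0, 3: 72.0, 4: 60.0}
print(flag_outlier_years(observed))  # [1, 4]
```

Years 1 and 4 deviate from the benchmark by more than 15 percentage points, so they would be queued for a researcher to investigate rather than silently published.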
The other key factor is to be transparent about the sources of data, especially in cases where there are multiple available data points saying different things. Fund performance is a prime example of this, as noted in the example above. Where information is gathered from different sources, it is important to recognize that different users will find different sources more relevant or useful. As such, we make all the source data for performance information available, and when building benchmarks or examining fund performance, users can switch between the information sources that best suit their needs.
Is Bigger Always Better?
There are other assumptions that data providers make on behalf of users that go in the other direction – including more information than is useful, rather than condensing multiple sources. There is a frequently drawn conclusion among both data providers and commentators that big numbers must always be better – an attitude that extends well beyond financial data into all walks of life. However, it is not always a useful exercise to gather as much information as possible and break it into separate data points, and it can be disingenuous to claim that these figures represent a larger known universe.
For example, if a hedge fund has 20 known share classes that all operate on a master/feeder structure, they should plausibly be considered one entity rather than 20. By the same token, if the master fund represents the pooled assets of the feeder fund, to count those assets twice over would not be a fair reflection of the fund’s size. Similarly, if an insurance company has 10 sibling entities that all make investments in alternative assets, but through a single investment arm of the parent corporation, it must be considered a single investor rather than 10 exactly-aligned-but-apparently-separate ones.
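The de-duplication logic for master/feeder structures can be sketched as follows: feeder vehicles are grouped under their master fund, and only the master's pooled assets are counted, so feeder AUM is never tallied twice. All entity names and figures here are hypothetical:

```python
# Hypothetical vehicles: feeders point at their master; standalones do not.
vehicles = [
    {"name": "Alpha Master Fund", "master": None, "aum_mn": 500.0},
    {"name": "Alpha Feeder (Cayman)", "master": "Alpha Master Fund", "aum_mn": 300.0},
    {"name": "Alpha Feeder (Delaware)", "master": "Alpha Master Fund", "aum_mn": 200.0},
    {"name": "Beta Standalone Fund", "master": None, "aum_mn": 150.0},
]

def distinct_fund_aum(vehicles):
    """Count each master/feeder complex once, using the master's pooled AUM;
    feeder assets are already inside the master, so they are skipped."""
    return {v["name"]: v["aum_mn"] for v in vehicles if v["master"] is None}

totals = distinct_fund_aum(vehicles)
print(len(totals), sum(totals.values()))  # 2 650.0
```

A naive sum over all four vehicles would report 1,150mn across four funds; the de-duplicated view reports 650mn across two, which is the figure that actually reflects investable reality.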
The inaccuracies in the ‘bigger is better’ approach go far beyond a misstated headline figure. It can actively obscure the data’s relevance to the real world and have material consequences for users of the dataset. A fund manager might approach one of the insurance companies with a fund pitch, only to find out that they do not do any investing on their own behalf. An investor may take a hedge fund’s double-counted size as a testament to its appeal and change its investing decisions accordingly.
This highlights the importance of data curation, and that data providers must constantly be relating their figures back to the activities of industry participants to ensure that they reflect reality. Whether the initial information is gathered by human interaction or machine learning, it needs to be checked for quality, corroborated and tied back to reality to be of use.
Preqin believes this requires a human touch. While machines may one day be able to make judgements on information quality and perform corroborating checks, we trust that trained, engaged and intelligent human curation will always result in the best-quality data that is most accurate and actionable.
Article by Preqin