**An Aggregation Algorithm for Blockstack**

Efe A. Ok* Pietro Ortoleva† Ennio Stacchetti‡

December 16, 2018

**1** **Introduction**

The primary objective of this report is to introduce a dynamic method of aggregation for the various types of scores that an app may get from a number of app review companies. We also describe a particular payment scheme for the best performing apps, and moreover, discuss methods of evaluating the performance of the app review companies themselves after, say, one year from the initial period.

**2** **Description of the Algorithm**

The inputs of the aggregation algorithm described below are the scores supplied by a given list of app review companies (for each app to be evaluated) at periods *m* = 0, 1, 2, .... At present, this list consists of three companies, namely, *Democracy Earth*, *Product Hunt*, and a third one yet to be determined; for the purpose of this discussion, just as an example, we will use Undertesting.com. In what follows, we denote these companies by DE, PH and UT, respectively. In turn, we adopt the following notation for the raw data we obtain from these app reviewers:

$U_i$ := the number of up votes App *i* receives in DE;

$D_i$ := the number of down votes App *i* receives in DE;

$t_i$ := the PH team score of App *i*;

$c_i$ := the PH community score of App *i*;

$u_i^k$ := the UT score of App *i* in category *k*, for *k* = 1, 2, 3, 4.

Note that the vote counts $U_i$ and $D_i$ can be any positive integers, but the remaining six scores are all integers between 0 and 100. As we shall explain below, the algorithm considers the possibility that some, or even all, of these inputs may be missing for an app.

The aggregation algorithm we suggest is recursive, that is, in every period of application other than the initial one, it uses the scores obtained in the previous periods. In addition, it is applied in four steps in the initial period of application, and in five steps in the future periods.

*Department of Economics and Courant Institute of Applied Mathematics, New York University.

†Department of Economics, Princeton University

‡Department of Economics, New York University.

In what follows, we describe these steps in detail, often pausing to give the justification for their structure. Implementation of this algorithm requires Blockstack to choose six parameters. We explain below a guideline for choosing these parameters, and then suggest specific values that appear rather reasonable for the initial implementation of the procedure. (Should one wish to do so, these choices can later be adjusted effortlessly.) In addition, and mainly for the programmers, we outline the main part of our algorithm in purely formal terms in the Appendix of this report.

**2.1** **Notation**

In what follows, the *mean* and *standard deviation* of any given collection of numbers $a_1, ..., a_n$ are denoted as Avg($a_j$) and Std($a_j$), respectively. Put precisely, we have

$$\operatorname{Avg}(a_j) := \frac{1}{n}\sum_{j=1}^{n} a_j \qquad\text{and}\qquad \operatorname{Std}(a_j) := \sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(a_j - \operatorname{Avg}(a_j)\bigr)^2}.$$

(Throughout this report, we use the notation := to define the left-hand side of a given expression by its right-hand side.)
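These two statistics can be sketched directly in code. A minimal Python sketch, assuming the population form of the standard deviation (divide by *n*); the report does not specify the *n* − 1 sample variant:

```python
def avg(xs):
    """Avg(a_j): arithmetic mean of a collection of numbers."""
    return sum(xs) / len(xs)

def std(xs):
    """Std(a_j): population standard deviation around avg(xs)."""
    m = avg(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
```

For instance, `avg([1, 2, 3])` is 2, and `std([2, 4, 4, 4, 5, 5, 7, 9])` is 2.0.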

**2.2** **The Aggregation Procedure**

**2.2.1** **INITIAL PERIOD (Period 0)**

**Step 1. Normalization of Scores**

The main objective of the algorithm in the initial period is to aggregate all scores that an app gets from different sources into a single score. To be able to do this meaningfully, these scores should be measured on a relative scale, independently of the measurement units. (For instance, if we multiply every score that every app gets by 10, the final ranking of the aggregate scores should not change.) To achieve this, we adopt a standard procedure of descriptive statistics, and convert the raw scores into *z-*scores. That is, for each app, we normalize each score (of a given type) by first centering that score around the mean (of all scores of that type) and then dividing the resulting quantity by the standard deviation (of all scores of that type). The resulting *normalized* score of an app tells us to what extent the score of that app is above or below the mean score, measured in standard deviation units. For instance, a normalized score of 0 means that the corresponding raw score is exactly the average of all raw scores (of the same type). On the other hand, a normalized score of –1 means that the corresponding raw score is one standard deviation below the average raw score (of the same type), and one of 0.5 means that the corresponding raw score is one half of a standard deviation above the average raw score (of the same type).
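The z-score conversion described above can be sketched as follows; this is a minimal illustration (population standard deviation assumed), not the report's exact implementation:

```python
def z_scores(raw):
    """Convert a list of raw scores of one type into z-scores."""
    n = len(raw)
    mean = sum(raw) / n
    sd = (sum((x - mean) ** 2 for x in raw) / n) ** 0.5
    return [(x - mean) / sd for x in raw]  # assumes sd > 0
```

Note that `z_scores([60, 70, 80])` and `z_scores([600, 700, 800])` coincide, illustrating the independence from measurement units.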

Step 1 of our procedure in period 0, insofar as it pertains to PH and UT, is nothing but carrying out the said normalization for each type of score that an app may get from these companies. In the case of the scores that an app can get from DE, however, there is an additional wrinkle: the number of votes that an app gets in DE conveys information about how “popular” that app is, independently of how “liked” it is. To wit, consider two apps, App 1 and App 2, that have been out for more or less the same amount of time.

Suppose App 1 has received two up votes from DE and no down votes, while App 2 has received 1 up vote and 1000 down votes. While, on the net scale, App 1 appears more successful than App 2, insofar as the attention received in the market is concerned, App 2 outperforms App 1.(1)

To account for “positive popularity” and “popularity” effects separately, therefore, we convert the votes of apps from DE into two different scores, one measuring how liked an app is, and the other measuring how much traction it gets in the market. (In the implementation of the algorithm, one may choose to weigh these scores differently, say, by putting more weight on *positive* popularity; see Step 3.)

**Step 1.a **[DE] Suppose DE reports in period 0 that App *i* has received at least one vote. We then create two indices for this app. First, note that

$$R_i := \frac{U_i - D_i}{U_i + D_i}$$

is the *net* up-vote for App *i* in DE as a percentage of the total votes this app gets, where $U_i$ and $D_i$ are the numbers of up and down votes, respectively. With the motivation outlined above, we normalize this ratio by first centering it around the mean (of all these ratios as *i* varies), and then dividing it by the standard deviation (of all these ratios). This leads to the index

$$\lambda_i := \frac{R_i - \operatorname{Avg}(R_j)}{\operatorname{Std}(R_j)},$$

provided that $\operatorname{Std}(R_j) > 0$. This index measures “how liked” App *i* is, at least by the participants of DE. Clearly, $\lambda_i$ may be positive or negative. Put precisely, $\lambda_i > 0$ if the percentage of the *net* up-votes for App *i* is higher than the average percentage of the *net* up-votes across all the reviewed apps (each with at least one vote). And $\lambda_i < 0$ if the percentage of the *net* up-votes for App *i* falls below this average.

Second, recall that $V_i := U_i + D_i$ is the total number of votes that App *i* receives in DE in period 0. Then, if we normalize this statistic, we obtain

$$\tau_i := \frac{V_i - \operatorname{Avg}(V_j)}{\operatorname{Std}(V_j)},$$

provided that the denominator is not zero. This index measures “the relative size of traction” that App *i* gets from the voters in DE in period 0, may these votes be up or down. Again, $\tau_i$ may be positive or negative. In particular, an app that gets fewer votes than the average receives a negative score.
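The two DE indices can be sketched together. A minimal Python illustration (the argument names `up` and `down` are illustrative stand-ins for the vote counts; normalization assumes a nonzero standard deviation):

```python
def de_indices(up, down):
    """up, down: lists of vote counts, one entry per app with >= 1 vote.
    Returns (liked, traction): the two normalized DE indices."""
    ratios = [(u - d) / (u + d) for u, d in zip(up, down)]  # net up-vote share
    totals = [u + d for u, d in zip(up, down)]              # total attention

    def normalize(xs):
        n = len(xs)
        m = sum(xs) / n
        sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
        return [(x - m) / sd for x in xs]  # assumes sd > 0

    return normalize(ratios), normalize(totals)
```

On the two-app example above (App 1: 2 up, 0 down; App 2: 1 up, 1000 down), the “liked” index ranks App 1 first while the “traction” index ranks App 2 first.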

(1) Perhaps a concrete example from the online video gaming world may be illustrative here. Bungie’s famous 2014 game DESTINY has generally received subpar reviews from the critics. (The Metacritic score of this game is, for instance, 75/100.) And yet in 2014 it was the third best-selling game in the US. (By the end of 2015, DESTINY was purchased by 20 million users worldwide.) By contrast, FromSoftware’s 2011 game DARK SOULS has received rave reviews and obtained several “Game of the Year” awards. (The Metacritic score of this game is 90/100. Indeed, DARK SOULS is presently ranked by PC Gamer as the fifth in the “Best RPGs of all time” list.) And yet, this game is famously difficult, and as a consequence, with regard to mere popularity, ranked way below DESTINY. (It sold about 2.37 million units worldwide.) Deciding which of these two games is “more successful” obviously depends on the criteria with which one chooses to assess “success.”

**Step 1.b **[PH] Converting the data into standard units (again, to ensure independence of units of measurement), we compute

$$t_i^{*} := \frac{t_i - \operatorname{Avg}(t_j)}{\operatorname{Std}(t_j)} \qquad\text{and}\qquad c_i^{*} := \frac{c_i - \operatorname{Avg}(c_j)}{\operatorname{Std}(c_j)},$$

provided that the denominators of these fractions are nonzero. The first index is undefined if $\operatorname{Std}(t_j) = 0$ (that is, if every app received the same score in period 0 from the PH team). Similarly, the second is undefined if $\operatorname{Std}(c_j) = 0$.

These are the normalized period 0 scores of App *i* from the PH team and community, respectively.

**Step 1.c **[UT] Again, we convert the data into standard units:

$$u_i^{k*} := \frac{u_i^k - \operatorname{Avg}(u_j^k)}{\operatorname{Std}(u_j^k)},$$

provided that $\operatorname{Std}(u_j^k) > 0$.

These are the normalized period 0 scores of App *i* from UT in each category *k* = 1, 2, 3, 4.

**Step 2. Extension of Normalized Scores to Incorporate Missing Data**

At this stage of the procedure, we turn to those apps that have not received any votes in DE in period 0, or were not evaluated on a given category or by a given app reviewer. As a general principle, we assign –1 as the normalized score of such apps for every score they miss. Given the nature of the normalization we introduced in Step 1, this means that these apps receive the score of an app that sits (in terms of its raw score) exactly one standard deviation *below* the average. Thus, the procedure punishes these apps, but does not necessarily give them the worst evaluation. Yet, this value is only suggestive. It can be replaced with smaller values (say, –2, –3, etc.), if the implementer wishes to regard missing data as a more pressing negative signal about the quality of the app. Conversely, one may assign higher values than –1 (say, –0.5) if she wishes to downplay this signal, and bring the “imaginary” (normalized) score of the app closer to the average (which is 0).(2)

(2) Note that, in general, this procedure implies that an app that has not been evaluated by *any* reviewer may obtain a higher score than an app that has been reviewed by all three, but received low scores. If one wishes to avoid this situation, (s)he can simply exclude the apps that have not received *any* ranking from the list of apps.

**Step 2.a **[DE] Let us introduce the following dummy variable:

$$\delta_i := \begin{cases} 1, & \text{if App } i \text{ received at least one vote from DE in period 0,}\\ 0, & \text{otherwise.}\end{cases}$$

Thus, $\delta_i$ equals 1 if App *i* has received at least one vote from DE voters in period 0, and 0 otherwise. We extend the first index defined in Step 1 for DE as follows:

$$\hat\lambda_i := \delta_i \lambda_i - (1 - \delta_i).$$

Clearly, $\hat\lambda_i$ equals $\lambda_i$ if App *i* receives at least one vote in DE, while it equals –1 otherwise.

A similar extension is not needed for the “market traction” index $\tau_i$, because the problem of not “receiving any votes” is already accounted for in that index.(3) Thus, we keep that index as is, except that we modify our notation for it:

$$\hat\tau_i := \tau_i.$$

**Step 2.b **[PH] In the case of PH, the apps that are not reviewed in period 0 by either the team or the community will have missing data. Again, we assign to such apps the normalized score of the app whose raw score is exactly one standard deviation below the average raw scores, namely, –1. To this end, we consider the following dummy variable

$$\delta_i^{\text{team}} := \begin{cases} 1, & \text{if the PH team reviewed App } i \text{ in period 0,}\\ 0, & \text{otherwise,}\end{cases}$$

and define $\delta_i^{\text{com}}$ analogously. Then, we define

$$\hat t_i := \delta_i^{\text{team}} t_i^{*} - (1 - \delta_i^{\text{team}}) \qquad\text{and}\qquad \hat c_i := \delta_i^{\text{com}} c_i^{*} - (1 - \delta_i^{\text{com}}).$$

For instance, $\hat t_i$ equals $t_i^{*}$ if the PH team has supplied a review for App *i*, and it equals –1 otherwise.

**Step 2.c **[UT] We extend the four indices we defined for UT in Step 1 in precisely the same way to account for apps that have not been evaluated in period 0 in a given category. Put precisely, we put

$$\delta_i^{k} := \begin{cases} 1, & \text{if App } i \text{ was evaluated by UT in category } k \text{ in period 0,}\\ 0, & \text{otherwise,}\end{cases}$$

and then define

$$\hat u_i^{k} := \delta_i^{k} u_i^{k*} - (1 - \delta_i^{k})$$

(3) The only exception to this is the situation where all apps get exactly the same number of votes in DE. We abstract away here from this remote possibility.

for each *k* = 1, 2, 3, 4.
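The missing-data rule of Step 2 is the same in every case: keep the normalized score when it exists, and substitute a default of –1 when it does not. A minimal Python sketch of this rule (the default is exposed as a parameter, since the report notes it may be tuned):

```python
MISSING_DEFAULT = -1.0  # the report's suggested penalty for a missing score

def extend_scores(apps, scores, default=MISSING_DEFAULT):
    """apps: list of app ids; scores: dict mapping the apps that were
    actually reviewed to their normalized scores. Returns a complete
    mapping, filling every missing entry with `default`."""
    return {i: scores.get(i, default) for i in apps}
```

For example, `extend_scores(["a", "b", "c"], {"a": 0.4, "b": -0.2})` assigns –1.0 to app `"c"` and leaves the other two scores untouched.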

**Step 3. Aggregation within the Reviewers**

In the case of UT, we have four scores, one for each category. In the case of PH, we have two scores, one from the team and one from the community. In the case of DE, we again have two scores, one measuring the “desirability” of the app, and one measuring the “attention” that app receives in the market. In this step we aggregate the scores that an app has received from any one of these reviewers into a single score to obtain “the” reviewer score of that app. It seems quite natural that we use a weighted average scheme for this, where the weights are to be determined by the implementer of the algorithm (Blockstack).

**Step 3.a **[DE] We define the (normalized) period 0 score that App *i* receives from DE as

$$\operatorname{Score}_{\mathrm{DE}}(i;\alpha) := \alpha \hat\lambda_i + (1-\alpha)\hat\tau_i,$$

where α is a real number between 0 and 1. The number α reflects the relative importance of the criterion that “App *i* is liked in the market” compared to the criterion that “App *i* has made noise in the market.” If one chooses α = 1, then this corresponds to the situation where the implementer cares only about whether or not App *i* is liked in the market (even if this may be by a small group of voters). At the other extreme, if one chooses α = 0, then this corresponds to the situation where the implementer cares only about whether or not App *i* has gotten good traction in the market (even though the app is not found appealing by a large group of voters). Or, if one wishes to give equal weight to “desirability” and “popularity,” (s)he can set α = 1/2. Indeed, it does seem reasonable that one should choose α strictly between 0 and 1, giving weight to both criteria in the final score. This is truly up to the discretion of the implementer, but if only as a rough guideline, we think it makes sense to give a bit more weight to how much an app is liked in the market than to how much traction it got, and hence suggest choosing an α somewhat above 1/2. That is, our recommendation is to use the index

$$\operatorname{Score}_{\mathrm{DE}}(i) := \operatorname{Score}_{\mathrm{DE}}(i;\alpha), \qquad \text{with } \alpha \text{ somewhat above } \tfrac{1}{2},$$

as “the” period 0 score App *i* receives from DE.

**Step 3.b **[PH] We define the (normalized) period 0 score that App *i* receives from PH as

$$\operatorname{Score}_{\mathrm{PH}}(i;\alpha) := \alpha \hat t_i + (1-\alpha)\hat c_i,$$

where, again, α is a real number between 0 and 1. The “use” of α in this formula is analogous to its use in ScoreDE(*i*; α) discussed above. The actual choice of α is again up to Blockstack, but it seems reasonable to us that the evaluations of the PH team and community should be weighed equally, which suggests setting α = 1/2. Thus, our recommendation is to use the index

$$\operatorname{Score}_{\mathrm{PH}}(i) := \tfrac{1}{2}\,\hat t_i + \tfrac{1}{2}\,\hat c_i$$

as “the” period 0 score App *i* receives from PH.

**Step 3.c **[UT] We define the (normalized) period 0 score that App *i* receives from UT as

$$\operatorname{Score}_{\mathrm{UT}}(i;\boldsymbol{\alpha}) := \sum_{k=1}^{4} \alpha_k \hat u_i^{k},$$

where **α** := (α1, α2, α3, α4) is a vector of four nonnegative numbers that sum up to 1. Once again, choosing the exact value of this vector is up to the implementer of the procedure. If, for instance, one wishes to weigh category 1 as the most important category, category 2 as the second most important, and categories 3 and 4 as equally important, she can use the weight vector (6/13, 3/13, 2/13, 2/13). (These weights render category 1 twice as important as category 2, and thrice as important as categories 3 and 4.) At present, we do not have any reason to adopt unequal weights for the involved categories, so we suggest using equal weights. Thus, our present recommendation is to use the index

$$\operatorname{Score}_{\mathrm{UT}}(i) := \tfrac{1}{4}\sum_{k=1}^{4} \hat u_i^{k}$$

as “the” period 0 score App *i* receives from UT.
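Step 3 is a plain weighted average per reviewer. A minimal Python sketch under the weights discussed above (equal weights for the two PH scores and the four UT categories; the DE weight `alpha` is kept as a parameter, since the report leaves its exact recommended value to the implementer):

```python
def score_de(liked, traction, alpha):
    """Weighted average of the extended DE 'liked' and 'traction' indices."""
    return alpha * liked + (1 - alpha) * traction

def score_ph(team, community, alpha=0.5):
    """Equal-weight average of the extended PH team and community scores."""
    return alpha * team + (1 - alpha) * community

def score_ut(cats, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted average of the four extended UT category scores."""
    return sum(w * s for w, s in zip(weights, cats))
```

For example, a team score of 1 and a community score of –1 average out to a PH score of 0.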

**Step 4. Aggregation across the Reviewers**

In this step we aggregate the normalized period 0 scores that an app has received across the three reviewers. Our objective is to do this in a way that may help reduce the impact of any single app reviewer’s score on the total score. After all, it seems to us that exerting an influence on all three of the reviewers (by means of bribes or other types of manipulation) may be much more costly than affecting the score of only one reviewer. Manipulating one (or, to a lesser extent, two) of the review companies is, at least theoretically, an option that an app can opt for, unless this is somehow made costly by the aggregation method in effect. To reduce the incentives to do so, our method uses an aggregation method across reviewers that is strictly concave for positive (normalized) scores, and strictly convex for negative (normalized) scores.

Put precisely, we define the function

$$\Phi(x;\theta) := \begin{cases} x^{\theta}, & \text{if } x \geq 0,\\ -(-x)^{\theta}, & \text{if } x < 0,\end{cases}$$

where θ is a real number in (0, 1]. The final period 0 score of App *i* is obtained as:

$$\operatorname{Score}_0(i) := \tfrac{1}{3}\bigl[\Phi(\operatorname{Score}_{\mathrm{DE}}(i);\theta) + \Phi(\operatorname{Score}_{\mathrm{PH}}(i);\theta) + \Phi(\operatorname{Score}_{\mathrm{UT}}(i);\theta)\bigr].$$

For the initial stage of the implementation, we suggest using θ = 1/2 in this formula. Thus, our present recommendation is to use the index

$$\operatorname{Score}_0(i) := \tfrac{1}{3}\bigl[\Phi\bigl(\operatorname{Score}_{\mathrm{DE}}(i);\tfrac{1}{2}\bigr) + \Phi\bigl(\operatorname{Score}_{\mathrm{PH}}(i);\tfrac{1}{2}\bigr) + \Phi\bigl(\operatorname{Score}_{\mathrm{UT}}(i);\tfrac{1}{2}\bigr)\bigr]$$

as “the” period 0 score of App *i*.(4)

(4) Notice that each app reviewer is given the same weight of importance in this aggregation. One can of course choose to do this differently, but we have no reason to assign unequal weights to the review companies at present. If, in the future, it is discovered that one of these companies is more open to manipulation than others, one may consider transferring some of the weight of this company to the other two.

We plot the function in Figure 1. In general, the lower the *θ*, the more concave the transformation Φ(·; *θ*) for above-the-average (normalized) scores, and the more convex for those that are below the average. For (normalized) values above the average, a lower *θ* means that a smaller effect is attributed to any single reviewer, thereby reducing the impact of any *one* app reviewer. In turn, again because it may simply be very costly to bribe all three reviewers, this may help reduce the overall impact of potential manipulation. Similarly, the convex transformation for negative values reduces the incentive for an app to lower the ranking of its competitors by manipulating any *one* review company.

Figure 1: The function Φ

Perhaps an example would better highlight the upshot of this discussion. Consider an app, say App *i*. This app wishes to move up in the overall ranking by about *e* > 0 standard deviations, and for this, it is willing to bribe, say, the DE voters. Suppose App *i* is currently ranked above average (in standard deviation units) by DE, that is, ScoreDE(*i*) > 0. A quick calculation shows that this requires App *i* to raise its DE score from ScoreDE(*i*) to

$$\bigl(\sqrt{\operatorname{Score}_{\mathrm{DE}}(i)} + 3e\bigr)^{2}.$$

To make things more concrete, suppose App *i* wishes to increase its overall ranking by 0.3 (that is, *e* = 0.3). Given the (convex) payment scheme we suggest below, this is not unreasonable. (For, small upward shifts in the mid ranks make relatively small monetary returns.) Now, how many votes this app must purchase depends on its original DE score. If the original (normalized) score of App *i* from DE is, say, 0.2, then the formula above says that it would have to increase this to 1.81. (In the dry run of the algorithm, this would essentially mean App *i* receiving *all* of the DE votes.) And indeed, it becomes increasingly costly to manipulate the algorithm the higher the original score of App *i* from DE. If, for instance, ScoreDE(*i*) were 0.8, then App *i* would need to increase its DE score to 3.2 (in standard deviation units). This may be excessively costly.

It is easier to manipulate a score that is near the average. If, for instance, ScoreDE(*i*) = 0 and App *i* wants to increase its period 0 score by *e* = 0.3, then it is enough to increase ScoreDE(*i*) to 0.81 to achieve the desired manipulation. This still requires a large amount of vote buying. For a more modest target, say *e* = 0.1, it would be enough to increase ScoreDE(*i*) from 0 to 0.09 units. Thus small manipulations of average scores are more feasible, but they are also less profitable given the current payment scheme.
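The transformation and the cross-reviewer average can be sketched as follows. The choice θ = 1/2 matches the arithmetic of the manipulation example above (raising a DE score of 0.2 to about 1.81 buys an overall gain of 0.3), and `de_target` is the hypothetical target formula for the case where the manipulated DE score stays nonnegative:

```python
def phi(x, theta=0.5):
    """Sign-preserving power map: concave for x >= 0, convex for x < 0."""
    return x ** theta if x >= 0 else -((-x) ** theta)

def final_score(s_de, s_ph, s_ut, theta=0.5):
    """Period 0 score: average of the transformed reviewer scores."""
    return (phi(s_de, theta) + phi(s_ph, theta) + phi(s_ut, theta)) / 3

def de_target(s_de, e, theta=0.5):
    """DE score needed to lift the overall score by e, with the other
    two reviewer scores held fixed (valid for nonnegative targets)."""
    return (phi(s_de, theta) + 3 * e) ** (1 / theta)
```

With θ = 1/2, `de_target(0.2, 0.3)` is about 1.81 and `de_target(0.8, 0.3)` is about 3.22, reproducing the figures in the example.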

**2.2.2** **FUTURE PERIODS** (Period *m* ≥ 1)

**Step 1. Normalization of Scores**

**Step 1.a **[DE] Suppose DE reports in period *m* that App *i* has received at least one vote. We then compute the two DE indices exactly as described above (but, of course, this time using the period *m* data).

**Step 1.b **[PH] Converting the data into standard units (again, to ensure independence of units of measurement), we compute the two PH indices exactly as described above (but, of course, this time using the period *m* data). These are the normalized period *m* scores of App *i* from the PH team and community, respectively.

**Step 1.c **[UT] Again, we convert the data into standard units and find the normalized score for each *k* = 1, 2, 3, 4 (using the period *m* data). These are the normalized period *m* scores of App *i* from UT in each category *k*.

**Step 2. Extension of Normalized Scores to Incorporate Missing Data**

**Step 2.a **[DE] We define the extended “liked” index exactly as in period 0, using the period *m* dummy variable and normalized index.

A similar extension is not needed for the “market traction” index, because the problem of not “receiving any votes” is already accounted for in that index. Thus, we keep that index as is, except that we modify our notation for it as in period 0.

**Step 2.b **[PH] In the case of PH, the apps that are not reviewed in period *m* by either the team or the community will have missing data. To deal with this, we define the extended team and community indices exactly as in period 0, using the period *m* dummy variables and normalized scores.

**Step 2.c **[UT] We extend the four indices we defined for UT in Step 1 in precisely the same way to account for apps that have not been evaluated in period *m* in a given category. Put precisely, we define the extended index as in period 0, using the period *m* dummy variable and normalized score,

for each *k* = 1, 2, 3, 4.

**Step 3. Aggregation within the Reviewers**

Assuming we keep the same weights we suggested for period 0, we define the (normalized) period *m* scores that App *i* receives from DE, PH and UT exactly as in Steps 3.a through 3.c of period 0, now using the period *m* extended indices.

**Step 4****. Aggregation across the Reviewers**

The final period *m* score of App *i*, Score*m*(*i*), is obtained by applying the same transformation Φ and averaging across the three reviewers as in Step 4 of period 0,

assuming we use the same recommendation for the concavity parameter that was used in period 0.

**Step 5****. Aggregation over Time**

As apps are evaluated periodically, each will receive new scores in every review round. Developers will be improving their apps and new apps will appear from time to time, so the assessment of the reviewers is likely to change over time. To account for the fact that previous scores are informative and that current scores incorporate more recent evaluations by the reviewers, we propose a discounted aggregation of the scores over time.

Let *β* be a real number between 0 and 1 that represents the discount factor. (This factor is the last parameter to be chosen by Blockstack.) The score of App *i* at period *m* is Score*m*(*i*). The one-period discounted score of this app is, in turn, $\beta\operatorname{Score}_{m-1}(i)$, the two-period discounted score is $\beta^2\operatorname{Score}_{m-2}(i)$, and so on. Thus, the discounted total of all the scores that App *i* has received in periods 0 to *m* is obtained as

$$\sum_{k=0}^{m} \beta^{\,m-k}\operatorname{Score}_k(i).$$

Not to give an unfair advantage to apps that happen to have been in the market for a longer period of time than others, we normalize this total as

$$\operatorname{Total}_m(i) := \frac{\sum_{k=0}^{m} \beta^{\,m-k}\operatorname{Score}_k(i)}{\sum_{k=0}^{m} \beta^{\,m-k}}.$$

Thus Total*m*(*i*) is essentially a weighted average of all the scores that App *i* has received up to period *m*, namely Score0(*i*), ..., Score*m*(*i*), where the weight of Score*k*(*i*) is $\beta^{\,m-k}\big/\sum_{j=0}^{m}\beta^{\,j}$, *k* = 0, ..., *m*. Note that older scores are discounted more heavily and hence count less in the total score. In the initial round of evaluation, Total0(*i*) = Score0(*i*), and for any subsequent round *m*, the aggregate score can be easily updated from the previous round’s aggregate score by using the formula:

$$\operatorname{Total}_m(i) = \frac{\operatorname{Score}_m(i) + \beta\,(1+\beta+\cdots+\beta^{\,m-1})\operatorname{Total}_{m-1}(i)}{1+\beta+\cdots+\beta^{\,m}}.$$

That is, Total*m*(*i*) is a weighted average of the current score and the aggregate score obtained in the previous round. For example, if *β* = 0.8, then

$$\operatorname{Total}_m(i) = \frac{0.2\operatorname{Score}_m(i) + 0.8\,(1-0.8^{\,m})\operatorname{Total}_{m-1}(i)}{1-0.8^{\,m+1}}.$$
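The direct and recursive forms of the discounted total can be sketched side by side; they must agree, which makes for a convenient implementation check:

```python
def total_direct(scores, beta):
    """Discounted average of scores = [Score_0(i), ..., Score_m(i)]."""
    m = len(scores) - 1
    num = sum(beta ** (m - k) * s for k, s in enumerate(scores))
    den = sum(beta ** (m - k) for k in range(m + 1))
    return num / den

def total_update(prev_total, new_score, m, beta):
    """Round-m total from the round-(m-1) total, for m >= 1."""
    s_prev = sum(beta ** j for j in range(m))  # 1 + beta + ... + beta^(m-1)
    s_now = s_prev + beta ** m
    return (new_score + beta * s_prev * prev_total) / s_now
```

Running the recursion over a score history and comparing with the direct computation confirms the equivalence.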

**2.2.3** **New Apps**

In some periods new apps will enter the pool. It is important that the reviewers are made aware when a new app arrives, and that they make sure to include it in their evaluation process as soon as it arrives. This is crucial for the method that is applied for missing data pertaining to a new app.

Suppose App *i* arrives to the pool in period *m*. Its scores in period *m* are then computed as described in Section 2.2.1, with the proviso for missing data that applies to the initial period. (Note that the algorithm has been in action for *m* periods, but it is period 0 insofar as the new app is concerned.) Let Score*m*(*i*) be its corresponding aggregate score. Then initialize Total*m*(*i*) = Score*m*(*i*). So, when a new app appears in the pool, its first score is computed as the initial score for any other app, and its aggregate score over time is initialized with its first score.

**2.3** **Addendum: Dealing with Ties**

Especially in the early phases of the mechanism, when many apps have not been evaluated by some of the reviewers, it may happen that two or more apps would get the same final score, and hence tie for a position in the final ranking. In this case, it seems reasonable that the payments to the apps are equally distributed.

To be precise, suppose, in a given month, Blockstack decides to make a stream of payments to the top *k* apps, say, $p_1, ..., p_k$ (where we set $p_j := 0$ for *j* > *k*). Then, if $k_1$ many apps score the highest, each is paid

$$\frac{p_1 + \cdots + p_{k_1}}{k_1},$$

and if $k_1 < k$ and $k_2$ many of them score the second highest, each is paid

$$\frac{p_{k_1+1} + \cdots + p_{k_1+k_2}}{k_2},$$

and so on.

Note that the algorithm above can be set to be always implemented. If there are no ties, it simply pays the first *k* apps as desired. When there are ties, however, the algorithm guarantees that all apps at the same level receive the same amount, that apps at higher levels are paid more, and that the total amount paid does not change. What the presence of ties may introduce is that, in this case, *more* apps are paid, with the bottom ones being paid less. For example, suppose that the mechanism pays 100 to the first app, 90 to the second, 80 to the third, and nothing below. If there are no ties, three apps will be paid, for a total of 270. But now suppose that there are 4 apps tied for third place. In this case the first and second apps will be paid as before (100 and 90, respectively); but now the four apps tied at third place will share the prize designated for that position, getting 20 each. This means that more apps, six of them, are paid. In general, the presence of ties may imply that more apps are paid.
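The tie-handling rule can be sketched as follows: apps with equal final scores split the prizes designated for the positions they jointly occupy (positions beyond the prize list pay zero):

```python
def pay_with_ties(final_scores, prizes):
    """final_scores: {app: final score}; prizes: list for ranks 1..k.
    Returns {app: payment}, splitting prizes equally among tied apps."""
    payments = {}
    pos = 0  # next prize position to allocate (0-indexed)
    for score in sorted(set(final_scores.values()), reverse=True):
        tied = [a for a, s in final_scores.items() if s == score]
        slots = prizes[pos:pos + len(tied)]  # prizes these apps occupy
        share = sum(slots) / len(tied)       # empty slice -> share of 0
        for a in tied:
            payments[a] = share
        pos += len(tied)
    return payments
```

On the example in the text (prizes 100, 90, 80 and four apps tied for third), the tied apps get 20 each and the total paid remains 270.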

**2.4** **Addendum: The Payment Scheme**

We propose to use the following payment scheme. Blockstack selects a total budget *M*, a percentage *p*, and a maximum number $\bar n$ of paid apps (which could be infinity). Then, it pays *p* of *M* to the first app; *p* of the remainder to the second app; *p* of the remainder to the third app, etc. The *n*th app is thus paid $p(1-p)^{n-1}M$. This proceeds until app $\bar n$ is reached and/or there are no more apps. The remaining balance (which equals $(1-p)^{\bar n}M$, and is likely to be small) is then divided equally between the first $\bar n$ apps and added to their payments.

Note that the choice of *p* affects how different the payments are for each app. When *p* is high (say, 50%), payments sharply decline as we go down the ranking. When *p* is low (say, 1%), different apps are paid fairly similar amounts. The following two examples illustrate these points.

*Example 1. *Set *p* = 50%, *M* = 100, $\bar n$ = 10. In this case, the first app is paid 50, the second 25, the third 12.5, the fourth 6.25, etc. That is, payments decrease very quickly as the rank decreases. Note that the amount left after the first 10 apps are paid is very small (about 0.098 out of the original 100). This amount is then divided in 10 equal parts and added to the payments above.

*Example 2. *Consider instead the case in which *p* = 1%, *M* = 100, $\bar n$ = 10. Here, from the first step of the mechanism the first app is paid 1, the second 0.99, the third 0.9801, etc. But of course, after the first 10 apps are paid there is a large amount left: about 90.44. This is then divided equally between the 10 paid apps, leading to the first app being paid about 10.044, the second about 10.034, etc. Thus, payments are very similar for all paid apps.
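The scheme as defined (the *n*th app is paid $p(1-p)^{n-1}M$, with the remainder $(1-p)^{\bar n}M$ split equally among the paid apps) can be sketched as:

```python
def geometric_payments(p, M, nbar, n_apps):
    """p: fraction paid at each step; M: total budget; nbar: cap on the
    number of paid apps; n_apps: number of apps in the pool."""
    n_paid = min(nbar, n_apps)
    base = [p * (1 - p) ** (n - 1) * M for n in range(1, n_paid + 1)]
    leftover = M * (1 - p) ** n_paid  # M minus the sum of base payments
    bonus = leftover / n_paid         # split equally among the paid apps
    return [b + bonus for b in base]
```

By construction the payments always exhaust the budget *M* exactly; with *p* = 1%, *M* = 100 and ten paid apps, the top payment comes out to about 10.04.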

As the two examples above illustrate, the choice of *p* and $\bar n$ affects how many apps are paid, and how different their payments tend to be. Our recommendation is to set *p* = 20%, and choose $\bar n$ as the rank at which the payment falls to $1. (This number depends on *M*.)

There are four reasons for this recommendation. First, it gives relatively large incentives for apps to gain a higher ranking, as indeed the best apps are paid very high amounts. It may be argued that having some paid apps receive very large amounts may increase the visibility of the whole process. Second, this method entails that the largest gains are made by increasing an app’s ranking when it’s near the top. But, as we discussed in detail in Section 2.2.1 (Step 4), these are also the rankings that are harder to manipulate. Third, this method guarantees paying a large number of apps. This could be useful; although many apps receive small amounts, this may remind them of the incentive and induce them to try to improve their rank (through product enhancement). Fourth, amounts below $1 are negligible and create more annoyance than anything else, which motivates stopping payments at this level. Indeed, little would change if the minimum payment were instead set at $10.

Finally, note that setting *p* = 20% is in line with the comments received after the dry run of the algorithm. While many suggested smaller percentages, others suggested higher ones, and it seems to us that 20% may be a good compromise.(5)

**3** **Evaluating the Reviewers**

As the final order of business, we discuss in this section how to evaluate the app reviewers. We suggest two criteria for this evaluation.

(5) More complex payment schemes could of course be envisioned. For example, one can adopt varying percentages (as in paying $p_1 M$ to the best app, $p_2(1-p_1)M$ to the second best app, and so on, where $p_1, p_2, ...$ is a decreasing (or increasing) sequence). However, more complex schemes may be harder to communicate to the community.

**3.1** **Criterion A: Agreement with the Final Ranking**

A first possible criterion is to investigate whether a reviewer’s score is similar to, or different from, the final, aggregate one obtained after 12 months. There are two reasons to consider this final score as a benchmark. First, it is the ranking resulting from the most information (it aggregates a number of different reviews repeated over time) and may thus be considered the most accurate. Second, because these final scores will not be known until much later, it is harder for app reviewers to adapt to them beyond reporting their genuine evaluation of the app.

To construct the evaluation of app reviewers according to this criterion, let us denote by $s_r^m(i)$ the normalized score of reviewer *r* for App *i* in round *m*, and consider the final, global ranking after 12 months, namely, Total12(*i*). Then, we would define the score of reviewer *r* according to criterion A as

$$\operatorname{RevScoreA}(r) := -\sum_{i}\sum_{m} \bigl(s_r^m(i) - \operatorname{Total}_{12}(i)\bigr)^{2}.$$

Note that these reviewer scores are always negative, and the better reviewers will have scores closer to zero.

There are two limitations of this criterion. First, because it punishes the variance with respect to a final score, it punishes reviewers with scores that are highly variable over time. For example, if the final score of an app is 1, a reviewer that reports 1 all the time has a much higher RevScoreA than a reviewer whose scores for that app alternate between 0.8 and 1.2.

The second limitation is that it leads reviewers to want to avoid having a highly unusual scoring rule. For example, suppose that two app reviewers follow methodologies that lead them to be largely in agreement, while a third one tends to give uncorrelated scores. Then, the last reviewer is more likely to receive a lower RevScoreA, unless his method is better at predicting the final score.
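A sketch of Criterion A, under the assumption that the reviewer score is minus the sum of squared deviations between the reviewer's per-period scores and the final 12-month total (the report's exact formula is reconstructed, not quoted):

```python
def rev_score_a(reviewer_scores, totals):
    """reviewer_scores: {app: [per-period normalized scores]};
    totals: {app: Total_12(app)}. Larger (closer to 0) is better."""
    return -sum(
        (s - totals[i]) ** 2
        for i, series in reviewer_scores.items()
        for s in series
    )
```

This reproduces the first limitation noted above: for a final score of 1, a reviewer who always reports 1 scores 0, while one alternating between 0.8 and 1.2 scores −0.16 over four rounds.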

**3.2** **Criterion B: Agreement with Objective Criteria**

A second measure with which reviewers may be evaluated is by comparing their scores with external, objective criteria. For example, it could be possible to compute a score of different apps based on external financing, or based on the number of users. One can then compute whether each reviewer’s score in each period agrees, or not, with this final objective ranking.

To this end, consider some objective criterion, and denote by Score*O*(*i*) the normalized score of each app according to this criterion *O*.(6) Then, construct the score of reviewer *r* according to criterion B, RevScoreB(*r*), as

$$\operatorname{RevScoreB}(r) := -\sum_{i}\sum_{m} \bigl(s_r^m(i) - \operatorname{Score}_{O}(i)\bigr)^{2},$$

where $s_r^m(i)$ is the normalized score of reviewer *r* for App *i* in round *m*.

As in the case of Criterion A, all review scores are negative here and the better reviewers will have scores closer to zero. Also like Criterion A, this criterion punishes reviewers with scores that are highly variable over time. On the other hand, reviewers have an incentive to best predict not only which app will be financed the most, but also which ones will be financed the least, and indeed the whole distribution.

(6) The normalized score is obtained following the same procedure as in Step 1 of our algorithm. For example, if financing is the objective criterion and if $F_i$ is the total financing of app *i*, then $\operatorname{Score}_O(i) := \bigl(F_i - \operatorname{Avg}(F_j)\bigr)\big/\operatorname{Std}(F_j)$.

**3.3** **Criterion C: Spotting Top Apps**

A third criterion to evaluate reviewers is their ability to spot great apps early. This could be a particularly desirable feature, as it may be especially important for apps with great potential to receive funds early, so that they reduce their risk of disappearing and continue to grow.

Consider the final, global ranking of App *i *after 12 months: Total12(*i*). Now consider the apps that are in the top 10% with respect to these total scores; let *T *denote the set of all such apps. We want to give credit to a reviewer that gives a high score to apps in this group earlier on.

Define $T_r^m$ as the set of all apps in the top 10% of the score of reviewer *r* in round *m*. Then, define

$$\operatorname{RevScoreC}(r) := \sum_{m} \bigl|\,T \cap T_r^m\,\bigr|.$$

This index simply counts the number of times that the “final best apps” (those in *T*) are judged in the top 10% by reviewer *r*.

Note that reviewers with scores that fluctuate a lot over time may (although need not) have an advantage according to this criterion. Note also that this measure does not depend on the actual score that apps are given either by the reviewer or in the final ranking: all that matters is the ability to identify top performers, even if their relative rank differs from the final one. This is markedly different from the approach used in the other two criteria above.
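The counting rule behind Criterion C can be sketched in a few lines; the set names are illustrative stand-ins:

```python
def rev_score_c(top_final, reviewer_top_by_round):
    """top_final: set T of eventual top-10% apps;
    reviewer_top_by_round: list of sets, one per round, holding the
    reviewer's own top-10% picks. Counts overlaps across rounds."""
    return sum(len(top_final & picks) for picks in reviewer_top_by_round)
```

For instance, with T = {a, b} and round picks {a}, {a, c}, {b, a}, the reviewer earns 1 + 1 + 2 = 4 points.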

Finally, let us note that this criterion can be applied to the whole set of apps, or it can be run category by category. The latter method could be particularly useful in that it may incentivize reviewers to spot the good apps even in categories that are not too popular, thereby ensuring that the available apps are consistently good across the board.

**3.4** **Our Recommendation**

All three criteria above have advantages and disadvantages. We believe the community should be given the outcome of the rankings according to all three criteria — the third one being run both globally and by broad categories — in order to make their evaluations.

**APPENDIX**

Here we outline our aggregation procedure for an arbitrarily fixed App *i* in bare terms (with the parameters set to the ones that are recommended in Section 2). We only present here the period 0 algorithm, as the later period procedures are easily deduced from this (as described in Section 2.2.2).

**Step 1.** Compute the normalized indices of Step 1 in Section 2.2.1: the DE “liked” and “traction” indices, the PH team and community z-scores, and the four UT category z-scores.

**Step 2.** Compute the extended indices of Step 2 in Section 2.2.1, assigning –1 for each missing score.
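For the programmers, the whole period 0 pipeline can be consolidated into one sketch. This is an illustration, not the definitive implementation: the argument names are stand-ins for the raw inputs, missing-data handling (Step 2) is omitted for brevity, the PH and UT weights are the equal weights suggested in Section 2, θ is set to 1/2, and the DE weight `alpha_de` is left as a parameter:

```python
def normalize(xs):
    """z-scores with the population standard deviation (assumes sd > 0)."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def phi(x, theta=0.5):
    """Sign-preserving power map used in Step 4."""
    return x ** theta if x >= 0 else -((-x) ** theta)

def period0_scores(up, down, ph_team, ph_com, ut_cats, alpha_de=0.5):
    """up, down: DE vote counts per app; ph_team, ph_com: raw PH scores;
    ut_cats: four lists of raw UT category scores. Returns Score_0 per app."""
    liked = normalize([(u - d) / (u + d) for u, d in zip(up, down)])
    traction = normalize([u + d for u, d in zip(up, down)])
    team, com = normalize(ph_team), normalize(ph_com)
    cats = [normalize(c) for c in ut_cats]

    scores = []
    for i in range(len(up)):
        s_de = alpha_de * liked[i] + (1 - alpha_de) * traction[i]
        s_ph = 0.5 * team[i] + 0.5 * com[i]
        s_ut = sum(c[i] for c in cats) / 4
        scores.append((phi(s_de) + phi(s_ph) + phi(s_ut)) / 3)
    return scores
```

On any pool where one app dominates all raw inputs and another trails in all of them, the resulting period 0 scores preserve that ordering.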