Package weights

One of the common questions we’re seeing is about package weights, and I thought it’d be worth starting a thread for ideas and feelings on this.

A reminder that the “weight” is one of the factors in our payment formula, which is summarized here: https://tidelift.com/docs/lifting/paying

The overall effect of the formula is that both small, popular packages and large, niche packages can earn money on Tidelift; large, popular packages of course earn the most money.

I’m feeling great about the “usage” part of the formula - this is the royalty-like aspect: if a project builds and maintains a package that subscribers actually use, it gets paid in proportion to that success. But if I just upload some code to npm and nobody but me uses it, I get nothing.

The “weight” part is trickier. I thought I’d give some behind-the-scenes on what we’re doing right now and why, and leave this thread open a while for feedback and ideas.

Short story

There are some downsides to weighting by code size, but on balance we think it’s better than equal-weighting - a 2-line function and a huge framework simply shouldn’t get the same portion of fees, because they aren’t the same amount of work and don’t create the same impact. We are mitigating the downsides in these ways:

  • The formula also incorporates usage; if you make a package that’s full of junky code, someone can make a forked version without the junk and take away your usage.
  • We adjust the code size measurement to remove things like generated code, vendored code, boilerplate, and tests.
  • We adjust the code size measurement to make different languages comparable.
  • We will actively shut down any identifiable attempts to game this, and since gaming it hurts other maintainers in your ecosystem, we expect that gaming attempts are likely to be reported. Also, we think most OSS maintainers are ethical.
  • We hope to incorporate some other measures into the weight over time.

Any way we split things up will be a little arbitrary and a little game-able.

We think “adjusted code size” has some virtues, such as being fairly objective and having some relationship to maintenance effort. It’s better in our minds than equal weighting for those reasons.

But better doesn’t mean perfect; very open to suggestions on how we should evolve the weight factor or what else should go into it.

Longer story

Rejected approach: equal weighting

The simplest approach to weight is equal-weighting (which is the same as “don’t have a weight factor, only consider usage”). The issues with this include:

  • a very strong incentive to split up a package into many small packages
  • the intuition that a huge framework and a 2-line function are not the same amount of maintenance effort or the same amount of value-to-subscribers

The issue with bad incentives isn’t only that people might game Tidelift on purpose; it’s also that, in the wild, people have already split some things up and not others, for technical or practical reasons.

Rejected approach: size of the entire package

The next-simplest approach we came up with was “the size of the package” (as in “make an HTTP HEAD request to the package’s download URL and read the Content-Length”). The issues with this include:

  • some packages include N copies of their code, like a regular version, minified version, minified-a-different-way version, etc.
  • some packages include various data files, test files, vendored code, autogenerated code, etc.
  • sensitive to level of gzip or zip applied to the package
  • includes package manager metadata, which means splitting a package up would still yield a slight gain

I actually tried this since it seemed like the simplest thing that could possibly work, and it did not work. The results were not good. For example, I was surprised how many packages ship their unit tests, including massive fixtures sometimes, right in the released package!
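For the curious, the naive measurement was essentially this (a minimal sketch with a placeholder URL, not our actual code):

```python
import requests

def naive_package_weight(download_url: str) -> int:
    """Naive weight: the compressed size of the published artifact, as reported
    by the registry's Content-Length header - including metadata, tests,
    minified copies, vendored code, and everything else in the tarball."""
    response = requests.head(download_url, allow_redirects=True)
    response.raise_for_status()
    return int(response.headers["Content-Length"])

# Hypothetical usage; any registry download URL would do:
# naive_package_weight("https://registry.npmjs.org/left-pad/-/left-pad-1.3.0.tgz")
```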

Current approach: adjusted code size

Getting a little more complex is an “adjusted code size.” What this means is that we unpack the package (removing compression), filter out files that aren’t code, filter out various kinds of code that shouldn’t count (like vendored dependencies), and then add up the sizes of the remaining code files.

This is what we’re doing now, and the results feel pretty good; the packages at the top of the weightings are substantive packages with a lot of maintenance work going into them. Splitting a package up into smaller packages ought to have no effect; the total weight would remain the same.

We also do some normalization by ecosystem to make npm and Java more comparable (since many subscribers are using multiple ecosystems).
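To make that a bit more concrete, here’s a rough sketch of the kind of pipeline described above. The file extensions, excluded directories, normalization factors, and function names are illustrative assumptions, not our actual production rules:

```python
import tarfile
from pathlib import PurePosixPath

# Illustrative-only filters; the real rules are more involved.
CODE_EXTENSIONS = {".js", ".ts", ".py", ".rb", ".java"}
EXCLUDED_DIRS = {"test", "tests", "spec", "fixtures", "vendor", "node_modules", "dist"}

# Hypothetical per-ecosystem factors so that, e.g., verbose Java and terse
# JavaScript land on a roughly comparable scale.
ECOSYSTEM_NORMALIZATION = {"npm": 1.0, "maven": 0.6, "pypi": 0.9}

def adjusted_code_size(package_tarball: str, ecosystem: str) -> float:
    """Unpack the package, drop non-code and shouldn't-count code, sum the rest."""
    total_bytes = 0
    with tarfile.open(package_tarball) as archive:  # decompression happens here
        for member in archive.getmembers():
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            if path.suffix not in CODE_EXTENSIONS:
                continue  # data files, docs, images, ...
            if any(part.lower() in EXCLUDED_DIRS for part in path.parts):
                continue  # tests, vendored deps, generated bundles, ...
            total_bytes += member.size
    return total_bytes * ECOSYSTEM_NORMALIZATION.get(ecosystem, 1.0)
```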

Future ideas

Conceptually, the weight indicates how much relative value each package provides to a single subscriber. Usage then considers how many subscribers are receiving that value.
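As a toy illustration of that split (made-up numbers, and ignoring everything else in the real formula), the interaction looks roughly like weight × usage:

```python
# Made-up numbers, purely to show how weight and usage interact.
packages = {
    "big-framework":  {"weight": 50.0, "subscribers_using": 10},
    "tiny-but-vital": {"weight": 1.0,  "subscribers_using": 400},
    "unused-upload":  {"weight": 5.0,  "subscribers_using": 0},
}
scores = {name: p["weight"] * p["subscribers_using"] for name, p in packages.items()}
total = sum(scores.values())
shares = {name: score / total for name, score in scores.items()}
# The unused upload earns nothing; the large niche framework and the small
# widely used package both earn a meaningful share.
```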

We could incorporate some signal from subscribers of “how much I care about this package” into the weight number. I don’t think it’s a good idea to let subscribers completely pick-and-choose which packages get what, because if they get no value from something, why are they using it? We want to lift all boats. Also, no subscriber wants to micromanage weights on 3000 packages.

But perhaps there are ways for them to say “I really really care about package xyz” and factor that in, and we’ve heard the desire to do so from them.

I tend to think we should avoid anything in the weight that’s redundant with usage - for example, download counts, GitHub stars, or other popularity measures. I’d expect these to correlate with usage, so pulling them in might double-count the same factor.

If people do start to game things, it could work out a lot like Google’s search algorithm or spam filtering algorithms, where we keep having to adapt. However, the absolute numbers of “packages people actually use” are a lot lower than the number of web pages or spam emails on the Internet, so manual-intervention solutions are more practical. We also have a business relationship and contract with all lifters, which helps.

By the way: if we do change the weighting algorithm in the future, mitigating the impact on lifters will be an important consideration. There are several ways to do that so we don’t pull the rug out from under anyone.

Feelings and ideas welcome

We can definitely make changes and evolve things from here.

8 Likes

I completely understand the trickiness of figuring this out, and it sounds like you have put good thought and experiments into it. You mentioned “how much I care about this package” as a factor. Another way to look at it is that some packages “punch above their weight.” That is, they provide outsized value compared to their bulk. Perhaps if the subscribers had some way to indicate that?

I have no idea how engaged the subscribers are, if they care about this formula, or if they even have a sense of the size of packages. You’d probably have to show them a list, with the “value assumed based on size”, so that they could gauge how well it matches their perception. But as I type that out, I can guess that no one will put that much effort into it 🙂

3 Likes

The simplest idea I’ve come up with is to ask subscribers: “please tell us which dependencies are most critical to you” and allow subscribers to put a check by some of their deps. I think we’d want to user test this some and see if it makes sense to anyone.

I’d love to have some more automated way to get at this without someone having to do work - but it’s tough to think of what it’d be.

I would have chosen a type-weighting approach.

You would have different tiers:

  0. programming language / compiler
  1. framework
  2. library
  3. polyfill
  4. helper function

I mainly provide 2, 3 and 4 and would completely understand if the maintainers in the top tiers (0 and 1) get more.
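Roughly, I imagine it as an extra multiplier on top of the existing weight (the numbers here are placeholders; picking fair values would be the hard part):

```python
# Placeholder multipliers for the tiers above; the actual values would need debate.
TIER_MULTIPLIER = {
    "programming language / compiler": 2.0,  # tier 0
    "framework": 1.5,                        # tier 1
    "library": 1.0,                          # tier 2
    "polyfill": 0.8,                         # tier 3
    "helper function": 0.5,                  # tier 4
}

def tier_adjusted_weight(base_weight: float, tier: str) -> float:
    return base_weight * TIER_MULTIPLIER[tier]
```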

Interesting! That’s a new idea. I can think of some challenging aspects of it. But there’s intuitive appeal that the things in the higher tiers feel more “central,” so this could be a useful factor.

I’m not sure tiers will add much to the scheme you already have. A programming language should only get more funding if it is widely used, and you’ve already got the breadth of use as a factor. Should a little-used esoteric programming language get funding just because it’s a programming language?

I think the number of users and the size of the project will already pay programming languages well without adding an extra factor based on the kind of software it is.

I am talking about the effort-to-reward ratio here: i.e. whether the programming language is popular doesn’t matter to me in that regard. There are frameworks based on a language making more money than the language’s creator; how is that fair?

If the value is calculated as “size of code” * “number of users”, then a language should make more than any framework based on it, no?

Or perhaps the thing to acknowledge here is that FrameworkX uses LanguageX, so some of the calculated value for FrameworkX should be re-directed to LanguageX?

So untested code has the same “weight” as code with tests (that have to be written and maintained)?

@mbrookes It seems like you’re drawing a conclusion from something someone here has said, but I’m not sure from what.

Remember that usage is also a factor (and hopefully, over time, a referendum on quality and utility). My hope/belief is that many activities such as documentation, testing, patch review, etc. are reflected in increased usage.

A practical reason to exclude tests is that we’re weighing the shipped package, not the source repo. (Many repos publish multiple packages.)

Most packages don’t ship the tests; a surprising number I’ve looked at do, but it kinda seems like they do it accidentally (the consumer of the package afaik doesn’t import or use the tests).

That’s why I’ve currently adjusted the tests out of those packages that seem to ship them accidentally, so that weights are comparable between packages that do and don’t bundle their tests.

so some of the calculated value for FrameworkX should be re-directed to LanguageX?

It should if it’s accepting donations, yes.

Package weight is a subjective measure that depends on the value of the package in a specific project. For cases where people can see that value, this personal measure should, in my opinion, be a priority. Every person on a project may have a package that he or she likes and wants to support, and people need to be given this ability. This would make Tidelift a more social story.

This almost fits the Future ideas section if “micromanagement” is replaced with “let people play”. These could be your personal preferences, but if a company hires you, it imports your settings and may or may not distribute its weights accordingly. This makes you a valuable artifact in your community regardless of whether you can participate or are busy with a new job.

Speaking of gameplay: in agile development people play with story points, so let them play with infrastructure points to estimate the impact.

A subjective feeling of a project’s well-being is important as well. If one project has sufficient funds, then another might need them more, but that’s a source of speculation, so unless people know each other personally, it may not work out well.

1 Like

Something I’d emphasize as background that relates to some of this discussion is that we are paying lifters to do work and to be on the hook for certain things; Tidelift is not a donation system - it’s a system for maintainers to collaboratively provide valuable benefits to subscribers. (See also Product roadmap snapshot as of January 2019 and https://tidelift.com/docs/lifting/tasks-overview )

The way we frame it currently is that subscribers are covered for all packages they report that they use (see https://tidelift.com/subscription/support ), so if they’re reporting it, they are getting the subscription benefits.

A related point is that subscription benefits and paying lifters are linked. So for example we don’t have a way to sign up to lift C packages right now, but we also don’t have a way for subscribers to get subscription benefits on C packages. In the current model, we’d want to add both of those at once.

This would be great, for both subscribers and lifters.

As a subscriber, I think I’d like something very simple like:

Pick 3 packages that, for any good reason, deserve an extra payout.

Let’s say 5% of Tidelift’s total subscription earnings (after Tidelift’s cut) every month will be paid out to the most highly favored packages in the open source ecosystem.

A subscriber’s 3 picks are essentially votes that decide the cut a given package gets from this bonus-payout, if any (there would obviously need to be a minimum). There could also be a cap on the max payout, so that even if a certain package got 46% of all votes, it can at most get a 10% piece of the bonus-payout.

Votes should be transparent ahead of payout. Not who (the companies) voted on what, but the number of votes different packages have.

As a lifter, this gives me an incentive to optimize for value in my package(s), as well as advertising my presence on Tidelift.
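A back-of-the-envelope sketch of how that split could work, with made-up vote counts, a hypothetical minimum, and the 10% cap (what happens to the capped surplus is left open here):

```python
# Hypothetical monthly tally: each subscriber picked up to 3 packages.
votes = {
    "package-a": 46,  # 46% of the votes, but capped at 10% of the pool
    "package-b": 9, "package-c": 8, "package-d": 7, "package-e": 6,
    "package-f": 6, "package-g": 6, "package-h": 5, "package-i": 5,
    "package-j": 1, "package-k": 1,  # below the minimum, so no bonus
}
bonus_pool = 0.05 * 100_000  # 5% of a made-up $100k in monthly earnings
MIN_VOTES = 3                # minimum votes needed to qualify at all
MAX_SHARE = 0.10             # no single package takes more than 10% of the pool

qualifying_votes = sum(v for v in votes.values() if v >= MIN_VOTES)
payouts = {
    package: min(count / qualifying_votes, MAX_SHARE) * bonus_pool
    for package, count in votes.items()
    if count >= MIN_VOTES
}
# package-a hits the cap; package-j and package-k miss the minimum cutoff.
```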

I wonder how many subscribers are genuinely aware of how important all their dependencies are? Lots of large code bases have components that are used but that the subscriber never actually sees; they are important and need supporting just as much as the components they interface with directly.

1 Like

It’d probably also be good for maintainers to be able to weight their own code versus their third-party dependencies on a given package - on some packages, my transitive deps don’t really matter; on others, they’re the bulk of the work and should get the bulk of the funds.

I’m not sure that we should reward a package based on its size, for a few reasons:

  • It does not reflect how useful the package is
  • I could just avoid minifying my code

About this last point, how do you handle it? Would it make any difference?

I minify all my projects with Webpack+Babel to release a package with three versions:

  • A “browser”: minified, bundled with all dependencies and polyfills for the browser [umd]
  • A “module”: minified, non-bundled, with the few polyfills that are missing for node >= 12 [esm]
  • A “main”: minified, non-bundled, with the polyfills that are missing for node >= 8 [cjs]

I would like to see an analysis of different (crossed) metrics like stars on GitHub, number of downloads, forks, and dependents, which potentially mean more customers for Tidelift. I prefer the point-of-interest-metric approach because it would directly benefit good project ideas.

Test code should count toward weight, because the more thorough the test suite, the more work tends to have been put into both building the test suite and fixing the bugs it caught.

1 Like

Test code should count toward weight

This is a valuable point that I mostly agree with, but am only starting to think through the implications of.

Based on the metrics I’ve recorded on projects I’ve worked on, a well-tested project will commonly consist of 75% or 80% tests (by line count). So if we did this, well-tested projects would be weighted four or five times higher than untested projects.

Although I am a huge fan of hardcore testing and TDD, to me this seems unfair. A project might, in theory, use no tests because it has some other (hypothetical) means of maintaining good fitness for use, internal design, maintainability, and quality.

If the untested project were not as good as a well-tested project, then users of that project should be the ones to judge, by selecting among competitors.

In essence, I’m saying that a poorly tested project, which will presumably have less useful functionality, be harder to maintain, and have more bugs, will be weighted down automatically by being used by fewer subscribers. We don’t need to additionally weight it down by code size.

On the other hand, if an untested project manages to somehow still provide useful functionality, be maintainable and responsive to new requirements, and have few bugs, then it deserves a full share rather than being artificially penalized for the methods it used to get there.

(In practice, I think it’s unlikely an untested project would be able to do this. As an industry we haven’t found a practice that is as good as good tests for these purposes. But the above seems right to me in principle. Even though I’m a test zealot.)