Methodology
This tool reports a compact set of measurable star-behavior signals from sampled stargazer metadata. It does not produce accusations or binary judgments.
1) What we measure
New-account ratio. We calculate the share of sampled stargazers whose account age at the time of starring is under 60 days.
newAccountRatio = (count(starredAt - accountCreatedAt < 60 days) / sampleSize) * 100
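The formula above can be sketched as a small Python helper. The record shape (a list of dicts with ISO-8601 `starredAt` and `accountCreatedAt` fields) is an assumption for illustration, not the tool's actual data model:

```python
from datetime import datetime, timedelta

def new_account_ratio(stargazers, threshold_days=60):
    """Share (%) of sampled stargazers whose account was under
    `threshold_days` old at the moment they starred.
    Assumes each record carries ISO-8601 `starredAt` and
    `accountCreatedAt` strings (hypothetical field names)."""
    threshold = timedelta(days=threshold_days)
    young = sum(
        1 for s in stargazers
        if datetime.fromisoformat(s["starredAt"])
           - datetime.fromisoformat(s["accountCreatedAt"]) < threshold
    )
    return young / len(stargazers) * 100
```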
Zero-activity ratio. We calculate the share of sampled stargazers who have both followers.totalCount = 0 and repositories.totalCount = 0.
zeroActivityRatio = (count(followers = 0 AND repositories = 0) / sampleSize) * 100
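A minimal sketch of the zero-activity check, assuming the two GraphQL counts have been flattened into plain `followers` and `repositories` integers per record (an assumption about the sampled shape):

```python
def zero_activity_ratio(stargazers):
    """Share (%) of sampled accounts with zero followers AND zero
    repositories. Field names mirror the GraphQL counts referenced
    above, but this flattened record shape is an assumption."""
    idle = sum(
        1 for s in stargazers
        if s["followers"] == 0 and s["repositories"] == 0
    )
    return idle / len(stargazers) * 100
```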
Velocity spikes. We bucket sampled stars into UTC hourly bins, then compute z-scores for each bucket relative to the sample's hourly distribution.
z = (hourCount - mean(hourCounts)) / stddev(hourCounts), flag when z > 3
2) Where baselines come from
Baseline thresholds are anchored to published external analysis from the Dagster/Awesome Agents report and to findings discussed in the CMU ICSE 2026 study, "Six Million Suspected Fake Stars on GitHub." We present thresholds as context markers (not universal truths), because repository categories and growth dynamics vary.
3) Sample size and why 300
A 300-account sample is a practical balance between speed and signal stability. Under a normal approximation, an observed anomaly rate of 20% carries a 95% margin of error of about ±4.5 percentage points at n = 300 (rising to about ±5.7 points at 50%), which is tight enough for directional detection while keeping query costs bounded. Larger samples improve precision, but 300 is usually enough to tell whether a repository looks typical or visibly unusual on these specific metrics.
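The confidence claim above can be sanity-checked with the standard normal-approximation margin of error for a proportion, z·sqrt(p(1−p)/n). This is a back-of-envelope check, not part of the tool itself:

```python
import math

def margin_of_error(p, n=300, z=1.96):
    """Approximate 95% margin of error, in percentage points, for an
    observed proportion p at sample size n (normal approximation;
    z = 1.96 corresponds to 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n) * 100
```

For example, an observed 20% anomaly rate at n = 300 comes with roughly a ±4.5-point margin, and the margin is widest at p = 0.5.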
4) Known limitations
- Cannot distinguish bot campaigns from legitimate viral events.
- Cannot detect sophisticated bots using aged accounts.
- Regional developer communities (e.g., some accounts in emerging markets) may have low-activity patterns that mimic bots.
- Results reflect the sample, not the entire stargazer population.
5) Why no verdict
We intentionally avoid labels like "fake" or "real" for a repository. Signals can have many causes: legitimate launch spikes, influencer mentions, classroom cohorts, language-community concentration, or paid manipulation campaigns. This product is designed to expose measurable patterns so maintainers, investors, and users can apply their own context before drawing conclusions.