Star Forensic

Methodology

This tool reports a compact set of measurable star-behavior signals from sampled stargazer metadata. It does not produce accusations or binary judgments.

1) What we measure

New-account ratio. We calculate the share of sampled stargazers whose account age at the time of starring is under 60 days.

newAccountRatio = (count(starredAt - accountCreatedAt < 60 days) / sampleSize) * 100

Zero-activity ratio. We calculate the share of sampled stargazers who have both followers.totalCount = 0 and repositories.totalCount = 0.

zeroActivityRatio = (count(followers = 0 AND repositories = 0) / sampleSize) * 100
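The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the scanner's actual code: the Stargazer record and its field names are hypothetical stand-ins for the sampled metadata, and returning 0.0 for an empty sample is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

NEW_ACCOUNT_WINDOW = timedelta(days=60)


@dataclass
class Stargazer:
    # Hypothetical record; field names are illustrative only.
    starred_at: datetime
    account_created_at: datetime
    followers: int
    repositories: int


def new_account_ratio(sample):
    """Percentage of stargazers whose account was under 60 days old when they starred."""
    if not sample:
        return 0.0  # assumption: an empty sample reports 0 rather than raising
    hits = sum(
        1 for s in sample
        if s.starred_at - s.account_created_at < NEW_ACCOUNT_WINDOW
    )
    return hits / len(sample) * 100


def zero_activity_ratio(sample):
    """Percentage of stargazers with zero followers AND zero repositories."""
    if not sample:
        return 0.0
    hits = sum(1 for s in sample if s.followers == 0 and s.repositories == 0)
    return hits / len(sample) * 100


def _utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


sample = [
    Stargazer(_utc(2025, 3, 1), _utc(2025, 2, 20), 0, 0),   # 9 days old, no activity
    Stargazer(_utc(2025, 3, 1), _utc(2019, 6, 1), 12, 30),  # established account
    Stargazer(_utc(2025, 3, 2), _utc(2025, 1, 15), 0, 3),   # 46 days old, has repos
    Stargazer(_utc(2025, 3, 2), _utc(2021, 1, 1), 0, 0),    # old account, no activity
]

print(new_account_ratio(sample))    # 50.0
print(zero_activity_ratio(sample))  # 50.0
```

Note that the two signals are independent: in this toy sample only the first account counts toward both.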

Velocity spikes. We bucket sampled stars into UTC hourly bins, then compute z-scores for each bucket relative to the sample's hourly distribution.

z = (hourCount - mean(hourCounts)) / stddev(hourCounts), flag when z > 3
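As a rough sketch of the spike check, the Python below bins timestamps by UTC hour and flags buckets with z > 3. Two details are assumptions, since the formula above does not settle them: only hours containing at least one sampled star are counted as buckets, and the population standard deviation is used. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev

Z_THRESHOLD = 3.0


def hourly_counts(starred_at):
    """Bucket timezone-aware timestamps into UTC hourly bins, keyed by hour start."""
    return Counter(
        ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        for ts in starred_at
    )


def velocity_spikes(starred_at, threshold=Z_THRESHOLD):
    """Return (hour, z) pairs for hourly buckets whose z-score exceeds the threshold."""
    counts = hourly_counts(starred_at)
    values = list(counts.values())
    if len(values) < 2:
        return []  # not enough buckets to describe a distribution
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # every bucket is identical, so nothing stands out
    spikes = []
    for hour, count in counts.items():
        z = (count - mu) / sigma
        if z > threshold:
            spikes.append((hour, z))
    return sorted(spikes)


# Twenty quiet hours with one star each, then thirty stars inside a single hour.
base = datetime(2025, 3, 1, tzinfo=timezone.utc)
stars = [base + timedelta(hours=h) for h in range(20)]
stars += [base + timedelta(hours=48, minutes=m) for m in range(30)]

for hour, z in velocity_spikes(stars):
    print(hour.isoformat(), round(z, 2))
```

With a single outlier among otherwise equal buckets, z cannot exceed the square root of (bucket count - 1), so under these assumptions a sample spread over fewer than 11 active hours can never cross the z > 3 line.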

2) Where baselines come from

Baseline thresholds are anchored to two external sources: the published Awesome Agents investigation and the findings discussed in the CMU ICSE 2026 study, "Six Million Suspected Fake Stars on GitHub". We present thresholds as context markers, not universal truths, because repository categories and growth dynamics vary.

3) Sample size: 200 vs 1,000

To balance GitHub API rate limits and signal stability, our free public scanner uses a 200-account sample. Specifically, we fetch the first 100 stargazers (to capture initial launch dynamics) and the most recent 100 stargazers (to capture current momentum).
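One practical detail of this two-window design is that the windows overlap for repositories with fewer than 200 stars. The sketch below shows one way the two pages could be combined. It assumes duplicates are dropped by login, which this page does not specify; build_sample and the record shape are hypothetical.

```python
def build_sample(first_page, recent_page):
    """Combine the earliest and most recent stargazers, keeping each login once.

    Order is preserved: earliest stargazers first, then any recent ones
    that were not already seen.
    """
    seen = set()
    sample = []
    for record in list(first_page) + list(recent_page):
        login = record["login"]
        if login not in seen:
            seen.add(login)
            sample.append(record)
    return sample


# A small repo where the two windows overlap on "b" and "c".
first_page = [{"login": "a"}, {"login": "b"}, {"login": "c"}]
recent_page = [{"login": "b"}, {"login": "c"}, {"login": "d"}]

sample = build_sample(first_page, recent_page)
print([r["login"] for r in sample])  # ['a', 'b', 'c', 'd']
```

Under this assumption the effective sampleSize for a small repository is smaller than 200, and the ratio denominators should use the deduplicated count.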

A 200-account sample is usually enough to tell whether a repo looks typical or visibly unusual on basic metrics. When an anomaly rate is above 20%, a sample of this size is large enough to indicate the direction of the signal, though not its exact magnitude. Larger samples improve precision, which is why our $9 Full Report expands the audit to a 1,000-account sample and evaluates deeper metadata (such as fork-to-star ratios and 1-year commit histories).
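To give a feel for the precision difference between the two sample sizes, here is a back-of-the-envelope calculation using the normal-approximation margin of error for a proportion. It assumes a simple random sample, which the first-100 plus most-recent-100 design is not, so the figures are a rough guide rather than a guarantee. The function name is hypothetical.

```python
from math import sqrt


def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% interval for a proportion, in percentage points."""
    return z * sqrt(p * (1 - p) / n) * 100


for n in (200, 1000):
    print(n, round(margin_of_error(0.20, n), 1))
# 200 5.5
# 1000 2.5
```

Under that assumption, an observed 20% rate is roughly 20% plus or minus 5.5 points at 200 accounts and plus or minus 2.5 points at 1,000.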

4) Known limitations

5) Why no verdict

We intentionally avoid labels like "fake" or "real" for a repository. Signals can have many causes: legitimate launch spikes, influencer mentions, classroom cohorts, language-community concentration, or paid manipulation campaigns. This product is designed to expose measurable patterns so maintainers, investors, and users can apply their own context before drawing conclusions.