How to Scrape Reddit Without the Official API in 2026

by Simon Balfe

Reddit’s free API days ended in 2023. The official API still exists, it is still technically free for non-commercial use, and it is still the right starting point for a personal project. But for anything commercial, at scale, or plugged into an AI agent that needs to read 50 subreddits on a schedule, you will run into the same three problems developers have been running into since the pricing change:

  1. The free tier rate limits are aggressive (60 requests per minute authenticated).
  2. Commercial use triggers a $0.24 per 1,000 calls pricing tier that stacks up fast.
  3. OAuth setup, app review, and user agent policing all add friction for what should be a simple HTTP fetch.

This guide covers the three techniques developers actually use to get Reddit data without jumping through those hoops, ranked from easiest to most robust, with runnable Python for each. At the end I will explain when it makes sense to stop writing your own scraper and hand the problem to a data API instead.

Method 1: The .json endpoint trick

The single most underused feature of Reddit is that every page responds to a .json suffix with structured JSON. No API key. No OAuth. No app registration. Append .json to any subreddit, post, or user URL and you get the same data Reddit’s frontend uses.

import requests

def fetch_subreddit(subreddit: str, sort: str = "hot", limit: int = 25):
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    headers = {
        "User-Agent": "MyResearchBot/1.0 (by /u/myuser)",
    }
    params = {"limit": limit, "raw_json": 1}
    res = requests.get(url, headers=headers, params=params, timeout=10)
    res.raise_for_status()
    posts = res.json()["data"]["children"]
    return [
        {
            "id": p["data"]["id"],
            "title": p["data"]["title"],
            "author": p["data"]["author"],
            "score": p["data"]["score"],
            "num_comments": p["data"]["num_comments"],
            "url": p["data"]["url"],
            "selftext": p["data"]["selftext"],
            "created_utc": p["data"]["created_utc"],
        }
        for p in posts
    ]

posts = fetch_subreddit("MachineLearning", sort="top", limit=100)
for p in posts[:5]:
    print(p["score"], p["title"])

You get the full post object back. selftext for text posts, url for link posts, score, comment count, author, created time, everything a logged-in user would see on the subreddit page.
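To make the selftext/url split concrete, here is a small helper over the dicts that fetch_subreddit returns. One caveat: selftext can legitimately be empty on a blank self post, so the raw API's is_self flag is the more robust discriminator if you add it to the extraction; this sketch only uses the fields already extracted above.

```python
def split_posts(posts):
    """Partition post dicts (as returned by fetch_subreddit above) into
    text posts and link posts, using a non-empty selftext as the signal."""
    text_posts = [p for p in posts if p["selftext"]]
    link_posts = [p for p in posts if not p["selftext"]]
    return text_posts, link_posts
```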

The catch

Reddit aggressively rate-limits unauthenticated traffic. Expect roughly 30 requests per minute per IP before throttling kicks in. The limit is not documented and not exposed in response headers, and enforcement is inconsistent: sometimes you get a clear 429, but often it is silent and you get a 200 with a degraded response body instead. If your scraper suddenly starts returning fewer posts per call than you asked for, that is what is happening.
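One way to spot the degraded mode programmatically is to compare what you asked for against what came back. This is a heuristic inferred from observed behaviour, not a documented signal, sketched against the listing body shape from Method 1:

```python
def looks_throttled(requested_limit, body):
    """Heuristic throttling detector for a .json listing response.

    A short page with no `after` token usually means the listing genuinely
    ended; a short page that *still* advertises an `after` token is the
    suspicious case.
    """
    data = body["data"]
    short = len(data["children"]) < requested_limit
    return short and data.get("after") is not None
```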

Three fixes, in order of effort:

  1. Slow down. time.sleep(2) between calls is enough for most projects.
  2. Fix your User-Agent. Reddit actually reads the User-Agent string and penalises generic or suspicious ones. Make it identifiable and honest rather than rotating random browser strings.
  3. Target old.reddit.com. The old subdomain serves the same JSON with lighter anti-bot protection and no JavaScript-rendered fallbacks.
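Fix 1 can be wrapped into a small throttling helper. Here is a sketch that backs off exponentially when it sees an explicit 429; the injectable sleeper/getter parameters are only there to make the function testable without hitting the network, not any Reddit convention:

```python
import time

def polite_get(url, headers=None, params=None, max_retries=3,
               base_delay=2.0, sleeper=time.sleep, getter=None):
    """GET with exponential backoff on HTTP 429.

    sleeper and getter default to time.sleep and requests.get but can be
    swapped out in tests.
    """
    if getter is None:
        import requests  # imported lazily so tests can inject a fake getter
        getter = requests.get
    for attempt in range(max_retries + 1):
        res = getter(url, headers=headers, params=params, timeout=10)
        if res.status_code != 429:
            res.raise_for_status()
            return res
        # Back off 2s, 4s, 8s, ... before retrying
        sleeper(base_delay * (2 ** attempt))
    res.raise_for_status()  # give up: surface the final 429
    return res
```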

For small-scale research or a side project reading one or two subreddits, Method 1 is all you need. Zero infrastructure, zero cost, working code in 10 lines.

Method 2: Pagination and comment threads

The .json endpoint paginates with an after token. Each response includes an after field in data.after that you pass as a query param on the next call.

import requests
import time

def fetch_all_posts(subreddit: str, sort: str = "top", pages: int = 5):
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    out = []
    after = None
    for _ in range(pages):
        params = {"limit": 100, "raw_json": 1}
        if after:
            params["after"] = after
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
        res = requests.get(url, headers=headers, params=params, timeout=10)
        res.raise_for_status()
        body = res.json()["data"]
        out.extend(body["children"])
        after = body.get("after")
        if not after:
            break
        time.sleep(2)
    return out

Comments live under a different URL pattern: https://www.reddit.com/r/{subreddit}/comments/{post_id}.json. The response is a two-element array: [0] is the post, [1] is the top-level comments tree. Replies nest inside each comment’s replies field.

def fetch_comments(subreddit: str, post_id: str):
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()
    data = res.json()
    comments_tree = data[1]["data"]["children"]
    return _flatten(comments_tree)

def _flatten(nodes, depth=0):
    out = []
    for n in nodes:
        if n["kind"] != "t1":
            continue
        d = n["data"]
        out.append({
            "id": d["id"],
            "author": d["author"],
            "body": d["body"],
            "score": d["score"],
            "depth": depth,
        })
        replies = d.get("replies")
        if isinstance(replies, dict):
            out.extend(_flatten(replies["data"]["children"], depth + 1))
    return out

Comment trees on large threads can run to thousands of nodes. Reddit uses a more stub (kind: "more") for deeply nested replies it does not load upfront. You either ignore those (fine for most analytics) or follow the children list in each more stub by making additional requests. Doing it properly is fiddly.
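Collecting the ids hidden behind those stubs is at least straightforward, even if resolving them is not. This sketch walks the same tree shape _flatten uses and gathers the deferred comment ids; actually fetching each id afterwards (for example via its comment permalink) is the fiddly part and is left out here:

```python
def collect_more_ids(nodes):
    """Walk a comment tree and gather comment ids hidden behind
    `kind: "more"` stubs (deep replies Reddit did not load upfront)."""
    ids = []
    for n in nodes:
        data = n.get("data", {})
        if n.get("kind") == "more":
            ids.extend(data.get("children", []))
            continue
        replies = data.get("replies")
        if isinstance(replies, dict):
            ids.extend(collect_more_ids(replies["data"]["children"]))
    return ids
```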

Method 3: Proxy rotation for scale

Past ~10,000 requests per day or when your IP gets soft-banned, you need rotating residential proxies. The cheap path is a commercial proxy service: ScraperAPI, Scrape.do, Bright Data, or similar. You send your request to their endpoint, they rotate the exit IP, and you pay per request.

def fetch_with_proxy(target_url: str, proxy_key: str):
    res = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": proxy_key, "url": target_url, "country_code": "us"},
        timeout=30,
    )
    res.raise_for_status()
    return res.json()

This unblocks you immediately, but you are now paying a proxy vendor per request on top of whatever you pay for servers and the engineering time to maintain the scraper. If your use case is anything more than “read public Reddit data for my product,” you are burning engineering capacity on infrastructure that has no product value.
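For completeness: if you run your own proxy pool rather than going through a vendor, the rotation itself is only a few lines using requests' proxies parameter. A sketch with hypothetical pool entries:

```python
import random

# Hypothetical pool entries; real residential proxies come from a provider.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
]

def pick_proxy(pool, rng=random):
    """Choose a random exit proxy, in the dict shape requests expects."""
    addr = rng.choice(pool)
    return {"http": addr, "https": addr}

def fetch_rotated(url, pool):
    import requests  # imported here so pick_proxy stays dependency-free
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    return requests.get(url, proxies=pick_proxy(pool),
                        headers=headers, timeout=30)
```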

The maintenance tax

Everything above works in April 2026. None of it is guaranteed to work in July 2026. Reddit has been tightening access progressively since the API pricing change. They can turn off .json responses for unauthenticated traffic tomorrow (they already did it briefly in 2023 before reversing). They can tighten rate limits. They can block common data center IP ranges.

If your product depends on Reddit data, the question is not “can I write a scraper” but “how many engineering hours per month am I willing to spend keeping the scraper running.” For personal projects, the answer is “zero because I will fix it on a Saturday.” For a commercial product, the answer is usually “fewer than I think,” because every hour spent on scraper maintenance is an hour not spent on the actual product.

When to hand the problem to an API

The point at which DIY stops paying off is when you want any of:

  • More than one platform (Reddit + Twitter + YouTube, etc.)
  • Guaranteed uptime on the data layer
  • A support contract when something breaks
  • Native AI agent access via MCP rather than HTTP glue
  • Historical backfill that goes beyond what .json returns

CreatorCrawl covers Reddit alongside TikTok, Instagram, YouTube, Facebook, and Twitter/X under one API key. The Reddit endpoints today include:

  • Subreddit details (subscribers, description, rules)
  • Subreddit posts (hot, top, new, rising with pagination)
  • Subreddit search
  • Post comments (full nested tree, flattened or recursive)
  • Cross-subreddit search

curl "https://creatorcrawl.com/api/v1/reddit/subreddit/posts?subreddit=MachineLearning&sort=top" \
  -H "x-api-key: YOUR_API_KEY"

Because it is the same API that covers the other five platforms, an AI agent plugged in via the MCP server can search Reddit, cross-reference with Twitter, and pull YouTube comments without needing three different integrations. That collapses the maintenance tax on all of it to zero.

Pricing is pay-as-you-go credits. 250 credits free on signup, credits never expire, no subscription. For most teams the economics beat building and maintaining a Reddit scraper by a wide margin once you count engineering hours honestly.

Decision table

Use case | Best approach
One-off research, <1,000 posts | Method 1, .json endpoint
Personal project, ongoing | PRAW (official API, free for non-commercial)
Commercial product, Reddit only | Method 1 or 2 + proxy rotation
Commercial product, multi-platform | Data API like CreatorCrawl
AI agent reading Reddit | MCP server (CreatorCrawl or similar)
Historical data dump | Arctic Shift (academic torrents)

Wrapping up

Reddit is still one of the most scrape-friendly major platforms in 2026. You can get far with .json endpoints and a polite user agent. If you are building a product and not a research script, at some point it makes sense to stop writing scraper code and start shipping features, which is when a multi-platform data API with a real SLA becomes the obviously cheaper option.

Whichever route you pick, sign up for CreatorCrawl if you want to try the managed path. 250 credits free, no card, native MCP for Claude and Cursor.
