How to handle pagination in web scraping?
Web scraping often gets tricky when data is spread across multiple pages. Missing anything beyond page 1 can leave your dataset incomplete or even useless.
In this guide, I'll show you how to tackle pagination in all its forms, from simple page numbers to infinite scroll, so you can extract every item you need.
What is pagination in web scraping?
Pagination is the practice of splitting a large set of results or content into separate pages (page 1, page 2, page 3, etc.), rather than showing everything on one long page.
Websites use pagination for a few key reasons:
- User Experience (UX): It’s easier to navigate content in chunks. Users can jump to the next page or a specific page number, rather than endlessly scrolling a massive list. This makes browsing more manageable and less overwhelming.
- Performance: Loading thousands of items on one page would be slow and heavy. Breaking data into pages improves page load times by fetching smaller pieces on demand. The site feels faster and uses less memory for each view.
- SEO (Search Engine Optimization): Pagination can help search engines crawl and index content more efficiently. Multiple well-structured pages with relevant content and links can be better for SEO than one giant page, and it avoids the drawbacks of infinite scroll for search indexing.
In short, showing everything on one page is usually impractical, so pagination ensures websites remain fast, organized, and user-friendly.
Why is pagination challenging for scrapers?
The very things that make pagination useful for humans can frustrate web scrapers. There’s no single standard for how sites do pagination – every site can implement it differently:
- Some use simple URL query parameters (?page=2), while others rely on HTML links or JavaScript events.
- Many modern sites load new results dynamically via scripts (AJAX calls or infinite scroll), which means a basic HTML fetch won't see the later pages.
- Page navigation can be hidden behind buttons or require user interaction (clicking or scrolling), making automation harder.
- If your scraper doesn’t recognize the pattern, it might get stuck in a loop or stop too early, missing data.
- Anti-bot considerations: if you rapidly fetch dozens of pages, you could hit rate limits or trigger anti-scraping measures (CAPTCHAs, IP blocks, etc.). Handling pagination safely means pacing your requests and sometimes mimicking a real user, as in the sketch after this list.
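For instance, a minimal pacing sketch (the URL, parameters, and delay values below are placeholders, not taken from any particular site) might look like this:

import random
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}  # placeholder client identity

for page in range(1, 11):  # placeholder: a 10-page listing
    response = requests.get(
        "https://example.com/listings",      # placeholder URL
        params={"page": page},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    # ... parse and store this page's items here ...
    time.sleep(random.uniform(1, 3))  # pause between pages instead of hammering the server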
Understanding what type of pagination you’re dealing with is the first big step. Next, we’ll see why getting pagination right is so important for data quality.
Why does pagination matter for high-quality datasets?
If you care about having complete, reliable data, you must handle pagination correctly.
Here’s why:
- Completeness of records: Paginated content is like a book split into chapters. If you only scrape the first chapter, you miss the rest of the story. For example, an e-commerce site might list 500 products over 50 pages. Scraping only page 1 would give you just 10 products, not very useful for analysis or comparison. High-quality datasets require gathering all the pages so you have the full picture.
- Avoiding bias or errors: In many cases, page 1 is not representative of the rest. Perhaps results are sorted by popularity or date, meaning later pages contain different kinds of entries. If you ignore pages 2, 3, and beyond, any insights or models built on that data could be biased or plain wrong. For instance, imagine scraping a job board and only collecting the first 20 listings: you might conclude there are no new jobs after a certain date simply because you missed the later pages.
- Real-world example (scraping a marketplace): Let's say you're scraping a real estate marketplace for rental listings. The site shows 1,000 listings, 20 per page. If you stopped at page 1, you'd have only 2% of the data. You might miss entire neighborhoods or price ranges that appear deeper in the results. In practice, when DataHen performs such a scrape, it ensures every paginated section is covered so the final dataset of listings is complete and accurate.
Next, let’s explore the common patterns of pagination you’ll encounter and how to recognize each.
The most common types of website pagination (with examples)
Websites implement pagination in several ways. Understanding the pattern is critical because it determines how you’ll write your scraper.
Here are the most common types of pagination, with examples and tips for scraping each one:
1. Numbered pagination (page=2, page=3, …)
This is the classic pagination with numbered page links (often at the bottom of a page). You’ll see buttons or links for 1, 2, 3, etc., sometimes with “Next”/“Prev” arrows. Each number corresponds to a fixed URL or query parameter, making it the easiest pattern to automate.
For example, a news site might have URLs like https://example.com/articles?page=1, ...?page=2, ...?page=3 and so on. Many e-commerce and news sites use this approach; Amazon's product listings or Google search results with start=10, 20, ... are typical cases.
A typical numbered pagination interface, with consecutive page links and “Previous/Next” buttons.
How to identify: Look at the page’s URL as you click through pages. Does a number in the URL change (or an incrementing parameter appear)? For instance, on ScrapeThisSite, page 1 has ...?page_num=1 and page 2 becomes ...?page_num=2. The HTML will show a series of <a> tags or buttons with page numbers.
How to scrape it: Once you know the URL pattern, you can simply loop through page numbers in your code. This can be as easy as incrementing a counter and constructing the URL for each page. For example, using Python and the Requests library:
import requests
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/catalogue/page-{}.html"

for page in range(1, 6):  # scrape first 5 pages for demo
    url = base_url.format(page)
    response = requests.get(url)
    if response.status_code != 200:
        break  # stop if page is not accessible (might have reached the end)
    soup = BeautifulSoup(response.text, 'html.parser')
    products = soup.select(".product_pod")
    print(f"Page {page}: Found {len(products)} items")
    # ... extract product details ...
In practice, you might not know the last page upfront. One strategy is to keep requesting pages until you get an empty result or a repeat. Another is to check each page’s HTML for a disabled “Next” button or the absence of a “Next” link to detect the end.
Numbered pagination is straightforward because you can predict URLs, but always build a stop condition (don’t blindly go on forever if a site unexpectedly loops).
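If you don't know the last page in advance, here is a minimal sketch of such a stop condition using the Books to Scrape demo site (it assumes the site's "next" control is an <li class="next"> element, so verify the selector against the actual markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://books.toscrape.com/catalogue/page-1.html"
page = 0
while url:
    page += 1
    response = requests.get(url)
    if response.status_code != 200:
        break  # non-200 response: treat it as the end
    soup = BeautifulSoup(response.text, 'html.parser')
    print(f"Page {page}: Found {len(soup.select('.product_pod'))} items")
    next_link = soup.select_one("li.next > a")  # assumed 'next' markup
    url = urljoin(url, next_link["href"]) if next_link else None  # stop when the Next link disappears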
2. Next/Previous button pagination
This pattern has “Next” and “Previous” buttons instead of direct numbered links. It’s common on simpler forums or article series. Each page points to the next one through a link or button, but the URL might not follow a simple numeric pattern.
How to identify: You’ll see a “Next” button at the bottom (and maybe a “Prev”). When you click Next, the URL might change in a non-obvious way (or sometimes not at all, if it loads in place). View the HTML and look for an anchor tag for the Next button.
For example, early web forums often had something like: <a href="page2.html">Next</a> or JavaScript-driven links labeled “Next”.
Risks: This requires crawling page by page, following the link each time, which can risk an infinite loop if the site's navigation wraps around or if your scraper doesn't realize it hit the last page. For instance, if the "Next" button is enabled even on the last page (perhaps linking back to the first page or to a duplicate page), an unguarded scraper could cycle endlessly.
How to scrape it: You typically need to fetch the page, parse it to find the "Next" link's URL, then fetch that, and repeat. For example:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/listings/page1.html"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract data from the current page ...
    next_link = soup.find("a", string="Next")
    if next_link:
        url = urljoin(url, next_link['href'])  # resolve relative links against the current page
    else:
        url = None  # no Next link, end of pagination

Make sure to convert relative URLs to absolute (using something like urljoin) if needed. Always check whether the "Next" link is present or disabled. If it's no longer there or points to the same page, you've reached the end.
Also, implement a safeguard (like a max page count or tracking visited URLs) to avoid infinite loops in cases of unexpected behavior.
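As a rough sketch of those safeguards (the starting URL is the placeholder from above, and the 500-page cap is an arbitrary limit you'd tune per site):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/listings/page1.html"  # placeholder starting page
visited = set()
MAX_PAGES = 500  # hard cap so a looping site can't trap the scraper

while url and url not in visited and len(visited) < MAX_PAGES:
    visited.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract data from the current page ...
    next_link = soup.find("a", string="Next")
    url = urljoin(url, next_link["href"]) if next_link else None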
3. "Load more" button (AJAX pagination)
Instead of distinct pages, some sites show a "Load more" (or "Show more results") button at the bottom. Clicking it fetches the next batch of items and appends them to the current page without a full page reload. This is a hybrid between classic pagination and infinite scroll: it requires a user action (a click) but doesn't change the page URL or trigger a new page load.
How to identify: You won’t see multiple page links or a next page URL. Instead, there’s a button that, when clicked, dynamically loads more content (via AJAX). If you scroll down and see a button that brings more entries into view, that’s a load-more pattern. The URL in the address bar stays the same.
From an HTML perspective, there might be a <button> or <a> element with text like “Load more” or an icon. Nothing in the static HTML indicates the next page URL, because the content is fetched by a script.
However, using your browser's DevTools Network tab, you can observe what happens when you click the button: often it triggers an XHR request for data (maybe JSON or an HTML snippet).
How to scrape it: There are two approaches:
- Use browser automation to click the button repeatedly: A browser automation tool (such as Selenium) can keep clicking "Load more" until the button disappears. For example:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/products")
while True:
    try:
        button = driver.find_element(By.XPATH, "//button[text()='Load More']")
        button.click()
        time.sleep(2)  # wait for new content to load
    except NoSuchElementException:
        break  # no more load-more button (end of list)
# Now all items are loaded, get page source:
page_html = driver.page_source
This clicks the button until it no longer finds it (meaning all content is loaded). Be cautious: repeatedly clicking can trigger anti-bot defenses if done too quickly or too many times, so you might need delays and perhaps proxy rotation for large numbers of clicks.
- Use the site's underlying API calls directly: Often, the "Load more" button triggers a background request. Inspect the Network XHR traffic; you might find a request like GET /products?offset=20&limit=20 or POST /load_more that returns the next set of items in JSON. If such an API is discoverable, you can skip the browser automation and directly request those endpoints in a loop (this crosses into the API-based pagination method below). For instance, if clicking "Load more" calls an endpoint that returns a JSON list of items, you can script requests to ?offset=0, 20, 40, ... until the response comes back empty, as sketched below. This is often faster and more robust.
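Here's a minimal sketch of that offset loop (the endpoint, parameter names, and JSON shape are assumptions you'd confirm in the Network tab, not a real API):

import requests

offset, limit = 0, 20
all_items = []
while True:
    res = requests.get(
        "https://example.com/products",            # hypothetical endpoint seen in DevTools
        params={"offset": offset, "limit": limit},
    )
    res.raise_for_status()
    batch = res.json().get("items", [])            # assumed field holding the result list
    if not batch:
        break  # an empty batch means everything has been loaded
    all_items.extend(batch)
    offset += limit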
In summary, scraping “load more” requires either simulating user interaction or reverse-engineering the network calls. Both ensure you get results beyond the initial page.
4. Infinite scroll (automatic dynamic loading)
Infinite scroll is a variant of AJAX pagination where new content loads continuously as you scroll down, without any explicit "next" button or user click. Sites like Twitter, Instagram, and many news feeds use infinite scrolling: users can keep scrolling and new posts keep appearing at the bottom automatically.
How to identify: If the page keeps extending when you scroll and you never have to click “next” or “load more,” it’s infinite scroll. There will be no pagination controls in the UI. Under the hood, as you reach the bottom of the page, JavaScript fires off an AJAX request to fetch more content.
From a scraper’s perspective, the initial page HTML might contain only a small subset of items. The rest are fetched dynamically via network calls (often JSON data). Using DevTools, you’ll see requests being made as you scroll, for example, calls to an API like /search?after=<lastItemId> to get the next chunk.
Why it breaks traditional scrapers: If you use a simple requests.get() or BeautifulSoup on the page URL, you'll only get the initial HTML, which might contain just the first set of items. None of the later content exists in the HTML without executing the page's JavaScript. That's why infinite scroll requires more advanced handling: the data is lazy-loaded.
How to scrape it: You have two main options:
- Use a headless browser to simulate scrolling: Tools like Selenium or Playwright can scroll the page, triggering the JS events that load more content. The strategy is typically to scroll down incrementally and wait for new data until no more loads. For example, using Selenium:
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/infinitelist")
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for loading
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no more new content
    last_height = new_height
page_html = driver.page_source
This script scrolls to the bottom over and over, until the page height stops changing (meaning no new items were added). At that point, all content is loaded and you can parse page_html with BeautifulSoup or similar to extract the data.
- Directly call the APIs behind infinite scroll: Like with the "load more" case, infinite scroll is powered by background requests. If you can sniff out the pattern (maybe the site uses a page parameter in an API call, or a cursor or offset token), you can loop through those calls with regular HTTP requests. For instance, an infinite scroll might call /api/feed?cursor=<XYZ>, where <XYZ> is an identifier for the next chunk. By replicating what the script does (often the responses include a new cursor or an indication of the last page), you can retrieve all data without actually rendering the page, as sketched below.
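A minimal sketch of that cursor loop (the endpoint and the nextCursor field are hypothetical; use whatever names appear in the requests and responses you capture):

import requests

cursor = None
items = []
while True:
    params = {"cursor": cursor} if cursor else {}
    res = requests.get("https://example.com/api/feed", params=params)  # hypothetical feed API
    res.raise_for_status()
    data = res.json()
    items.extend(data.get("items", []))  # assumed field for the returned posts
    cursor = data.get("nextCursor")      # assumed field carrying the next-page token
    if not cursor:
        break  # no token returned: end of the feed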
Scraping infinite scroll is definitely more involved. If the site is heavy on JS frameworks (React, etc.), using an automated browser might be the more straightforward path. In other cases, figuring out the API calls can save a lot of time. Note that infinite scroll can feel endless, so always implement a stop check (like the scrollHeight trick or checking whether an API returns an empty result) to know when you've got everything.
5. API-based or offset-based pagination
This isn’t a visual pattern, but rather a method you’ll encounter by digging into network calls or official site APIs. In API-based pagination, the website’s backend provides a dedicated endpoint (often returning JSON) that accepts parameters for page number, page size, offsets, or cursors. For example, a URL like https://api.example.com/items?offset=20&limit=20 or ...?page=3&per_page=20 returns the next set of results.
Many modern web apps (and mobile apps) use JSON APIs even if the content is also displayed on the site. Sometimes these APIs are consumed by the site’s JavaScript (as in the load-more or infinite scroll scenarios), and sometimes they are public-facing for developers. Either way, they often provide a very clean and efficient way to scrape data.
How to identify: Use the browser’s Network tab. When you navigate through pages or trigger “load more,” look for XHR/fetch requests. If you see requests to a URL that returns structured data (especially JSON), that’s likely an API. Clues include responses containing JSON arrays or objects with data entries, or query strings with page, offset, limit, cursor, etc. For instance, when inspecting a site, you might notice a call like GET /api/search?query=shoes&page=2 whenever you click page 2 on the site’s UI.
Another clue is the presence of terms like “api” in URLs or in the site’s scripts. Some developers leave hints in the HTML or JavaScript variables.
How to scrape it: If you have an API endpoint, you should use it! It’s usually the most reliable and fastest method because you’re getting data in a structured form (no need to parse HTML).
For example, suppose you found that a site’s infinite scroll uses a call to https://example.com/api/products?page=1&limit=50. You can write:
import requests

base_url = "https://example.com/api/products"
params = {"page": 1, "limit": 50}
while True:
    res = requests.get(base_url, params=params)
    data = res.json()
    items = data.get("products", [])
    if not items:
        break  # no more products
    for item in items:
        # process item (already structured JSON)
        pass
    params["page"] += 1
This will loop through pages 1, 2, 3, ... until the API returns an empty list (or you can use a total count if provided). Some APIs use an offset rather than a page number (e.g., offset=0, 50, 100, ... with limit=50). Others use a cursor (a token string that you get from one response and send in the next request). In each case, the principle is the same: keep requesting until you've got all the data.
Often, API responses include metadata like total count or a “next page” token. Always check the JSON for fields like "hasNext": false or "nextPage": "...token..." which indicate when to stop or how to get the next batch.
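For example, a sketch of stopping on such metadata (the hasNext field name here is an assumption; check your API's actual response for the real field):

import requests

params = {"page": 1, "limit": 50}
while True:
    data = requests.get("https://example.com/api/products", params=params).json()
    for item in data.get("products", []):
        pass  # process each structured item
    if not data.get("hasNext", False):  # assumed flag indicating more pages remain
        break
    params["page"] += 1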
Note: Using private APIs can sometimes be against a site's terms of service, so be mindful. However, if the site itself uses that API via JavaScript, scraping it is usually akin to scraping the site openly, just more efficient. And from a technical standpoint, API pagination is a scraper's dream: no parsing mess, and often fewer requests (because you can ask for larger chunks of data at once).
How to detect which pagination pattern a site uses
When you approach a new site to scrape, use this checklist to figure out the pagination scheme:
- Observe the URL as you navigate pages: Click on page 2 or hit "Next" on the site and see if the URL changes. If you see a parameter like ?page=2 or an incrementing number in the path, it's likely URL-based pagination. If the URL stays the same, you might be dealing with a load-more or infinite scroll (or a Next button that uses JS).
- Inspect the HTML for pagination controls: View the page source or use DevTools to look for elements that indicate pagination. Common signs are a list of page links (<a> tags with numbers), a "Next" or "Previous" button, or a "Load more" button. The presence of these elements will tell you if it's standard link-based pagination vs. a button.
- Use the DevTools Network panel: Open the Network tab (with the XHR filter) and perform the pagination action (click Next, or scroll down). Watch for any network requests that happen. If you see calls for data (especially returning JSON), the site likely uses an AJAX approach or an API under the hood. This can reveal whether you need to simulate a browser or if you can call those endpoints directly.
- Check for JavaScript event listeners: Sometimes you might not immediately find a link or button. The site could be listening for scroll events or click events on certain elements. In the DevTools Elements panel, look for onClick handlers or search the source (Ctrl+F) for keywords like "nextPage", "loadMore", or "scroll". The site's JavaScript may have functions or variables hinting at how it loads the next set of data.
- Simulate user actions in the console: You can manually trigger possible events. For infinite scroll, go to the Console and run window.scrollTo(0, document.body.scrollHeight) to force a scroll and see if new content loads. For a suspected button, you can try to trigger a click via JavaScript. This experimentation can confirm dynamic behavior.
Using these methods, you’ll pinpoint the pattern: whether it’s purely URL-based, requiring clicking through, or loading via scripts. Once you know that, you can choose the right scraping strategy.
Don’t Let Pagination Limit Your Scraping
Getting pagination right is the difference between partial data and powerful insight. Whether you're dealing with numbered links, infinite scroll, or hidden APIs, handling it properly ensures your dataset is complete, accurate, and reliable.
If you'd rather skip the trial and error and focus on what really matters (using the data), reach out to the experts. At DataHen, we specialize in scraping anything the internet throws at you, including the trickiest paginated content. Let us handle the complexity so you can get straight to the insights.
Need help with a complex scraping project? Contact DataHen and we'll get you the data you need.