A Clean and Reproducible Way to Load Datasets from URLs in Python

Building a reusable data loader for remote data sources

Introduction

In modern data science workflows, datasets are rarely stored locally from the beginning. Instead, they are often accessed directly from online sources such as GitHub repositories, public APIs, cloud storage, or research data portals. While tools like pandas make it easy to load data from a URL using a single line of code, real-world projects quickly expose a problem: simple one-liners are not enough for reproducibility, robustness, and maintainability.
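For reference, the one-liner in question looks like this; pandas can read a CSV directly from a URL:

import pandas as pd

# Convenient, but the file is re-downloaded on every run, and there is
# no caching, validation, or error handling if the link breaks
df = pd.read_csv(
    "https://raw.githubusercontent.com/numenta/NAB/master/"
    "data/realKnownCause/nyc_taxi.csv"
)

(This is the same dataset we will load properly later in the article.)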

What happens when a link changes, a dataset is compressed in a ZIP file, or the source requires preprocessing before loading? These challenges make ad-hoc data loading fragile and difficult to scale across projects.

This article focuses on building a clean, reusable, and reproducible way to load datasets from URLs in Python. You will learn how to design a small but powerful data loader that can handle common real-world scenarios, improve code organization, and make your data pipelines more reliable and portable across environments.

Dataset Source

We use the NYC Taxi demand dataset as an example. It is publicly available from the NAB (Numenta Anomaly Benchmark) repository:

https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv

It contains two columns:

  • timestamp — time of the observation
  • value — number of taxi passengers

The data spans July 2014 to January 2015 with a 30-minute sampling interval, totaling 10,320 observations.
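Before writing any loading code, you can confirm the file format by peeking at the first raw lines. Here is a quick sketch using only the standard library:

import urllib.request

URL = (
    "https://raw.githubusercontent.com/numenta/NAB/master/"
    "data/realKnownCause/nyc_taxi.csv"
)

# Print the header row and the first data row
with urllib.request.urlopen(URL) as resp:
    for _ in range(2):
        print(resp.readline().decode("utf-8").strip())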

A Reusable Data Loader

The following function downloads the dataset (if needed), caches it locally, and loads it as a time series.

import os
import pandas as pd
import urllib.request

def load_nyc_taxi_data(data_dir="./data"):
    """
    Load the NYC Taxi dataset from the NAB repository.
    The dataset is downloaded once and cached locally.
    """

    url = (
        "https://raw.githubusercontent.com/numenta/NAB/master/"
        "data/realKnownCause/nyc_taxi.csv"
    )

    os.makedirs(data_dir, exist_ok=True)
    local_path = os.path.join(data_dir, "nyc_taxi.csv")

    # Download if not already cached
    if not os.path.exists(local_path):
        print("Downloading NYC taxi dataset...")
        urllib.request.urlretrieve(url, local_path)
        print(f"Saved to {local_path}")
    else:
        print("Loading cached dataset.")

    df = pd.read_csv(local_path)

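    # Parse timestamps and index the series chronologically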
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.set_index("timestamp").sort_index()

    return df["value"]

This function ensures that:

  • the dataset is downloaded only once
  • it is stored locally in ./data/
  • the result is returned as a clean time-indexed series

Using the Loader

Once the function is defined, loading the dataset becomes extremely simple.

series = load_nyc_taxi_data()

You can then inspect the dataset:

print(f"Dataset shape: {series.shape}")
print(f"Date range: {series.index[0]} to {series.index[-1]}")
print(f"Sampling interval: 30 minutes")
print(f"Total observations: {len(series)}")
print("\nBasic statistics:")
print(series.describe().round(2))

Example output (basic statistics omitted for brevity):

Dataset shape: (10320,)
Date range: 2014-07-01 00:00:00 to 2015-01-31 23:30:00
Sampling interval: 30 minutes
Total observations: 10320

Why This Approach Is Useful

This small utility function provides several advantages:

  • Reproducibility — anyone can run the code and automatically obtain the dataset.
  • Efficiency — the dataset is cached locally after the first download.
  • Clean workflow — the function returns a ready-to-use time series.
  • Reusability — the loader can be reused across notebooks, scripts, or research experiments, as sketched below.
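For instance, if you save the function in a module (say, a hypothetical data_loader.py next to your notebooks), reuse becomes a one-line import:

# In any notebook or script; assumes data_loader.py is importable
from data_loader import load_nyc_taxi_data

series = load_nyc_taxi_data(data_dir="./data")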

Applications

The NYC taxi dataset is widely used for:

  • time-series visualization
  • anomaly detection research
  • wavelet and multiscale analysis
  • forecasting experiments
  • machine learning benchmarking

Because the dataset contains daily periodicity, long-term patterns, and event-driven anomalies, it serves as an ideal benchmark for studying real-world temporal dynamics.
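As a quick illustration, a minimal plotting sketch (assuming matplotlib is installed) makes the daily cycles and the anomalies visible:

import matplotlib.pyplot as plt

series = load_nyc_taxi_data()

# Daily cycles appear as a high-frequency oscillation; events such as
# holidays show up as sharp deviations from the regular pattern
ax = series.plot(figsize=(12, 4))
ax.set_title("NYC taxi passengers, 30-minute intervals")
ax.set_ylabel("passengers")
plt.tight_layout()
plt.show()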

Conclusion

In this article, we explored how to build a clean and reusable approach for loading datasets directly from URLs in Python, focusing on reproducibility and practical data engineering design. Instead of relying on one-off pandas calls, we developed a more structured data loading workflow that can handle real-world challenges such as remote file access, version changes, and different data formats.

To demonstrate the approach, we used the NYC Taxi demand dataset from the NAB (Numenta Anomaly Benchmark) GitHub repository as a working example. This dataset is widely used in time series and anomaly detection research, making it a realistic test case for building robust data ingestion pipelines.

By applying a reusable data loader design, we showed how a simple dataset retrieval task can be transformed into a consistent and maintainable component of a larger data science workflow. This approach not only improves code clarity but also supports reproducibility—an essential requirement for research, experimentation, and production systems.

If you extend this idea further, the same pattern can be adapted to handle compressed files, multiple datasets, or even full automated data ingestion pipelines.
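As one example, here is a minimal sketch for the compressed-file case; the url and member arguments are placeholders to adapt to your own source:

import io
import urllib.request
import zipfile

import pandas as pd

def load_csv_from_zip(url, member, **read_csv_kwargs):
    """Download a ZIP archive and read a single CSV member from it."""
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    with archive.open(member) as f:
        return pd.read_csv(f, **read_csv_kwargs)

Combined with the caching logic from the loader above, this gives you a single, consistent pattern for both plain and compressed remote datasets.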
