
PySpark on Windows: Three Errors Every Data Scientist Hits

HADOOP_HOME is unset, pyspark.pandas won’t import, and that index warning — all solved in one place


If you have ever tried to run PySpark on a Windows laptop, you have almost certainly seen at least one of these:

java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
ImportError: cannot import name '_builtin_table' from 'pandas.core.common'
PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_parquet`,
the default index is attached which can cause additional overhead.

None of these errors means your Spark installation is broken. Each one has a specific, fixable cause. This article walks through all three — what causes them, which fix applies to your situation, and the exact setup cell that prevents all of them from appearing in the first place.


The Environment This Applies To

  • OS: Windows 10 / 11
  • PySpark: 4.x (tested on 4.1.1)
  • Python: 3.10–3.12
  • Pandas: 2.2.x (more on this below)
  • PyArrow: ≥ 15.0
  • Java (JDK): 17 or 21 (required for PySpark 4.x — see below)

If you are on macOS or Linux, Error 1 does not affect you. Errors 2 and 3 are cross-platform.
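Before chasing any of the errors below, it is worth confirming exactly what you are running. A minimal sanity-check snippet (the try/except keeps it working even when a package is missing):

```python
import sys
import platform

print("OS:     ", platform.system(), platform.release())
print("Python: ", sys.version.split()[0])

# Report installed versions without failing on missing packages.
for pkg in ("pandas", "pyarrow", "pyspark"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {mod.__version__}")
    except ImportError:
        print(f"{pkg}: not installed")
```
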


Error 1 — HADOOP_HOME and hadoop.home.dir are unset

What causes it

PySpark on Windows requires a small binary called winutils.exe to manage file system permissions when writing output. On macOS and Linux the operating system handles this natively. On Windows, Hadoop’s shell layer looks for winutils.exe and fails with a FileNotFoundException if it is not found.

This error appears the moment you try to write anything — a Parquet file, a CSV, an ORC file — regardless of whether your Spark computation itself is correct.

# This triggers the error on Windows without winutils
df_enriched.write.mode("overwrite").parquet("../data/output/")

The fix

Step 1. Check your Java version first

PySpark 4.x officially dropped support for Java 8 and requires Java 17 or 21. If you are on an older JDK, upgrade before proceeding — many winutils-related errors are actually caused by an incompatible Java version underneath. See the JDK upgrade steps below.
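You can check the active Java version without leaving Python. A small sketch (`active_java_banner` is an illustrative helper, not part of PySpark; note that `java -version` writes its banner to stderr, not stdout):

```python
import subprocess

def active_java_banner() -> str:
    """Return the first line of `java -version`, or a hint if java is missing."""
    try:
        out = subprocess.run(["java", "-version"], capture_output=True, text=True)
        # `java -version` prints its banner to stderr, not stdout
        banner = out.stderr or out.stdout
        return banner.splitlines()[0] if banner else "(no output)"
    except FileNotFoundError:
        return "java not found on PATH -- install JDK 17 or 21 first"

print(active_java_banner())
```
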

Step 2. Download winutils

Download the full bin folder from the community repository:

https://github.com/kontext-tech/winutils/tree/master/hadoop-3.4.0-win10-x64/bin

Important: Do not download only winutils.exe and hadoop.dll. In practice, copying just those two files is often not sufficient. Download the entire bin folder and replace C:\hadoop\bin\ with its contents. This is the step that resolves the error when everything else has already been tried.

For PySpark 4.x the bundled Hadoop version is 3.3.x, but the hadoop-3.4.0 binaries are stable and widely tested against PySpark 4.

Step 3. Create the folder structure

C:\hadoop\
└── bin\
    ├── winutils.exe
    ├── hadoop.dll
    └── (all other files from the bin folder)
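After copying the files, a quick check from Python confirms the layout is what Spark expects. A sketch assuming the `C:\hadoop` install location used in the steps above:

```python
import os

def check_hadoop_bin(hadoop_home: str) -> dict:
    """Report whether the key winutils files exist under <hadoop_home>\\bin."""
    required = ["winutils.exe", "hadoop.dll"]
    return {
        name: os.path.isfile(os.path.join(hadoop_home, "bin", name))
        for name in required
    }

# C:\hadoop is the install location assumed in the steps above
print(check_hadoop_bin(r"C:\hadoop"))
```
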

Step 4. Set the environment variable before any Spark imports

import os
os.environ["HADOOP_HOME"]     = r"C:\hadoop"
os.environ["hadoop.home.dir"] = r"C:\hadoop"

These two lines must appear before from pyspark.sql import SparkSession. Once the JVM starts it reads these values once and ignores any later changes.
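To avoid discovering a missing variable only when a Java stack trace appears at write time, a fail-fast guard can sit directly above the Spark import. An illustrative sketch (`assert_hadoop_home` is a hypothetical helper; `os.name == "nt"` limits the check to Windows):

```python
import os

def assert_hadoop_home(environ=os.environ) -> None:
    """Raise early, with a readable message, if Windows is missing HADOOP_HOME."""
    if os.name == "nt" and "HADOOP_HOME" not in environ:
        raise RuntimeError(
            "HADOOP_HOME is not set. Add os.environ['HADOOP_HOME'] = r'C:\\hadoop' "
            "before any pyspark import."
        )

assert_hadoop_home()  # place this directly above `from pyspark.sql import ...`
```
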

Why not set it in Windows system variables? You can, and that is the permanent solution. Open PowerShell as Administrator and run:

[System.Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "Machine")

But setting it in code at the top of the notebook is safer for shared environments where the system variable may not be present.


Upgrading to JDK 21

PySpark 4.x requires Java 17 or 21. If you are on Java 8 or 11, upgrade before doing anything else.

Step 1. Download the JDK 21 installer

Go to the Oracle Java Downloads page, select the Windows tab, and download the x64 Installer (.exe).

Step 2. Run the installer

Launch the .exe file and follow the wizard. Note the installation path — it is usually:

C:\Program Files\Java\jdk-21

Step 3. Update your environment variables

Open Settings and search for Edit the system environment variables.

  • JAVA_HOME: Under System variables, find JAVA_HOME and change its value to C:\Program Files\Java\jdk-21. If it does not exist, click New and add it.
  • Path: Find the Path variable, click Edit, and delete any old Java entries (e.g. C:\Program Files\Java\jdk1.8.x\bin). Add a new entry: %JAVA_HOME%\bin.

Click OK on all windows to save.

Step 4. Verify

Open a new Command Prompt (this is necessary to refresh the variables) and run:

java -version

Expected output:

java version "21.x.x" ...
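If you want to check the banner programmatically rather than by eye, a small parser handles both Java version schemes (`java_major` is an illustrative helper, not a PySpark function):

```python
import re

def java_major(banner: str) -> int:
    """Parse the major version out of a `java -version` banner line.

    Handles the legacy scheme ("1.8.0_392" -> 8) and the
    modern scheme ("21.0.2" -> 21).
    """
    version = re.search(r'version "([^"]+)"', banner).group(1)
    parts = version.split(".")
    return int(parts[1]) if parts[0] == "1" else int(parts[0])

print(java_major('java version "21.0.2" 2024-01-16 LTS'))  # → 21
print(java_major('java version "1.8.0_392"'))              # → 8
```
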

Error 2 — ImportError: cannot import name '_builtin_table'

What causes it

This error appears when you run:

import pyspark.pandas as ps

The full traceback points to:

File .../pyspark/pandas/groupby.py:48
    from pandas.core.common import _builtin_table
ImportError: cannot import name '_builtin_table' from 'pandas.core.common'

The cause is a Pandas version incompatibility. The _builtin_table attribute was a private internal in pandas.core.common that Pandas 3.0 removed. If your environment has Pandas 3.x installed, which is what a plain pip install pandas now gives you, pyspark.pandas cannot import cleanly.

Check your versions:

import pandas as pd
import pyspark
print(f"Pandas:  {pd.__version__}")
print(f"PySpark: {pyspark.__version__}")

If you see Pandas: 3.x.x alongside PySpark: 4.x.x, this is your problem.
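The compatibility rule reduces to a one-line check on the major version. A hypothetical helper (`pandas_compatible` is for illustration, not part of any library):

```python
def pandas_compatible(version: str) -> bool:
    """True when this Pandas version works with pyspark.pandas in the 4.x line.

    pyspark.pandas 4.x still imports private Pandas internals
    (such as _builtin_table) that Pandas 3.x removed.
    """
    return int(version.split(".")[0]) < 3

print(pandas_compatible("2.2.3"))  # → True
print(pandas_compatible("3.0.1"))  # → False
```
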

The fix

Downgrade Pandas to 2.2.3, which is the latest version that pyspark.pandas in the 4.x line supports:

pip install "pandas==2.2.3"

Restart your kernel after the install, then verify:

import pandas as pd
print(pd.__version__)   # should print 2.2.3

Why not upgrade PySpark instead? As of early 2026, no released version of PySpark has updated pyspark.pandas to be compatible with Pandas 3.x. The _builtin_table dependency and several other private Pandas internals are still referenced in the pyspark.pandas source. Databricks, the largest corporate contributor to Apache Spark, explicitly pins pandas<3 in its managed environments for this reason.

If you cannot downgrade Pandas

If your project requires Pandas 3.x for other dependencies, skip pyspark.pandas entirely and use the native Spark DataFrame API instead. It is faster, has no Pandas version dependency, and is what production PySpark pipelines use:

# Instead of pyspark.pandas:
import pyspark.pandas as ps
psdf = ps.read_parquet("data.parquet")

# Use the native API:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("app").getOrCreate()
df = spark.read.parquet("data.parquet")
df.show(3)

For small result sets that need Pandas operations, collect after aggregation:

# Collect a small aggregated result to Pandas
result_pd = (
    df.groupBy("payment_type")
      .agg({"fare_amount": "mean"})
      .toPandas()   # safe because the result is small
)

Error 3 — PandasAPIOnSparkAdviceWarning: index_col not specified

What causes it

This is a warning, not an error. Your code still runs. But it fires every time you call ps.read_parquet() without specifying index_col:

PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_parquet`,
the default index is attached which can cause additional overhead.

The reason is architectural. pandas-on-Spark is a distributed system — rows live on different executor machines. A Pandas-style integer index (0, 1, 2, …) requires Spark to generate a globally consistent sequence across all partitions, which is an expensive operation. When you do not specify index_col, Spark has to do this work silently, and it warns you that it is doing so.
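A toy sketch of why this is expensive: before any partition can number its own rows consecutively, it needs the sizes of every partition before it, which takes an extra pass over the whole dataset. This is plain-Python illustration, not Spark internals:

```python
# Each inner list stands for one partition living on a different executor.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

# Pass 1: collect every partition's size to compute global start offsets.
offsets, total = [], 0
for part in partitions:
    offsets.append(total)
    total += len(part)

# Pass 2: only now can each partition number its rows consecutively.
indexed = [
    (offsets[i] + j, row)
    for i, part in enumerate(partitions)
    for j, row in enumerate(part)
]
print(indexed)
# → [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]
```
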

The fix

Pass index_col=None explicitly to tell Spark you intentionally do not want a sequential index. This both silences the warning and skips the overhead of generating one:

import pyspark.pandas as ps

psdf = ps.read_parquet(
    "../data/taxi/yellow_tripdata_2023-01.parquet",
    index_col=None
)
print(psdf.shape)

Alternatively, if your data has a natural key column, specify it as the index:

psdf = ps.read_parquet(
    "../data/taxi/yellow_tripdata_2023-01.parquet",
    index_col="tpep_pickup_datetime"
)

To silence all PandasAPIOnSparkAdviceWarning warnings globally for a session:

import warnings
from pyspark.pandas.utils import PandasAPIOnSparkAdviceWarning

warnings.filterwarnings("ignore", category=PandasAPIOnSparkAdviceWarning)

The Complete Setup Cell

Put this at the very top of every PySpark notebook on Windows. It handles all three issues at once:

# ── Environment setup — must run before any other imports ────────────
import os
import warnings

# Fix 1: winutils path (Windows only)
os.environ["HADOOP_HOME"]             = r"C:\hadoop"
os.environ["hadoop.home.dir"]         = r"C:\hadoop"

# Fix 2 (partial): suppress PyArrow timezone warning
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Fix 3: suppress pandas-on-Spark advisory warnings
from pyspark.pandas.utils import PandasAPIOnSparkAdviceWarning
warnings.filterwarnings("ignore", category=PandasAPIOnSparkAdviceWarning)

# ── Imports ──────────────────────────────────────────────────────────
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, round, when, unix_timestamp,
    avg, count, stddev,
    sum as spark_sum,
    hour, month, broadcast
)
import pyspark.sql.functions as F
import pyspark.pandas as ps   # requires pandas==2.2.3

# ── Session ──────────────────────────────────────────────────────────
spark = (
    SparkSession.builder
    .appName("My PySpark App")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

# ── Verify ───────────────────────────────────────────────────────────
import pandas as pd
print(f"Spark:   {spark.version}")
print(f"PySpark: {__import__('pyspark').__version__}")
print(f"Pandas:  {pd.__version__}")

# ── Load data ────────────────────────────────────────────────────────
psdf = ps.read_parquet(
    "../data/taxi/yellow_tripdata_2023-01.parquet",
    index_col=None   # Fix 3: suppress index overhead warning
)
print(f"Shape: {psdf.shape}")

Expected output:

Spark:   4.1.1
PySpark: 4.1.1
Pandas:  2.2.3
Shape: (3066766, 19)

Version Compatibility Reference

Component    Required       Notes
PySpark      4.x            Tested on 4.1.1
Pandas       2.2.3          3.x breaks pyspark.pandas
PyArrow      ≥ 15.0         Set PYARROW_IGNORE_TIMEZONE=1
Python       3.10–3.12      3.13 untested
Java (JDK)   17 or 21       PySpark 4.x dropped Java 8 support
winutils     hadoop-3.4.0   Windows only; use the full bin folder

Pin these in your requirements.txt:

pyspark==4.1.1
pandas==2.2.3
pyarrow>=15.0.0
numpy>=1.26

Quick Reference

Error                                              Root cause                         Fix
HADOOP_HOME and hadoop.home.dir are unset          winutils.exe missing or incomplete Download the full bin folder from hadoop-3.4.0; set HADOOP_HOME
ImportError: cannot import name '_builtin_table'   Pandas 3.x removed a private API   pip install "pandas==2.2.3"
PandasAPIOnSparkAdviceWarning: index_col           No index column → overhead         Pass index_col=None to ps.read_parquet()
Errors persist despite correct HADOOP_HOME         Incompatible JDK version           Upgrade to JDK 21 and update JAVA_HOME

Summary

None of these errors indicate a broken Spark installation. They are all configuration or dependency version problems with deterministic fixes. The pattern is:

  1. Upgrade to JDK 21 first. PySpark 4.x requires Java 17 or 21. An incompatible Java version causes errors that can look like winutils or Hadoop problems.
  2. Install the full bin folder. Downloading only winutils.exe and hadoop.dll is often not enough — replace the entire C:\hadoop\bin\ directory.
  3. Set HADOOP_HOME before the JVM starts. Two lines, placed before any PySpark import.
  4. Pin Pandas at 2.2.3. pyspark.pandas has not caught up with Pandas 3.x yet.
  5. Pass index_col=None to every ps.read_parquet() call. One argument, zero warnings.

The complete setup cell above handles all of this. Copy it into your notebook template and you will not see any of these errors again.


Tags: PySpark, Python, Data Engineering, Windows, Pandas
