A million rows used to be a database problem. Now it is a Tuesday afternoon. Transaction logs, ad impressions, sensor readings, user event streams — the files that land on analysts' desks routinely contain millions of rows. And the tools most people reach for first — Excel, Google Sheets — were never designed for this scale.
If you have tried opening a million-row CSV in Excel, you know the outcome: it either truncates your data at row 1,048,576 (dropping the rest with little more than a warning), freezes for several minutes, or crashes entirely. Google Sheets gives up even earlier: its 10-million-cell cap translates to around 500,000 rows for a typical 20-column dataset.
This guide covers practical strategies for actually working with million-row data, from quick command-line techniques to tools that handle the scale natively.
Why traditional spreadsheets fail at this scale
The spreadsheet was invented in 1979 for financial modeling. VisiCalc had 254 rows. Even today, the fundamental architecture has not changed much: load every cell into memory, maintain formatting for each one, keep a formula dependency graph, and render the entire grid so users can scroll through it.
This approach works brilliantly for the 10,000-row budget or the 50,000-row sales report. It breaks when you add another zero or two.
| Row Count | Excel Behavior | Recommended Approach |
|---|---|---|
| Under 100K | Works fine | Any spreadsheet |
| 100K – 500K | Sluggish with formulas | Excel with manual calc, or Viztab |
| 500K – 1M | Frequent freezes | Viztab, Python, or DuckDB |
| 1M – 10M | Exceeds row limit | Viztab, pandas, DuckDB, database |
| Over 10M | Cannot open | Database, Polars, Spark |
The problem is not just the row limit. Even if your data fits under 1,048,576 rows, Excel loads every cell's formatting, metadata, and formula dependencies into RAM. A million rows with 20 columns means 20 million cell objects. With Excel's per-cell overhead, that is 2–4 GB of memory before you do anything.
Strategy 1: Filter before you load
The most effective strategy is also the simplest: do not load all the data. Most analysis questions do not require every row. If you need last quarter's Northeast sales, extract only those rows before opening anything visual.
Command-line pre-filtering
These commands stream through the file line by line. They use almost no memory and finish in seconds, even on multi-gigabyte files. They are pre-installed on macOS and Linux, and available on Windows through WSL.
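As a minimal sketch (the file name and the `date,region,amount` column layout are hypothetical), here is how `head`, `grep`, and `awk` extract a subset before anything heavier touches the file:

```shell
# Small stand-in for a multi-gigabyte export (hypothetical schema: date,region,amount)
printf 'date,region,amount\n2024-01-05,Northeast,100\n2024-02-10,West,200\n2024-03-15,Northeast,300\n' > sales.csv

# Peek at the first rows without loading anything
head -n 5 sales.csv

# Keep the header, then append only matching rows
head -n 1 sales.csv > northeast.csv
grep ',Northeast,' sales.csv >> northeast.csv

# awk filters on an exact column value instead of a raw text match
awk -F',' 'NR==1 || $2=="Northeast"' sales.csv > northeast_awk.csv

# How many rows matched?
grep -c ',Northeast,' sales.csv
```

The `grep` version is faster but can match the text anywhere on the line; the `awk` version checks the second field exactly, which avoids false positives when the same word appears in another column.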
Python pre-filtering with chunks
If your filtering logic is more complex than a simple text match, pandas can read the file in manageable chunks:
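A minimal sketch of chunked filtering (the file name and `region`/`amount` columns are made up for the demo; in practice you would point `read_csv` at your real file):

```python
import pandas as pd

# Build a small demo file standing in for a multi-gigabyte CSV
pd.DataFrame({
    "region": ["Northeast", "West", "Northeast", "South"],
    "amount": [100, 200, 300, 400],
}).to_csv("sales.csv", index=False)

# Stream the file in fixed-size chunks so only one chunk is in memory at a time
# (use a chunksize like 100_000 for real files; 2 keeps the demo visible)
filtered_chunks = []
for chunk in pd.read_csv("sales.csv", chunksize=2):
    # Any pandas logic works here, however complex
    filtered_chunks.append(chunk[chunk["region"] == "Northeast"])

northeast = pd.concat(filtered_chunks, ignore_index=True)
northeast.to_csv("northeast.csv", index=False)
print(len(northeast))  # 2
```

Peak memory is bounded by the chunk size rather than the file size, so this works on files far larger than RAM.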
Power Query pre-filtering (Excel)
If you must stay in Excel, use Power Query instead of opening the file directly. Go to Data → Get Data → From Text/CSV, then use the query editor to filter rows and remove columns before loading. Power Query streams the file so Excel never has to hold the full dataset in memory.
Strategy 2: Sample first, then scale
When you are exploring an unfamiliar dataset, you do not need a million rows to understand its structure. A random 10,000-row sample tells you about columns, data types, value distributions, and quality issues. Build your analysis logic on the sample, then apply it to the full dataset.
This technique is especially valuable for data validation. If your sample reveals encoding issues, malformed rows, or unexpected values, you have saved yourself the time of loading the full file only to discover problems.
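One way to sample without knowing the row count in advance is reservoir sampling: a single pass, constant memory, and every row has an equal chance of selection. This is a sketch with a made-up `events.csv` file; only the standard library is used:

```python
import random

def sample_rows(path, k, seed=42):
    """Return the header and k uniformly sampled data lines in one pass."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        header = next(f)
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)          # fill the reservoir first
            else:
                j = rng.randint(0, i)           # replace with decreasing probability
                if j < k:
                    reservoir[j] = line
    return header, reservoir

# Demo: sample 100 rows from a 10,000-row file
with open("events.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i % 7}\n" for i in range(10_000))

header, sample = sample_rows("events.csv", 100)
print(header.strip(), len(sample))
```

Because the file is read line by line, memory use stays at roughly `k` lines no matter how large the input grows.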
Strategy 3: Aggregate instead of browsing
Most questions about million-row datasets are aggregate questions: total revenue by region, average response time by endpoint, count of events by category. You need a summary, not every individual row.
DuckDB: SQL directly on CSV files
DuckDB is an embedded analytics database that queries CSV and Parquet files in place — no import step, no server, no setup beyond installing it:
This query runs in seconds on a million-row file. DuckDB uses columnar processing and does not load the full file into memory, so it works on files much larger than your available RAM.
Python pivot tables
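With pandas, `pivot_table` collapses millions of rows into a summary grid in one call. A sketch with invented `region`/`quarter`/`revenue` columns:

```python
import pandas as pd

# Demo data standing in for a million-row file (hypothetical columns)
df = pd.DataFrame({
    "region":  ["Northeast", "West", "Northeast", "West"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, 200, 300, 400],
})

# Aggregate into a compact summary instead of browsing raw rows
summary = df.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(summary)
```

The result has one row per region and one column per quarter, regardless of how many input rows fed it.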
Polars is worth noting here. For files over 1 GB, Polars is significantly faster than pandas because it uses lazy evaluation and processes data in parallel. The API is slightly different but the concepts are the same.
Strategy 4: Use a spreadsheet built for the scale
Sometimes you need the spreadsheet experience. You want to scroll through rows, eyeball the data, spot anomalies visually, and click to sort and filter. For this, you need a tool that was designed from the ground up for large datasets.
Viztab uses a fundamentally different architecture from Excel's. Instead of loading every cell into memory, it indexes the file on import and renders only the rows currently visible on your screen. The result is a spreadsheet that stays responsive with millions of rows.
Import your data
Open viztab.com/app and drag your CSV, TSV, or XLSX file. Multi-million-row files load in seconds.
Explore interactively
Sort by any column, filter by value or range, search across all data. Results are instant because operations run against the index rather than scanning every row.
Analyze and export
Use any of 370+ formulas on the full dataset. Export your results or filtered subsets as CSV or XLSX.
Key differences from Excel at the million-row scale:
- No row limit. Viztab handles millions of rows without truncation.
- Instant filtering. Filters apply against an index. No waiting for a million-row scan.
- Low memory usage. Only visible rows are rendered. The rest are indexed on disk, not held in RAM.
- Local processing. Your data never leaves your browser. No upload, no server, no account needed.
Strategy 5: Know when to use a database
If you query the same large dataset repeatedly, join it with other tables, or need multiple people to access it, set up a proper database. PostgreSQL, MySQL, and even SQLite handle millions of rows efficiently and give you the full power of SQL with proper indexing.
The trade-off is setup time. Creating a schema, importing data, and writing queries takes longer than dragging a file into a spreadsheet. For one-off analysis, it is usually not worth it. For anything recurring, it pays for itself on the second run.
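For a single analyst, SQLite is the lowest-friction starting point: it ships with Python, needs no server, and a one-time index makes repeated filters cheap. A sketch (the `sales.csv` file and its columns are invented for the demo):

```python
import csv
import sqlite3

# Demo CSV standing in for a real export
with open("sales.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["region", "amount"])
    w.writerows([["Northeast", 100], ["West", 200], ["Northeast", 300]])

con = sqlite3.connect("sales.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount INTEGER)")

# One-time import; every later query reuses the table
with open("sales.csv") as f:
    rows = list(csv.reader(f))[1:]          # skip header
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Index the column you filter on most
con.execute("CREATE INDEX IF NOT EXISTS idx_region ON sales (region)")
con.commit()

total = con.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("Northeast",)
).fetchone()[0]
print(total)  # 400
```

The import cost is paid once; from the second query onward, the indexed table is far faster than re-scanning the CSV.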
Performance tips that apply everywhere
- Use Parquet instead of CSV when possible. Parquet is columnar and compressed. A 1 GB CSV becomes ~150 MB as Parquet, and queries that only need a few columns skip reading the rest entirely. Most modern tools (pandas, Polars, DuckDB, Viztab) read Parquet natively.
- Drop columns you do not need. If your file has 50 columns and you only care about 5, select only those during import. Less data means faster everything.
- Sort on your most-filtered column. If you always filter by date, pre-sort by date. Some tools can skip scanning irrelevant sections of sorted data.
- Compress for storage, decompress for work. CSV files compress at roughly 10:1 with gzip. Keep the compressed copy as your archive.
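The compress-for-storage tip needs nothing beyond the standard library. A sketch with a generated demo file (file names are made up):

```python
import gzip
import os
import shutil

# Write a repetitive demo CSV -- exactly the kind of text gzip shrinks well
with open("data.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * i}\n" for i in range(10_000))

# Archive: stream the file through gzip without holding it all in memory
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Work: decompress back to a plain file when you need it
with gzip.open("data.csv.gz", "rb") as src, open("restored.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

print(os.path.getsize("data.csv"), os.path.getsize("data.csv.gz"))
```

The exact ratio depends on the data; highly repetitive CSV text is where ratios around 10:1 come from.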
Frequently asked questions
How many rows can Excel handle?
Technically, Excel supports up to 1,048,576 rows per worksheet. However, performance degrades significantly beyond 500,000 rows, especially with formulas. Sorting, filtering, and scrolling become sluggish, and any recalculation can freeze the application for extended periods. For data at the million-row mark, a purpose-built tool will be more reliable.
What is the best tool for analyzing a million rows?
For interactive exploration with a visual interface, use Viztab or query the file directly with DuckDB. For statistical analysis, Python with pandas or Polars is the standard. For recurring reports, load the data into a database like PostgreSQL and use SQL. The best choice depends on whether you need a visual interface, programmatic access, or both.
How do I filter a million rows quickly?
In Excel, filtering a million rows is slow because it evaluates every row against your criteria in the UI thread. Tools designed for large data like Viztab or DuckDB index the data or use columnar storage, so filters return results in milliseconds even on millions of rows. If you must use Excel, use Power Query to pre-filter during import.
When should I use a database instead of a spreadsheet?
If you query the same large dataset repeatedly, join it with other tables, or share it across a team, a database is the right choice. But for ad hoc exploration of a file you just received, a spreadsheet interface is faster to start with. Tools like Viztab give you the visual spreadsheet experience with database-level performance for large files.
Your data is big. Your tools should keep up.
Viztab handles million-row spreadsheets the way Excel handles a thousand. No limits, no lag, no upload.
Open Viztab