How to Work with Million-Row Spreadsheets

April 10, 2026 · 8 min read

A million rows used to be a database problem. Now it is a Tuesday afternoon. Transaction logs, ad impressions, sensor readings, user event streams — the files that land on analysts' desks routinely contain millions of rows. And the tools most people reach for first — Excel, Google Sheets — were never designed for this scale.

If you have tried opening a million-row CSV in Excel, you know the outcome: it either truncates your data at row 1,048,576 (silently dropping the rest), freezes for several minutes, or crashes entirely. Google Sheets gives up even earlier, with a 10-million-cell cap that translates to around 500,000 rows for a typical dataset.

This guide covers practical strategies for actually working with million-row data, from quick command-line techniques to tools that handle the scale natively.

Why traditional spreadsheets fail at this scale

The spreadsheet was invented in 1979 for financial modeling. VisiCalc had 254 rows. Even today, the fundamental architecture has not changed much: load every cell into memory, maintain formatting for each one, keep a formula dependency graph, and render the entire grid so users can scroll through it.

This approach works brilliantly for the 10,000-row budget or the 50,000-row sales report. It breaks when you add another zero or two.

Row count      | Excel behavior         | Recommended approach
Under 100K     | Works fine             | Any spreadsheet
100K – 500K    | Slow with formulas     | Excel with manual calc, or Viztab
500K – 1M      | Frequent freezes       | Viztab, Python, or DuckDB
1M – 10M       | Exceeds the row limit  | Viztab, pandas, DuckDB, database
Over 10M       | Cannot open at all     | Database, Polars, Spark

The problem is not just the row limit. Even if your data fits under 1,048,576 rows, Excel loads every cell's formatting, metadata, and formula dependencies into RAM. A million rows with 20 columns means 20 million cell objects. With Excel's per-cell overhead, that is 2–4 GB of memory before you do anything.
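That estimate is easy to sanity-check in Python. The 150-byte per-cell overhead below is an assumed round number for illustration, not a measured Excel constant:

```python
# Rough memory estimate for a spreadsheet that keeps one object per cell.
# The 150-byte per-cell overhead is an assumption for illustration.
rows = 1_000_000
cols = 20
bytes_per_cell = 150  # value + formatting + dependency metadata (assumed)

total_cells = rows * cols
total_gb = total_cells * bytes_per_cell / 1024**3
print(f"{total_cells:,} cells ≈ {total_gb:.1f} GB")  # ≈ 2.8 GB
```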

Strategy 1: Filter before loading

The most effective strategy is also the simplest: do not load all the data. Most analysis questions do not require every row. If you need last quarter's Northeast sales, extract only those rows before opening anything visual.

Command-line pre-filtering

Terminal
# Extract rows where column 5 equals "Northeast" (NR==1 keeps the header)
awk -F',' 'NR==1 || $5 == "Northeast"' transactions.csv > northeast.csv

# Extract rows from Q1 2026
grep "^2026-0[1-3]" transactions.csv > q1_2026.csv

# Get just the header + first 100K rows for exploration
head -n 100001 transactions.csv > sample.csv

# Count rows before deciding your approach
wc -l transactions.csv

These commands stream through the file line by line. They use almost no memory and finish in seconds, even on multi-gigabyte files. They are pre-installed on macOS and Linux, and available on Windows through WSL.

Chunked pre-filtering in Python

If your filtering logic is more complex than a simple text match, pandas can read the file in manageable chunks:

Python
import pandas as pd

# Read 200K rows at a time, keep only what you need
chunks = pd.read_csv('transactions.csv', chunksize=200_000)
filtered = pd.concat(
    chunk[chunk['region'] == 'Northeast'] for chunk in chunks
)
filtered.to_csv('northeast.csv', index=False)
print(f"Reduced to {len(filtered):,} rows")

Pre-filtering with Power Query (Excel)

If you must stay in Excel, use Power Query instead of opening the file directly. Go to Data → Get Data → From Text/CSV, then use the query editor to filter rows and remove columns before loading. Power Query streams the file so Excel never has to hold the full dataset in memory.

Strategy 2: Sample first, then scale up

When you are exploring an unfamiliar dataset, you do not need a million rows to understand its structure. A random 10,000-row sample tells you about columns, data types, value distributions, and quality issues. Build your analysis logic on the sample, then apply it to the full dataset.

Terminal
# Random 1% sample (assuming ~1M rows)
awk 'BEGIN{srand()} NR==1 || rand() < 0.01' data.csv > sample_1pct.csv

# Exact 10,000-row random sample (header + 10K data rows)
# Note: shuf is part of GNU coreutils; on macOS install it with Homebrew
head -1 data.csv > sample.csv
tail -n +2 data.csv | shuf -n 10000 >> sample.csv

This technique is especially valuable for data validation. If your sample reveals encoding issues, malformed rows, or unexpected values, you have saved yourself the time of loading the full file only to discover problems.
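If you are already in Python, pandas can draw the same kind of streaming sample: `skiprows` accepts a callable that decides row by row as the file is read. A minimal sketch, with a synthetic file at the top only to keep the example self-contained:

```python
import csv
import random

import pandas as pd

# Stand-in for a real multi-million-row file (assumption for the demo).
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'region'])
    for i in range(100_000):
        writer.writerow([i, 'Northeast' if i % 4 == 0 else 'West'])

random.seed(42)  # make the sample reproducible

# Keep the header (line 0) and roughly 1% of data rows. The callable is
# evaluated while the file streams in, so the full file never sits in RAM.
sample = pd.read_csv(
    'data.csv',
    skiprows=lambda i: i > 0 and random.random() >= 0.01,
)
print(f"Sampled {len(sample):,} of 100,000 rows")
```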

Strategy 3: Aggregate instead of exploring

Most questions about million-row datasets are aggregate questions: total revenue by region, average response time by endpoint, count of events by category. You need a summary, not every individual row.

DuckDB: SQL directly on CSV files

DuckDB is an embedded analytics database that queries CSV and Parquet files in place — no import step, no server, no setup beyond installing it:

SQL (DuckDB)
-- Query a CSV file directly from the command line
SELECT
    region,
    DATE_TRUNC('month', date) AS month,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS transactions
FROM 'transactions.csv'
GROUP BY region, DATE_TRUNC('month', date)
ORDER BY region, month;

This query runs in seconds on a million-row file. DuckDB uses columnar processing and does not load the full file into memory, so it works on files much larger than your available RAM.

Pivot tables in Python

Python
import pandas as pd

df = pd.read_csv('transactions.csv')

# Revenue breakdown by region and month
pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns=pd.to_datetime(df['date']).dt.to_period('M'),
    aggfunc='sum'
)
print(pivot)

# For very large files, use Polars instead of pandas
import polars as pl

df = pl.scan_csv('transactions.csv')
result = (df.group_by('region')
            .agg(pl.col('revenue').sum())
            .collect())
print(result)

Polars is worth noting here. For files over 1 GB, Polars is significantly faster than pandas because it uses lazy evaluation and processes data in parallel. The API is slightly different but the concepts are the same.

Strategy 4: Use a spreadsheet built for scale

Sometimes you need the spreadsheet experience. You want to scroll through rows, eyeball the data, spot anomalies visually, and click to sort and filter. For this, you need a tool that was designed from the ground up for large datasets.

Viztab uses a fundamentally different architecture than Excel. Instead of loading every cell into memory, it indexes the file on import and renders only the rows currently visible on your screen. The result is a spreadsheet that stays responsive with millions of rows.

1

Import your data

Open viztab.com/app and drag your CSV, TSV, or XLSX file. Multi-million-row files load in seconds.

2

Explore interactively

Sort by any column, filter by value or range, search across all data. Results are instant because they operate on the index, not a linear scan.

3

Analyze and export

Use any of 370+ formulas on the full dataset. Export your results or filtered subsets as CSV or XLSX.


Try Viztab with your data →

Strategy 5: Know when to use a database

If you query the same large dataset repeatedly, join it with other tables, or need multiple people to access it, set up a proper database. PostgreSQL, MySQL, and even SQLite handle millions of rows efficiently and give you the full power of SQL with proper indexing.

The trade-off is setup time. Creating a schema, importing data, and writing queries takes longer than dragging a file into a spreadsheet. For one-off analysis, it is usually not worth it. For anything recurring, it pays for itself on the second run.

Terminal
# Quick import into SQLite (no server needed)
sqlite3 analysis.db
.mode csv
.import transactions.csv transactions
.schema transactions

# Now you can query with SQL
SELECT region, SUM(revenue) FROM transactions GROUP BY region;
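The same import-and-query flow works from Python with nothing but the standard library's sqlite3 module. A sketch with inline stand-in rows; a real workflow would insert rows read from the CSV:

```python
import sqlite3

# Stand-in rows; in practice you would read these from transactions.csv.
rows = [('Northeast', 100.0), ('West', 50.0), ('Northeast', 25.0)]

con = sqlite3.connect(':memory:')  # use a file path to persist the DB
con.execute('CREATE TABLE transactions (region TEXT, revenue REAL)')
con.executemany('INSERT INTO transactions VALUES (?, ?)', rows)

# Same aggregation as the CLI session above.
for region, total in con.execute(
    'SELECT region, SUM(revenue) FROM transactions '
    'GROUP BY region ORDER BY region'
):
    print(region, total)
```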

Performance tips that apply everywhere

Whichever tool you pick, the same habits keep large files fast: filter or sample before loading, since most questions do not need every row; aggregate early, because a summary is cheaper than scrolling through millions of rows; switch Excel to manual calculation when formulas are unavoidable; and prefer columnar tools such as DuckDB or Polars when you scan the same file repeatedly.

Frequently asked questions

Can Excel handle 1 million rows?

Technically, Excel supports up to 1,048,576 rows per worksheet. However, performance degrades significantly beyond 500,000 rows, especially with formulas. Sorting, filtering, and scrolling become sluggish, and any recalculation can freeze the application for extended periods. For data at the million-row mark, a purpose-built tool will be more reliable.

What is the best way to analyze a dataset with more than 1 million rows?

For interactive exploration with a visual interface, use Viztab or query the file directly with DuckDB. For statistical analysis, Python with pandas or Polars is the standard. For recurring reports, load the data into a database like PostgreSQL and use SQL. The best choice depends on whether you need a visual interface, programmatic access, or both.

How do I filter a million-row spreadsheet without it freezing?

In Excel, filtering a million rows is slow because it evaluates every row against your criteria in the UI thread. Tools designed for large data like Viztab or DuckDB index the data or use columnar storage, so filters return results in milliseconds even on millions of rows. If you must use Excel, use Power Query to pre-filter during import.

Should I use a database instead of a spreadsheet for large data?

If you query the same large dataset repeatedly, join it with other tables, or share it across a team, a database is the right choice. But for ad hoc exploration of a file you just received, a spreadsheet interface is faster to start with. Tools like Viztab give you the visual spreadsheet experience with database-level performance for large files.

Your data is big. Your tools should be up to the task.

Viztab handles million-row spreadsheets the way Excel handles a thousand. No limits, no lag, no uploads.

Open Viztab