How to work with million-row spreadsheets

April 10, 2026 · 8 min read

A million rows used to be a database problem. Now it is a Tuesday afternoon. Transaction logs, ad impressions, sensor readings, user event streams — the files that land on analysts' desks routinely contain millions of rows. And the tools most people reach for first — Excel, Google Sheets — were never designed for this scale.

If you have tried opening a million-row CSV in Excel, you know the outcome: it either truncates your data at row 1,048,576 (silently dropping the rest), freezes for several minutes, or crashes entirely. Google Sheets gives up even earlier, with a 10-million-cell cap that translates to around 500,000 rows for a typical dataset.

This guide covers practical strategies for actually working with million-row data, from quick command-line techniques to tools that handle the scale natively.

Why traditional spreadsheets fail at this scale

The spreadsheet was invented in 1979 for financial modeling. VisiCalc had 254 rows. Even today, the fundamental architecture has not changed much: load every cell into memory, maintain formatting for each one, keep a formula dependency graph, and render the entire grid so users can scroll through it.

This approach works brilliantly for the 10,000-row budget or the 50,000-row sales report. It breaks when you add another zero or two.

Row Count | Excel Behavior | Recommended Approach
Under 100K | Works fine | Any spreadsheet
100K – 500K | Sluggish with formulas | Excel with manual calc, or Viztab
500K – 1M | Frequent freezes | Viztab, Python, or DuckDB
1M – 10M | Exceeds row limit | Viztab, pandas, DuckDB, database
Over 10M | Cannot open | Database, Polars, Spark

The problem is not just the row limit. Even if your data fits under 1,048,576 rows, Excel loads every cell's formatting, metadata, and formula dependencies into RAM. A million rows with 20 columns means 20 million cell objects. With Excel's per-cell overhead, that is 2–4 GB of memory before you do anything.
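That back-of-envelope estimate is easy to reproduce. A quick sketch, assuming roughly 100–200 bytes of overhead per cell — the exact figure is an assumption and varies with Excel version, formatting, and cell contents:

```python
# Back-of-envelope memory estimate for a cell-based spreadsheet.
# The 100-200 bytes of per-cell overhead is an assumption; the real
# figure varies with version, formatting, and cell contents.
rows, cols = 1_000_000, 20
cells = rows * cols  # 20 million cell objects

for bytes_per_cell in (100, 200):
    gb = cells * bytes_per_cell / 1024**3
    print(f"{bytes_per_cell} B/cell -> {gb:.1f} GB")  # roughly the 2-4 GB range
```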

Strategy 1: Filter before you load

The most effective strategy is also the simplest: do not load all the data. Most analysis questions do not require every row. If you need last quarter's Northeast sales, extract only those rows before opening anything visual.

Command-line pre-filtering

Terminal
# Extract rows where column 5 equals "Northeast"
awk -F',' '$5 == "Northeast"' transactions.csv > northeast.csv

# Extract rows from Q1 2026
grep "^2026-0[1-3]" transactions.csv > q1_2026.csv

# Get just the header + first 100K rows for exploration
head -n 100001 transactions.csv > sample.csv

# Count rows before deciding your approach
wc -l transactions.csv

These commands stream through the file line by line. They use almost no memory and finish in seconds, even on multi-gigabyte files. They are pre-installed on macOS and Linux, and available on Windows through WSL.

Python pre-filtering with chunks

If your filtering logic is more complex than a simple text match, pandas can read the file in manageable chunks:

Python
import pandas as pd

# Read 200K rows at a time, keep only what you need
chunks = pd.read_csv('transactions.csv', chunksize=200_000)
filtered = pd.concat(
    chunk[chunk['region'] == 'Northeast'] for chunk in chunks
)
filtered.to_csv('northeast.csv', index=False)
print(f"Reduced to {len(filtered):,} rows")

Power Query pre-filtering (Excel)

If you must stay in Excel, use Power Query instead of opening the file directly. Go to Data → Get Data → From Text/CSV, then use the query editor to filter rows and remove columns before loading. Power Query streams the file so Excel never has to hold the full dataset in memory.
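The same trim-before-loading idea works in pandas: pass only the columns you need to `usecols` and the parser never materializes the rest. A minimal sketch with illustrative column names and a tiny inline file standing in for the real CSV:

```python
import io
import pandas as pd

# Tiny inline stand-in for a wide transactions file
# (column names are illustrative)
csv_file = io.StringIO(
    "date,region,revenue,sku,notes\n"
    "2026-01-05,Northeast,120.0,A1,ok\n"
    "2026-01-06,West,80.5,B2,ok\n"
)

# Read only the columns you need; the others are never materialized.
# A categorical dtype also shrinks columns with few distinct values.
df = pd.read_csv(
    csv_file,
    usecols=['date', 'region', 'revenue'],
    dtype={'region': 'category'},
)
print(list(df.columns))  # ['date', 'region', 'revenue']
```

Dropping unused columns at read time often cuts memory more than row filtering does, since every retained column is paid for on every row.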

Strategy 2: Sample first, then scale up

When you are exploring an unfamiliar dataset, you do not need a million rows to understand its structure. A random 10,000-row sample tells you about columns, data types, value distributions, and quality issues. Build your analysis logic on the sample, then apply it to the full dataset.

Terminal
# Random 1% sample (assuming ~1M rows)
awk 'BEGIN{srand()} NR==1 || rand() < 0.01' data.csv > sample_1pct.csv

# Exact 10,000-row random sample (header + 10K data rows)
head -1 data.csv > sample.csv
tail -n +2 data.csv | shuf -n 10000 >> sample.csv

This technique is especially valuable for data validation. If your sample reveals encoding issues, malformed rows, or unexpected values, you have saved yourself the time of loading the full file only to discover problems.
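Both steps — sample, then sanity-check — can happen in one pass with pandas, using a callable `skiprows` so the full file is never loaded. A sketch with a small inline stand-in file; the 5% rate and column names are illustrative:

```python
import io
import random
import pandas as pd

# Small inline stand-in for a large CSV (real files: millions of rows)
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# Callable skiprows: keep the header (row 0) plus ~5% of data rows at
# random, sampled while reading rather than after loading everything
random.seed(0)
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and random.random() >= 0.05,
)

# Quick quality checks on the sample before committing to the full file
print(sample.dtypes)
print(sample.isna().mean())  # share of missing values per column
```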

Strategy 3: Aggregate instead of scrolling

Most questions about million-row datasets are aggregate questions: total revenue by region, average response time by endpoint, count of events by category. You need a summary, not every individual row.

DuckDB: SQL directly on CSV files

DuckDB is an embedded analytics database that queries CSV and Parquet files in place — no import step, no server, no setup beyond installing it:

SQL (DuckDB)
-- Query a CSV file directly from the command line
SELECT
    region,
    DATE_TRUNC('month', date) AS month,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS transactions
FROM 'transactions.csv'
GROUP BY region, DATE_TRUNC('month', date)
ORDER BY region, month;

This query runs in seconds on a million-row file. DuckDB uses columnar processing and does not load the full file into memory, so it works on files much larger than your available RAM.

Python pivot tables

Python
import pandas as pd

df = pd.read_csv('transactions.csv')

# Revenue breakdown by region and month
pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns=pd.to_datetime(df['date']).dt.to_period('M'),
    aggfunc='sum'
)
print(pivot)

# For very large files, use Polars instead of pandas
import polars as pl

df = pl.scan_csv('transactions.csv')
result = (df.group_by('region')
            .agg(pl.col('revenue').sum())
            .collect())
print(result)

Polars is worth noting here. For files over 1 GB, Polars is significantly faster than pandas because it uses lazy evaluation and processes data in parallel. The API is slightly different but the concepts are the same.

Strategy 4: Use a spreadsheet built for scale

Sometimes you need the spreadsheet experience. You want to scroll through rows, eyeball the data, spot anomalies visually, and click to sort and filter. For this, you need a tool that was designed from the ground up for large datasets.

Viztab uses a fundamentally different architecture than Excel. Instead of loading every cell into memory, it indexes the file on import and renders only the rows currently visible on your screen. The result is a spreadsheet that stays responsive with millions of rows.

1. Import your data

Open viztab.com/app and drag your CSV, TSV, or XLSX file. Multi-million-row files load in seconds.

2. Explore interactively

Sort by any column, filter by value or range, search across all data. Results are instant because operations run against the index rather than scanning every row.

3. Analyze and export

Use any of 370+ formulas on the full dataset. Export your results or filtered subsets as CSV or XLSX.


Try Viztab with your data →

Strategy 5: Know when to use a database

If you query the same large dataset repeatedly, join it with other tables, or need multiple people to access it, set up a proper database. PostgreSQL, MySQL, and even SQLite handle millions of rows efficiently and give you the full power of SQL with proper indexing.

The trade-off is setup time. Creating a schema, importing data, and writing queries takes longer than dragging a file into a spreadsheet. For one-off analysis, it is usually not worth it. For anything recurring, it pays for itself on the second run.

Terminal
# Quick import into SQLite (no server needed)
sqlite3 analysis.db
.mode csv
.import transactions.csv transactions
.schema transactions

-- Now you can query with SQL
SELECT region, SUM(revenue) FROM transactions GROUP BY region;
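The same workflow is available from Python's standard library via the built-in sqlite3 module. A minimal sketch with illustrative schema and data; in practice you would load the rows from the CSV instead of hard-coding them:

```python
import sqlite3

# Illustrative schema and data; in practice you would .import the CSV
# (as above) or insert rows with executemany from a csv.reader
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE transactions (region TEXT, revenue REAL)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [('Northeast', 120.0), ('Northeast', 80.0), ('West', 50.0)],
)

# Aggregate with plain SQL
rows = con.execute(
    "SELECT region, SUM(revenue) FROM transactions "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('Northeast', 200.0), ('West', 50.0)]
```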


Frequently asked questions

Can Excel handle 1 million rows?

Technically, Excel supports up to 1,048,576 rows per worksheet. However, performance degrades significantly beyond 500,000 rows, especially with formulas. Sorting, filtering, and scrolling become sluggish, and any recalculation can freeze the application for extended periods. For data at the million-row mark, a purpose-built tool will be more reliable.

What is the best way to analyze a dataset with more than 1 million rows?

For interactive exploration with a visual interface, use Viztab or query the file directly with DuckDB. For statistical analysis, Python with pandas or Polars is the standard. For recurring reports, load the data into a database like PostgreSQL and use SQL. The best choice depends on whether you need a visual interface, programmatic access, or both.

How do I filter a million-row spreadsheet without it freezing?

In Excel, filtering a million rows is slow because it evaluates every row against your criteria in the UI thread. Tools designed for large data like Viztab or DuckDB index the data or use columnar storage, so filters return results in milliseconds even on millions of rows. If you must use Excel, use Power Query to pre-filter during import.

Should I use a database instead of a spreadsheet for large data?

If you query the same large dataset repeatedly, join it with other tables, or share it across a team, a database is the right choice. But for ad hoc exploration of a file you just received, a spreadsheet interface is faster to start with. Tools like Viztab give you the visual spreadsheet experience with database-level performance for large files.

Your data is big. Your tools should keep up.

Viztab handles million-row spreadsheets the way Excel handles a thousand rows. No limits, no lag, no uploads.

Open Viztab