23 May 2026 · 1 min read

DE2 Lab 2: Text Processing - Inverted Index with GitHub Archive

Objective

Build an inverted index from real GitHub Archive events to support fast term lookups. Compare Parquet and CSV as storage formats for the index, measuring query latency and disk footprint.

Architecture

Data Source: GitHub Archive (sample_archive_github.json)

Real public GitHub events transformed into text documents
Each event becomes a document: event_type + repo_name + actor_login
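
As a minimal sketch of the event-to-document step, the snippet below concatenates the three fields into one string. The field names (`type`, `repo.name`, `actor.login`) are assumed to follow the usual GitHub Archive event layout, and both the sample event and the helper `event_to_document` are hypothetical, not taken from the lab code.

```python
import json


def event_to_document(event: dict) -> str:
    """Join event type, repository name, and actor login into one text document.

    Field names are an assumption about the GitHub Archive schema.
    Missing fields are skipped rather than raising.
    """
    event_type = event.get("type", "")
    repo_name = (event.get("repo") or {}).get("name", "")
    actor_login = (event.get("actor") or {}).get("login", "")
    return " ".join(part for part in (event_type, repo_name, actor_login) if part)


# Hypothetical event, shaped like one JSON line of the archive file.
sample_line = (
    '{"id": "1", "type": "PushEvent", '
    '"repo": {"name": "octocat/hello-world"}, '
    '"actor": {"login": "octocat"}}'
)
document = event_to_document(json.loads(sample_line))
print(document)
```

Keeping the nested lookups tolerant of missing keys means a malformed event yields a shorter document instead of aborting the whole load.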

Pipeline:

Ingestion: Load GitHub events with nested schema (repo, actor)
Transformation: Extract text content from events
Normalization: Lowercasing, tokenization, stop-word removal
Index Building: Group by token, collect document IDs, count frequency
Persistence: Write to Parquet (columnar) and CSV (text)
Query Testing: Measure latency for term lookups
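
The normalization and index-building steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lab's implementation: the stop-word list is a placeholder, tokens are taken to be runs of ASCII letters and digits, and "frequency" is interpreted as the total number of occurrences of a token across all documents (counting matching documents instead would be an equally plausible reading).

```python
import re
from collections import defaultdict

# Placeholder stop-word list (assumption); the lab may use a different one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(documents: dict[str, str]) -> dict[str, dict]:
    """Group by token, collect the document IDs, and count occurrences."""
    postings = defaultdict(set)
    frequency = defaultdict(int)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
            frequency[token] += 1
    return {
        token: {"doc_ids": sorted(postings[token]), "frequency": frequency[token]}
        for token in sorted(postings)
    }


# Hypothetical documents in the "event type + repo name + actor login" shape.
docs = {
    "d1": "PushEvent octocat/hello-world octocat",
    "d2": "IssuesEvent octocat/spoon-knife hubot",
}
index = build_inverted_index(docs)
print(index["octocat"])
```

A term lookup is then a single dictionary access, for example `index.get("hubot")`, which is the operation the query-testing step would time.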

Data: GitHub Archive (real public events)

Documents: GitHub events (PushEvent, IssuesEvent, etc.)
Content: event type + repository name + actor login
Tokens: Individual words after normalization
Index: token → [doc_ids], frequency

Outputs

outputs/lab2/inverted_index/ - Parquet format (compressed, columnar)
outputs/lab2/inverted_index_csv/ - CSV format (text, universal)
proof/plan_index_build.txt - Execution plan for index construction
proof/plan_query.txt - Execution plan for term lookup
lab2_metrics_log.csv - Query latencies and storage metrics

Key Metrics

Unique terms in index
Query latency (Parquet vs CSV)
Disk footprint comparison
Compression ratio
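
One possible way to collect these metrics is sketched below using only the standard library. The helper names are hypothetical, the compression ratio is assumed to mean CSV size divided by Parquet size, and latency is summarized as a median over a few repeats; the lab may define any of these differently.

```python
import os
import tempfile
import time


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all files under a directory (an output folder holds many part files)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compression_ratio(csv_bytes: int, parquet_bytes: int) -> float:
    """Assumed definition: CSV footprint divided by Parquet footprint."""
    if parquet_bytes <= 0:
        raise ValueError("parquet_bytes must be positive")
    return csv_bytes / parquet_bytes


def median_latency_ms(lookup, term: str, repeats: int = 5) -> float:
    """Time repeated calls of lookup(term) and return the median in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        lookup(term)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]


# Demo on a throwaway directory with two files of known size.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-0"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(tmp, "part-1"), "wb") as f:
        f.write(b"x" * 200)
    demo_size = directory_size_bytes(tmp)

demo_ratio = compression_ratio(400, 100)
demo_latency = median_latency_ms({"octocat": ["d1"]}.get, "octocat")
print(demo_size, demo_ratio)
```

In the lab itself, `directory_size_bytes` would be pointed at `outputs/lab2/inverted_index/` and `outputs/lab2/inverted_index_csv/`, and the results appended to `lab2_metrics_log.csv`.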
