RDD_vs_DataFrame_NOTE.md
# DE1 Lab 1: PySpark Warmup and Reading Plans
Student: Samba DIALLO | Date: 2025-10-23 | AI Assistant: Claude Haiku 4.5
## Overview
This lab compares the RDD API and the DataFrame API in Apache Spark through three implementations (the two word-count pipelines are sketched below):
- Top-N word count with the RDD API (tokenization, `reduceByKey`, collect)
- Top-N word count with the DataFrame API (`explode`, `groupBy`, `agg`)
- Projection optimization experiment (`select(*)` vs `select(col1, col2)`)
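A minimal sketch of the two pipelines, assuming the dataset path from the lab and a naive whitespace tokenizer (the notebook's actual cleaning rules may differ, and it sorts and collects where `takeOrdered` is used here as an equivalent top-N step):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("de1-lab1").getOrCreate()
sc = spark.sparkContext

# RDD pipeline: tokenize, count with reduceByKey, then take the top 10.
counts_rdd = (
    sc.textFile("data/lab1_dataset_a.csv")         # dataset path from the lab
      .flatMap(lambda line: line.lower().split())  # naive whitespace tokenizer
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
top10_rdd = counts_rdd.takeOrdered(10, key=lambda kv: -kv[1])

# DataFrame pipeline: explode tokens, groupBy/agg, order, limit.
top10_df = (
    spark.read.text("data/lab1_dataset_a.csv")
         .select(F.explode(F.split(F.lower("value"), r"\s+")).alias("word"))
         .where(F.col("word") != "")
         .groupBy("word")
         .agg(F.count("*").alias("count"))
         .orderBy(F.desc("count"))
         .limit(10)
)
top10_df.show()
```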
## How Claude Haiku 4.5 Assisted
### 1. Setup & Environment Issues
- Problem: `FileNotFoundError` for the SPARK_HOME path (`/path/to/spark-4.0.0-bin-hadoop3`)
- AI Solution: Identified the placeholder path and provided a conda-based PySpark configuration (see the sketch below)
- Result: Verified the Spark 4.0.1 installation and fixed the SPARK_HOME environment variable
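One common fix when SPARK_HOME points at a non-existent manual download is to let the conda-installed `pyspark` package supply it, since the package bundles its own `bin/` and `jars/`. A sketch (illustrative, not the notebook's exact cell):

```python
import os
import pyspark

# With pip/conda-installed PySpark, SPARK_HOME can point at the package
# directory itself instead of a separately downloaded Spark distribution.
os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("de1-lab1").getOrCreate()
print(spark.version)  # expected: 4.0.1 in this lab's environment
```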
2. CSV Path Resolution
- Problem: Spark couldn’t find
data/lab1_dataset_a.csv - AI Solution: Provided two options (move files or use absolute paths)
- Result: Lab executes successfully with correct file paths
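The absolute-path option looks roughly like this, reusing the `spark` session from the Overview sketch (the `header`/`inferSchema` options are assumptions about the dataset):

```python
from pathlib import Path

# Build an absolute path so the read works regardless of where the
# notebook kernel's working directory happens to be.
data_path = (Path.cwd() / "data" / "lab1_dataset_a.csv").resolve()
df = spark.read.csv(str(data_path), header=True, inferSchema=True)
df.show(5)
```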
### 3. Metrics Extraction from Spark UI
- Problem: Unclear how to extract the Job ID, Stage ID, and shuffle bytes from the Spark UI
- AI Solution: Mapped the official rubric columns to their locations in the UI (a programmatic cross-check is sketched below)
- Result: Correctly populated `lab1_metrics_log.csv` with 14 fields
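The numbers the UI shows can also be cross-checked via Spark's monitoring REST API. A sketch, assuming a local session on the default UI port (the 14 CSV fields follow the rubric and are not reproduced here):

```python
import requests  # assumes the Spark UI is reachable while the session runs

app_id = spark.sparkContext.applicationId
ui = "http://localhost:4040"  # Spark's default UI address for a local session

# /stages mirrors the Stages tab, including shuffle write bytes/records.
for stage in requests.get(f"{ui}/api/v1/applications/{app_id}/stages").json():
    print(stage["stageId"], stage["status"],
          stage["shuffleWriteBytes"], stage["shuffleWriteRecords"])
```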
### 4. Comparative Analysis
- Problem: How to interpret the Case A vs Case B results
- AI Solution: Provided Python code for the shuffle-reduction and elapsed-time ratios (sketched below)
- Result: Generated an automated comparative analysis
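A sketch of that comparison logic, fed with the Case A/B numbers from the findings table below:

```python
def compare_cases(shuffle_a_bytes, shuffle_b_bytes, time_a_ms, time_b_ms):
    """Shuffle reduction (%) and elapsed-time ratio between two runs."""
    reduction = 100.0 * (shuffle_a_bytes - shuffle_b_bytes) / shuffle_a_bytes
    return {
        "shuffle_reduction_pct": reduction,
        "time_ratio_b_over_a": time_b_ms / time_a_ms,
    }

print(compare_cases(560, 560, 25, 30))
# {'shuffle_reduction_pct': 0.0, 'time_ratio_b_over_a': 1.2}
```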
### 5. Query Plan Evidence
- Problem: Needed to save the Spark physical execution plans as evidence
- AI Solution: Provided extraction code with timestamps (sketched below)
- Result: Created `proof/plan_rdd.txt` and `proof/plan_df.txt`
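A sketch of that extraction, reusing `counts_rdd` and `top10_df` from the Overview sketch (file names match the deliverables; the timestamp format is an assumption). `explain()` prints to stdout, so it is captured; the RDD side uses `toDebugString()`, which returns the lineage as bytes:

```python
import io
from contextlib import redirect_stdout
from datetime import datetime
from pathlib import Path

Path("proof").mkdir(exist_ok=True)
stamp = datetime.now().isoformat()

# DataFrame physical plan: capture the printed explain() output.
buf = io.StringIO()
with redirect_stdout(buf):
    top10_df.explain(mode="formatted")
Path("proof/plan_df.txt").write_text(f"# generated {stamp}\n{buf.getvalue()}")

# RDD lineage: toDebugString() returns the dependency chain directly.
rdd_plan = counts_rdd.toDebugString().decode("utf-8")
Path("proof/plan_rdd.txt").write_text(f"# generated {stamp}\n{rdd_plan}")
```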
### 6. Git & Documentation
- Problem: How to structure multi-part deliverables
- AI Solution: Suggested semantic commit messages and a file organization scheme
- Result: Clean commit history with proof artifacts
## Key Lab Findings
| Metric | Case A (`select *`) | Case B (`select category, value`) |
|---|---|---|
| Shuffle Write | 560 B / 8 records | 560 B / 8 records |
| Elapsed Time | 25 ms | 30 ms |
| Stages | 1/1 (2 skipped) | 1/1 (1 skipped) |
| Locality | PROCESS_LOCAL | PROCESS_LOCAL |
Observation: Both cases produced identical shuffle write sizes because the Catalyst optimizer prunes unused columns before the shuffle either way; the explicit projection in Case B adds nothing the optimizer had not already done, and the 5 ms elapsed-time gap is within run-to-run noise at this data size.
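One way to see this is to compare the `ReadSchema` of the `FileScan` node in both physical plans: it should list the same pruned columns in both cases. A sketch, where the `groupBy`/`sum` aggregation is an illustrative guess at the lab's query (column names taken from the Case B header):

```python
from pyspark.sql import functions as F

base = spark.read.csv("data/lab1_dataset_a.csv", header=True, inferSchema=True)

case_a = base.select("*").groupBy("category").agg(F.sum("value"))
case_b = base.select("category", "value").groupBy("category").agg(F.sum("value"))

case_a.explain(mode="formatted")  # inspect ReadSchema in the FileScan node
case_b.explain(mode="formatted")  # same pruned ReadSchema -> same shuffle size
```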
## Deliverables
- `DE1_Lab1_Notebook_EN.ipynb`: Complete notebook with 5 sections
- `outputs/top10_rdd.csv`: RDD-based word counts
- `outputs/top10_df.csv`: DataFrame-based word counts
- `outputs/lab1_metrics_log.csv`: Spark UI metrics (14 fields)
- `proof/plan_rdd.txt`: RDD physical execution plan
- `proof/plan_df.txt`: DataFrame physical execution plan
## Conclusion
Claude Haiku 4.5 accelerated the lab by:
- Diagnosing environment setup issues immediately
- Explaining Spark UI metric extraction (critical for performance analysis)
- Automating comparative analysis and result interpretation
- Ensuring deliverables matched official rubric expectations
Lab Goal Achieved: Demonstrated competency in both RDD and DataFrame APIs with objective performance metrics from Spark UI.
## References
- Labs Annex: Spark User Interface Guide (EN)
- Official Rubric: DE1 Lab1 Rubric EN
- Spark Version: 4.0.1 | Python: 3.10.18 | Env: conda `de1-env`