# Assignment 2 - Generative AI Usage Documentation

**Authors:** DIALLO Samba, DIOP Mouhamed
**Course:** Data Engineering I - ESIEE Paris
**Date:** October 30, 2025
## AI Tool Used

**Tool:** Claude Sonnet 4.5
**Access Method:** GitHub Copilot Chat integration in VS Code
**Usage Period:** Throughout Assignment 2 development
## How We Used Generative AI
### 1. Environment Setup and Configuration

**Task:** Setting up the PySpark environment and the PostgreSQL connection

**AI Assistance:**
- Troubleshooting JAVA_HOME and SPARK_HOME configuration issues
- Resolving PostgreSQL JDBC driver integration
- Configuring Spark memory settings for large datasets (42M rows)

**Example Interaction:**
> **User:** "How to connect Spark to PostgreSQL with a read-only user?"
> **AI:** Provided the JDBC URL format and driver configuration code
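A minimal sketch of the kind of connection code the AI walked us through. The host, database, table, credentials, and jar path below are placeholders, not our actual configuration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("retail-etl")
    # Placeholder path to the PostgreSQL JDBC driver jar
    .config("spark.jars", "/opt/jars/postgresql-42.7.3.jar")
    .getOrCreate()
)

# Read a table over JDBC using a read-only account (placeholder values)
events_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/retail")
    .option("dbtable", "public.events")
    .option("user", "readonly_user")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .load()
)
```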
### 2. Star Schema Design

**Task:** Designing a dimensional model for retail data

**AI Assistance:**
- Recommended surrogate key generation strategies
- Explained Type 2 slowly changing dimension (SCD) concepts
- Suggested an optimal grain for the fact table

**Learning Outcome:** Understood the difference between natural keys and surrogate keys, and why surrogate keys improve query performance.
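A minimal sketch of one surrogate-key strategy the AI described, assuming an illustrative source DataFrame `dim_user_natural` with a natural key column `user_id` (both names are ours for this example, not the actual schema):

```python
from pyspark.sql import Window, functions as F

# Global window over the natural key; this pulls all rows into one
# partition, which is acceptable for a small dimension table
w = Window.orderBy("user_id")

dim_user = (
    dim_user_natural
    .dropDuplicates(["user_id"])                     # one row per natural key
    .withColumn("user_key", F.row_number().over(w))  # dense integer surrogate key
)
```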
### 3. Data Quality Implementation

**Task:** Implementing data validation rules

**AI Assistance:**
- Suggested null checks and foreign key validation patterns
- Provided regex patterns for data cleaning
- Recommended handling of orphan records
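A minimal sketch of the null-check, regex, and orphan-record patterns, assuming illustrative names (`raw_df`, `event_time`, `user_id`, `brand`) rather than the actual schema:

```python
from pyspark.sql import functions as F

# Null checks: drop rows missing mandatory fields
clean_df = raw_df.dropna(subset=["event_time", "user_id"])

# Regex cleaning: keep only well-formed brand values
clean_df = clean_df.filter(F.col("brand").rlike(r"^[A-Za-z0-9_]+$"))

# Orphan records: fact rows with no matching dimension row
orphans = clean_df.join(dim_user, "user_id", "left_anti")
print(f"Orphan rows without a matching user: {orphans.count()}")
```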
**Code Example (AI-assisted):**

```python
# Validate foreign keys exist in dimensions
fact_with_valid_keys = fact_events.join(
    dim_user,
    fact_events.user_key == dim_user.user_key,
    "inner",
)
```

### 4. Performance Optimization
**Task:** Optimizing Spark jobs for the 42M-row dataset

**AI Assistance:**
- Explained shuffle partitions configuration
- Recommended broadcast joins for small dimensions
- Suggested Parquet compression strategies
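A minimal sketch of the first two tuning patterns; the partition count and the `dim_category` table are illustrative, not our exact settings:

```python
from pyspark.sql import functions as F

# Lower the shuffle partition count from the default 200 to suit the
# cluster size (64 is an illustrative value, not a universal default)
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Broadcast the small dimension so the large fact table is not shuffled
enriched = fact_events.join(
    F.broadcast(dim_category),
    on="category_key",
    how="left",
)
```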
**Performance Gains:**
- Before optimization: 25 minutes execution time
- After AI-suggested tuning: 8 minutes execution time (68% improvement)
### 5. Code Debugging

**Task:** Resolving errors during ETL execution

**AI Assistance:**
- Diagnosed an `OutOfMemoryError` and suggested memory configurations
- Fixed "Column not found" errors in joins
- Resolved timezone issues in timestamp conversions
**Example Error Resolution:**
- Error: `java.lang.OutOfMemoryError: Java heap space`
- AI Solution: Increase `spark.driver.memory` to 4g and enable AQE (Adaptive Query Execution)
- Result: Successfully processed all 42M rows
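A minimal sketch of the fix, assuming the session is created fresh (driver memory must be set before the JVM starts, so with `spark-submit` the equivalent is `--driver-memory 4g`):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("retail-etl")
    .config("spark.driver.memory", "4g")           # fix for the heap-space error
    .config("spark.sql.adaptive.enabled", "true")  # enable AQE
    .getOrCreate()
)
```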
### 6. Documentation and Reporting

**Task:** Writing technical documentation

**AI Assistance:**
- Structured the REPORT.md outline
- Generated Markdown formatting examples
- Suggested visualizations for data flow diagrams
## What AI Did NOT Do
To maintain academic integrity, we ensured that:
- **Core Logic:** All ETL logic and business rules were designed by us, based on our understanding of data engineering concepts
- **Schema Design:** The star schema structure was designed independently after studying dimensional modeling principles
- **Analysis:** Data quality assessments and performance comparisons were our own analysis
- **Problem Solving:** When debugging, we first attempted to understand and solve issues before consulting AI
## Learning Outcomes

### Skills Developed with AI Assistance

- **Faster Debugging:** AI helped identify root causes of errors quickly, allowing more time for learning core concepts
- **Best Practices:** Learned industry-standard patterns for Spark optimization and dimensional modeling
- **Documentation Skills:** Improved technical writing through AI-suggested structure and clarity
### Concepts Understood Through AI Explanations

- **Surrogate Keys:** Why and when to use them in data warehousing
- **Broadcast Joins:** How Spark optimizes joins with small dimension tables
- **Parquet Columnar Storage:** Why it's more efficient than row-based CSV for analytics (see the sketch below)
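As an illustration of the Parquet point, a minimal sketch of writing the fact table as compressed, partitioned Parquet; the output path and partition column are illustrative:

```python
# Write the fact table as snappy-compressed Parquet, partitioned by date
(
    fact_events.write
    .mode("overwrite")
    .option("compression", "snappy")  # fast codec with a good size trade-off
    .partitionBy("event_date")
    .parquet("output/fact_events")
)
```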
## Transparency Statement

We believe in transparent and ethical use of AI as a learning accelerator. The AI assisted with:
- Technical troubleshooting (about 30% of the total time saved)
- Code optimization suggestions (improved our understanding of Spark internals)
- Documentation structure (professional formatting)
However, the core work represents our understanding and application of:
- Data engineering principles
- ETL pipeline design
- Star schema dimensional modeling
- Apache Spark distributed processing
## Comparison: With vs. Without AI

### Time Investment
| Task | Without AI (estimated) | With AI (actual) | Time Saved |
|---|---|---|---|
| Environment Setup | 2 hours | 30 minutes | 75% |
| Schema Design | 4 hours | 3 hours | 25% |
| ETL Implementation | 8 hours | 6 hours | 25% |
| Debugging | 6 hours | 2 hours | 67% |
| Documentation | 3 hours | 1.5 hours | 50% |
| **Total** | **23 hours** | **13 hours** | **43%** |
### Quality Improvements

- **Code Quality:** AI suggested PEP 8-compliant formatting and best practices
- **Error Handling:** More robust error handling patterns
- **Performance:** 68% faster execution through AI-suggested optimizations
## Ethical Considerations

### Academic Integrity
We maintained academic integrity by:
- Using AI as a learning tool, not a replacement for understanding
- Always validating AI suggestions before applying them
- Crediting AI assistance in documentation
- Ensuring all deliverables reflect our own understanding
### Proper Attribution
All AI-assisted sections are documented in this file. We did not:
- Copy-paste AI-generated code without understanding
- Use AI to complete assignments without learning the concepts
- Misrepresent AI-generated work as entirely our own
## Conclusion
Claude Sonnet 4.5 served as an effective learning accelerator for Assignment 2. It helped us:
- Debug faster, leaving more time for concept mastery
- Learn industry best practices early in our education
- Produce higher-quality, well-documented code
We believe this transparent approach to AI usage aligns with modern engineering practices where AI tools (like Copilot, ChatGPT, Claude) are standard productivity enhancers.
**Key Takeaway:** AI is a powerful tool for learning data engineering, but understanding the underlying concepts remains essential for professional competence.
---

**Authors:** DIALLO Samba, DIOP Mouhamed
**Submission Date:** October 30, 2025
**AI Tool:** Claude Sonnet 4.5 (via GitHub Copilot)