Field Guide
Gather Datasets for AI and RAG Projects
Collect useful data without losing the source trail.
A practical guide for gathering datasets for AI, RAG, analytics, training, fine-tuning, evaluation, and prototypes while checking permissions, quality, safety, and reuse.
Steps from the workbench
Define why the dataset exists
Name the project goal before collecting anything. A RAG assistant, dashboard, training set, test set, and prototype sample all need different data.
What to check
The dataset purpose, expected users, needed fields, and file types are written down before collection starts.
Identify sources and permissions
List public datasets, documentation sites, APIs, internal exports, PDFs, and manual records. Review licenses, terms of service, robots.txt where relevant, and internal approvals.
What to check
Every source has a URL or owner, how it was accessed, collection date, license or terms note, and a clear yes/no on allowed use.
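One way to keep that checklist honest is to store each source as a small record and filter on it. The sketch below is a minimal example, not a standard schema: the `SourceRecord` class, its field names, and the sample sources are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass
class SourceRecord:
    """One row of a source log. Field names are illustrative, not a standard."""
    name: str
    url_or_owner: str   # URL for public sources, owner for internal ones
    access_method: str  # e.g. "api", "download", "export", "manual"
    collected_on: str   # ISO date string
    license_note: str   # license name or a short terms-of-service note
    allowed_use: bool   # the clear yes/no on allowed use


def missing_fields(record: SourceRecord) -> list[str]:
    """Return the names of text fields that were left blank."""
    return [key for key, value in asdict(record).items()
            if isinstance(value, str) and not value.strip()]


def usable_sources(records: list[SourceRecord]) -> list[SourceRecord]:
    """Keep only sources that are fully documented and approved for use."""
    return [r for r in records if r.allowed_use and not missing_fields(r)]


# Hypothetical sample sources for the example.
records = [
    SourceRecord("product-docs", "https://example.com/docs", "download",
                 date(2024, 5, 1).isoformat(), "CC BY 4.0", True),
    SourceRecord("support-export", "", "export",
                 "2024-05-02", "internal", True),
    SourceRecord("forum-posts", "https://example.com/forum", "scrape",
                 "2024-05-03", "terms forbid reuse", False),
]

for record in usable_sources(records):
    print(json.dumps(asdict(record)))
```

Here the second source is held back because its owner is blank, and the third because its allowed-use answer is no.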
Choose the right collection tool
Use Kaggle for public datasets, APIs for structured data, Scrapy for larger website crawls, Beautiful Soup for parsing a small number of pages, database exports for existing records, and text-extraction tools for documents.
What to check
The tool matches the source size and can be run again later, with usage limits documented.
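For website collection, usage limits can be read from the site's robots.txt before any page is fetched. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt text, the `dataset-collector` user agent, the URLs, and the one-second fallback delay are assumptions for the example, and a real run would load the site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would download the site's file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "dataset-collector"  # assumed name for this example
FALLBACK_DELAY = 1.0              # assumed delay when the site states none


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def collection_plan(parser: RobotFileParser, urls: list[str]) -> dict:
    """Split URLs into allowed and skipped, and record the delay to use."""
    delay = parser.crawl_delay(USER_AGENT)
    return {
        "allowed": [u for u in urls if parser.can_fetch(USER_AGENT, u)],
        "skipped": [u for u in urls if not parser.can_fetch(USER_AGENT, u)],
        "delay_seconds": delay if delay is not None else FALLBACK_DELAY,
    }


plan = collection_plan(load_rules(ROBOTS_TXT), [
    "https://example.com/docs/setup",
    "https://example.com/private/notes",
])
print(plan)
```

Saving the resulting plan next to the collection script is one way to document the limits so the run can be repeated later. robots.txt is only one signal; licenses and terms of service still need their own review.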
Preserve raw data first
Save raw files exactly as collected before cleaning or changing them. An untouched raw copy makes later review and fixes possible.
What to check
The raw dataset is stored separately from cleaned data and includes source notes, collection logs, and file versions.
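A minimal sketch of this step, assuming raw files live in their own folder with a JSON Lines log beside them. The function names, the log file name `collection_log.jsonl`, and the log fields are illustrative choices, not a fixed convention.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw(content: bytes, raw_dir: Path, filename: str, source: str) -> dict:
    """Write raw bytes once, refuse to overwrite, and append a log entry."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / filename
    if target.exists():
        raise FileExistsError(f"{target} already exists; save a new version instead")
    target.write_bytes(content)
    entry = {
        "file": filename,
        "source": source,
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    with (raw_dir / "collection_log.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def is_unchanged(raw_dir: Path, entry: dict) -> bool:
    """Check that a raw file still matches the checksum in its log entry."""
    data = (raw_dir / entry["file"]).read_bytes()
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    entry = save_raw(b"<html>example page</html>", raw_dir,
                     "page-v1.html", "https://example.com/docs")
    print(entry["file"], entry["bytes"], is_unchanged(raw_dir, entry))
```

Cleaned files would then go to a separate folder, so the checksum in the log can always confirm that the raw copy was never edited.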
Clean the data for the project
Make the raw data usable. Remove duplicates, fill or remove blank values, fix dates and numbers, standardize text, remove bad rows, and mask sensitive details like emails, phone numbers, passwords, API keys, and addresses.
What to check
Compared with the raw data, the cleaned data has fewer duplicates and blanks, consistent formats, no obvious sensitive details, and enough useful content for the project.
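The cleaning steps above can be sketched with the standard library alone. This is a rough example under stated assumptions: the regular expressions are deliberately simple and will miss some emails and phone numbers while over-matching other digit runs such as timestamps, the nine-digit threshold is an arbitrary choice, and the row fields are hypothetical. Masked output still needs a human spot check.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
MIN_PHONE_DIGITS = 9  # assumed threshold so short digit runs are left alone


def _mask_phone(match: re.Match) -> str:
    """Mask a candidate only if it contains enough digits to look like a phone."""
    digits = sum(ch.isdigit() for ch in match.group())
    return "[PHONE]" if digits >= MIN_PHONE_DIGITS else match.group()


def mask_sensitive(text: str) -> str:
    """Replace emails and likely phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub(_mask_phone, text)


def clean_rows(rows: list[dict], required: list[str],
               mask_fields: list[str]) -> list[dict]:
    """Normalize whitespace, mask free-text fields, drop blanks and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        norm = {k: " ".join(str(v).split()) for k, v in row.items() if v is not None}
        for field in mask_fields:
            if field in norm:
                norm[field] = mask_sensitive(norm[field])
        if any(not norm.get(field) for field in required):
            continue  # a required value is blank or missing
        key = tuple(sorted((k, v.lower()) for k, v in norm.items()))
        if key in seen:
            continue  # same content as an earlier row, ignoring case
        seen.add(key)
        cleaned.append(norm)
    return cleaned


rows = [
    {"id": "1", "note": "Call  +1 (555) 010-9999 or mail a@b.com"},
    {"id": "1", "note": "call +1 (555) 010-9999 or mail a@b.com"},
    {"id": "2", "note": ""},
    {"id": "3", "note": None},
]
print(clean_rows(rows, required=["id", "note"], mask_fields=["note"]))
```

Masking is limited to the fields named in `mask_fields` so that date and number columns are not altered by the phone pattern.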
Prepare long text for RAG
For RAG projects, remove web page clutter such as headers, footers, and repeated navigation text, repair broken PDF text, and drop outdated pages and documents with no source link. Split long documents into smaller useful pieces.
What to check
Each text piece has useful content, a clear source link or file path, and enough context to make sense on its own.
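A minimal splitting sketch, assuming fixed-size word windows with overlap. The piece size, the overlap, and the field names are placeholders to tune per project, and splitting on headings or paragraphs may preserve context better than counting words.

```python
def split_into_chunks(text: str, source: str,
                      size: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows that each carry their source."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        chunks.append({
            "id": f"{source}#{len(chunks)}",  # illustrative ID format
            "source": source,                 # link or file path for the piece
            "start_word": start,
            "text": " ".join(piece),
        })
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical document: ten numbered words, tiny windows for readability.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_into_chunks(sample, "docs/setup.md", size=4, overlap=1):
    print(chunk["id"], "|", chunk["text"])
```

The overlap repeats the tail of one piece at the head of the next, which is one simple way to give each piece enough context to make sense on its own.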
Validate and version the dataset
Check missing values, repeated records, broken source links, sensitive data, text quality, file types, and known limitations. Save major updates as new versions so you know what changed.
What to check
The dataset has moved through each stage in order: raw data, clean data, checked data, usable dataset.
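The checks and versioning can be sketched as a small report plus a version entry. This example assumes records are dictionaries with a `source` field; the report keys, the version label, and the sample records are illustrative, and broken-link or text-quality checks would need to be added separately.

```python
import hashlib
import json


def validate(records: list[dict], required: list[str]) -> dict:
    """Count records with missing required values, no source, or exact repeats."""
    report = {"total": len(records), "missing": 0, "duplicates": 0, "no_source": 0}
    seen = set()
    for rec in records:
        if any(not str(rec.get(field) or "").strip() for field in required):
            report["missing"] += 1
        if not str(rec.get("source") or "").strip():
            report["no_source"] += 1
        fingerprint = json.dumps(rec, sort_keys=True)
        if fingerprint in seen:
            report["duplicates"] += 1
        seen.add(fingerprint)
    return report


def version_entry(records: list[dict], version: str, notes: str) -> dict:
    """Describe one dataset version with a content hash and a change note."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return {
        "version": version,
        "records": len(records),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "notes": notes,
    }


# Hypothetical records for the example.
records = [
    {"text": "Install the tool.", "source": "docs/setup.md"},
    {"text": "Install the tool.", "source": "docs/setup.md"},
    {"text": "", "source": ""},
]
print(validate(records, required=["text"]))
print(version_entry(records, "v1", "first checked release"))
```

Because the hash changes whenever the content changes, comparing two version entries shows whether an update actually altered the data, and the notes field records what changed and why.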
Prompts to adapt
Plan a dataset
Create a dataset collection plan for [project]. Include the goal, use case, needed fields, possible sources, recommended tools, permission checks, storage plan, and quality checks.
Audit legality and quality
Review this dataset source list for AI or RAG use. Check license, terms, robots.txt concerns, sensitive data risk, clear source links, missing source notes, and quality risks.
Plan RAG document splitting
Design a RAG document-splitting plan for these documents. Recommend piece size, overlap, split method, required source notes, source fields, duplicate checks, and review steps.
Create a cleaning checklist
Create a simple data cleaning checklist for this dataset. Include duplicates, missing values, dates, numbers, text cleanup, bad rows, sensitive data, source links, and final review steps.
Create dataset source notes
Create a dataset source-notes template with dataset name, project, purpose, source, tool used, owner, dates, license, record count, file type, storage location, known issues, and approval status.
