
Dataset Integrity

We sell high-quality datasets verified by agentic AI & DAO voting; each purchase also mints a license NFT.

Problem Statement

THE PROBLEM WE SOLVE

• It's hard to trust data. You don't know if it's clean, current, legal to use, or fit for your ML use case.
• Marketplaces rarely show HOW data was checked. Buyers end up re-auditing after purchase.
• Licenses are messy. Can I use it for research? For commercial work? For how many seats? Do I get updates?
• Publishers don't get rewarded for quality. There's no shared standard or fast way to prove quality and earn more.

Impact:
• Buyers waste time verifying basics instead of building.
• Publishers waste time answering the same quality questions for every lead.

WHO THIS IS FOR

• AI teams (training/eval datasets across text, image, audio).
• Fintech/quant teams (prices, events, alt-data).
• Geo/weather users (tiles, forecasts, sensor streams).
• Healthcare researchers (de-identified data).
• Data publishers who want a clear, repeatable, trusted path to monetization.

WHAT DIP (DATASET INTEGRITY PROTOCOL) ACTUALLY IS

• A quality-first marketplace: every dataset version must pass AI checks and community voting.
• A simple access layer: after purchase, you sign a short message and receive a time-bound link.
• A public rulebook: a DAO sets the standards (thresholds, weights, verifiers, fee splits) so the system stays fair and transparent over time.
• A minimal on-chain license: you get a portable receipt of your rights. The data and quality signals remain off-chain, private, and fast.

THE AI PLATFORM WE BUILT

We built a full dataset analysis system that powers DIP's quality checks.

System overview:
• Fetch.ai's uAgents coordinate autonomous validation tasks.
• The ASI:One Extended LLM provides expert analysis, summaries, and recommendations.
• A FastAPI service exposes clean REST endpoints.
• A multi-tool pipeline scores quality, ML readiness, and compliance.

Architecture & components:
• Validation API server (FastAPI): validation_api.py
• Fetch.ai agents: Orchestrator, Enhanced Validation Agent, Legal Compliance Agent
• ASI:One LLM: the "ASI:One Extended" model (64K context) for deep reasoning
• Multi-tool analysis pipeline: integrity, statistics, ML readiness, compliance, synthesis

Agents (what they do):
• Orchestrator Agent: coordinates the full validation run end-to-end.
• Enhanced Validation Agent: runs data quality checks (missingness, duplicates, schema, outliers, correlations, baseline ML tests, etc.).
• Legal Compliance Agent: scans for PII, checks regulatory risk (GDPR/CCPA style), and suggests governance actions.

BUYER JOURNEY (WHAT YOU DO)

• Browse listings. Each card shows a badge ("Verified" / "Needs Review" / "Rejected"), a Quality Score (e.g., 92/100), and a short report (e.g., "99.1% valid rows; 0.08% outliers; 0 PII").
• Open a listing to see the sample preview, schema, license summary, and changelog.
• Pay with a stable token (e.g., USDC/LSDC).
• Receive a portable license receipt.
• Click "Access" → sign a short message → receive a time-bound link (a minimal sketch of this step follows below).
• If a new version comes out, you'll see its score and changelog so you can decide whether to upgrade.

Why this is better:
• You spend your time using trusted data, not re-auditing it.
• Scores, notes, and risks are clear up front.
• Your rights are portable; your data stays private and fast off-chain.
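To make the "Access" step concrete, here is a minimal sketch of how the access service could verify the signed message and issue a time-bound gateway link. The HMAC-expiry scheme and the names SECRET, recover_signer, and issue_timebound_link are illustrative assumptions, not our exact implementation:

```python
# Minimal sketch of the access layer: verify the buyer's signature, then
# issue a short-lived gateway link. The HMAC-based expiry scheme and the
# names below are illustrative assumptions, not production code.
import hashlib
import hmac
import time

from eth_account import Account
from eth_account.messages import encode_defunct

SECRET = b"server-side-secret"  # kept off-chain on the access service

def recover_signer(message: str, signature: str) -> str:
    """Recover the wallet address that signed the short access message."""
    return Account.recover_message(encode_defunct(text=message), signature=signature)

def issue_timebound_link(cid: str, wallet: str, ttl_s: int = 900) -> str:
    """Return a gateway URL carrying an expiry timestamp and an HMAC tag
    that the service re-checks before serving the file."""
    expires = int(time.time()) + ttl_s
    tag = hmac.new(SECRET, f"{cid}:{wallet}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"https://gateway.lighthouse.storage/ipfs/{cid}?exp={expires}&sig={tag}"
```

After recovering the signer, the service confirms the wallet holds the license NFT (see the on-chain check in the Solution section) before handing out the link.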
DATASET PUBLISHER JOURNEY

• Upload your dataset and describe the schema + changes.
• The AI pipeline runs and returns sub-scores and notes.
• Curators review a sample and vote.
• Your badge is issued (Verified / Needs Review / Rejected).
• The listing goes live with its Quality Score, report, and preview.
• Revenue is split automatically by policy (creator / DAO / curators / verifiers).

COMMUNITY VOTING (THE HUMAN LAYER)

• Vetted curators up/down-vote each version after sampling it.
• Reputation weighting: accurate curators gain influence; noisy curators lose weight.
• Fast loop: flagged issues get quick feedback; publishers resubmit.
• Buyer ratings after purchase feed back into re-verification when needed.
• The final badge requires BOTH the AI score and the curator score to meet their thresholds.

WHY A DAO

• Sets the public standards: thresholds, weights, domain rules.
• Approves and removes verifiers and curators (keeps the pipeline honest).
• Aligns incentives: optional staking/slashing for reviewers; long-term stability for the rules.

DIP TOKENOMICS

• Payments: buyers pay in a stablecoin (e.g., USDC). Most of each sale (e.g., 85–92%) goes to the publisher; a small protocol fee funds rewards.
• Fee split: the protocol fee auto-splits into the DAO treasury, a verifier pool, and a curator pool (see the sketch after this list).
• Staking: users stake $DIP for yield access and voting rights.
• Yield: staked participants earn epoch rewards from their pools, paid in LSDC. We can route the protocol's stablecoin through lending protocols and external yield providers such as Aave to generate yield, which is then distributed to stakers.
• Governance ($DIP): vote on fee percentages, pool weights, stake sizes, slashing rules, and featured-listing criteria.
• Outcomes: quality datasets earn more renewals; accurate verifiers compound yield; the DAO funds audits, tooling, and growth.
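For concreteness, the sketch below shows the fee-split arithmetic. The basis-point values are illustrative picks inside the 85–92% publisher range above, not final DAO policy:

```python
# Illustrative fee-split arithmetic. The basis-point values are examples
# (within the 85-92% publisher range above), not the DAO's final policy.
from dataclasses import dataclass

BPS = 10_000  # 100% in basis points

@dataclass
class FeePolicy:
    publisher_bps: int = 8_800  # e.g. 88% to the publisher
    dao_bps: int = 600          # DAO treasury
    verifier_bps: int = 300     # verifier reward pool
    curator_bps: int = 300      # curator reward pool

def split_payment(amount: int, p: FeePolicy) -> dict:
    """Split a stablecoin payment (in smallest units) per the DAO policy."""
    assert p.publisher_bps + p.dao_bps + p.verifier_bps + p.curator_bps == BPS
    shares = {
        "publisher": amount * p.publisher_bps // BPS,
        "dao_treasury": amount * p.dao_bps // BPS,
        "verifier_pool": amount * p.verifier_bps // BPS,
    }
    shares["curator_pool"] = amount - sum(shares.values())  # remainder, no dust
    return shares

# e.g. split_payment(1_000_000, FeePolicy())  ->  a 1 USDC sale (6 decimals)
# splits into 880000 / 60000 / 30000 / 30000
```

Taking the curator pool as the remainder keeps integer rounding dust inside the split instead of leaking it.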

Solution

A) FILECOIN / LIGHTHOUSE (Filecoin storage + data tokenisation + gating)

What we store:
• The raw dataset file → Lighthouse → returns the file CID.
• The ERC-721 metadata JSON for the license NFT (name/description/external_url/attributes + AI quality scores) → Lighthouse → returns a metadata CID.
• A "dataset metadata" JSON (title, tags, source, analysis summary) → Lighthouse → returns a metadata CID (also referenced from the NFT).

How we store:
• Lighthouse SDK (XHR) for progressive upload + retry.
• Public gateway links via https://gateway.lighthouse.storage/ipfs/<CID>.
• Persistent storage on Filecoin via Lighthouse, which satisfies the "must store data via Lighthouse + Filecoin" requirement.

Data tokenisation on Lighthouse:
• We register the dataset + tokenURI on-chain (DIPDataDAO.submitDataset) and mirror the record on Lighthouse's tokenisation platform (the metadata JSON contains the same canonical fields).
• The ERC-721 license (DIPDataNFT) is the ownership primitive; tokenURI points to the Lighthouse CID.

Token-gated access (SDK):
• The frontend does NOT reveal the direct gateway link unless the connected wallet is ownerOf(tokenId) for the dataset's license or passes an ERC-20/721 check.
• The "Purchased" badge in the marketplace is computed without events: we iterate NFT tokenIds, verify ownerOf(tokenId) === user, then map datasetOf(tokenId) → datasetId to unlock links (sketched after this section).
• The Lighthouse SDK can optionally enforce token-based access. Our gating logic is on-chain-first: holding the ERC-721 license = unlock.

B) DATA DAO + TOKENOMICS (LSDC payments; DIP staking yield)

Contracts:
• DIPDataDAO: the marketplace + governance.
• DIPDataNFT (ERC-721): the license; datasetOf(tokenId) maps a license to its dataset id.
• DIPStaking: stake DIP (ERC-20), earn LSDC rewards.

Flows:
• Creator publishes: submitDataset(cid, title, tokenUri, price, qualityScore, …) → Status.Pending.
• Governance: vote(id, approve) increments approvals; on reaching approvalVotes → Status.Approved (with an optional DataCoin reward to the creator via dataCoinMinter).
• Purchase / license mint: approve(LSDC, DAO, price), then purchase(id) (sketched after this section). The DAO splits revenue: creator 80%, DAO 15%, ops 2%, rewards 3% (configurable). The rewards slice auto-forwards to DIPStaking.depositRewards (with a try/catch fallback to the DAO treasury), and the DAO mints a DataNFT to the buyer: dataNft.mintTo(buyer, id, tokenUri).
• Staking yield: users stake DIP in DIPStaking; pendingRewards(address) accrues LSDC via rewardPerToken; claimRewards() withdraws LSDC; totalRewardsDistributed is visible in the UI.
• Governance boost (optional): on each vote, the DAO calls staking.notifyGovernanceAction(voter) → a temporary boostBps on staking rewards.

Why this fits Lighthouse's "Data DAO + tokenisation" ask:
• Data is permanently stored via Lighthouse; the DataNFT tokenises access rights.
• Licensing revenue auto-funds ongoing staking rewards, closing the value loop between contributors, the DAO, and buyers.
• Token gating is enforced on-chain; the Lighthouse SDK complements it with link-level controls.
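The purchase flow and the event-free "Purchased" badge from (A) and (B) can be sketched with web3.py as below. The RPC URL, contract addresses, and hand-written ABI fragments are placeholders that mirror the function names above, not the deployed artifacts:

```python
# Buyer-side sketch of purchase(id) and the event-free "Purchased" badge.
# Addresses, the RPC URL, and the ABI fragments are placeholders written to
# match the function names in this section, not the compiled contracts.
from web3 import Web3

ERC20_ABI = [{"name": "approve", "type": "function", "stateMutability": "nonpayable",
              "inputs": [{"name": "spender", "type": "address"},
                         {"name": "amount", "type": "uint256"}],
              "outputs": [{"name": "", "type": "bool"}]}]
DAO_ABI = [{"name": "purchase", "type": "function", "stateMutability": "nonpayable",
            "inputs": [{"name": "id", "type": "uint256"}], "outputs": []}]
NFT_ABI = [{"name": "ownerOf", "type": "function", "stateMutability": "view",
            "inputs": [{"name": "tokenId", "type": "uint256"}],
            "outputs": [{"name": "", "type": "address"}]},
           {"name": "datasetOf", "type": "function", "stateMutability": "view",
            "inputs": [{"name": "tokenId", "type": "uint256"}],
            "outputs": [{"name": "", "type": "uint256"}]}]

w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))  # placeholder RPC
LSDC_ADDR = Web3.to_checksum_address("0x" + "11" * 20)   # placeholder address
DAO_ADDR = Web3.to_checksum_address("0x" + "22" * 20)    # placeholder address
NFT_ADDR = Web3.to_checksum_address("0x" + "33" * 20)    # placeholder address
lsdc = w3.eth.contract(address=LSDC_ADDR, abi=ERC20_ABI)
dao = w3.eth.contract(address=DAO_ADDR, abi=DAO_ABI)
nft = w3.eth.contract(address=NFT_ADDR, abi=NFT_ABI)

def purchase_dataset(buyer: str, dataset_id: int, price: int):
    """approve(LSDC, DAO, price), then purchase(id): the DAO splits revenue
    and mints the license NFT to the buyer."""
    tx = lsdc.functions.approve(DAO_ADDR, price).transact({"from": buyer})
    w3.eth.wait_for_transaction_receipt(tx)
    tx = dao.functions.purchase(dataset_id).transact({"from": buyer})
    return w3.eth.wait_for_transaction_receipt(tx)

def purchased_dataset_ids(user: str, max_token_id: int) -> set:
    """Event-free 'Purchased' badge: scan license tokenIds, keep the ones the
    user owns, and map each license to its dataset id via datasetOf."""
    owned = set()
    for token_id in range(1, max_token_id + 1):
        try:
            if nft.functions.ownerOf(token_id).call() == user:
                owned.add(nft.functions.datasetOf(token_id).call())
        except Exception:  # ownerOf reverts for nonexistent tokenIds
            continue
    return owned
```

Scanning tokenIds instead of indexing events keeps the frontend free of a log-indexing dependency, at the cost of O(n) view calls per page load.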
C) ASI ALLIANCE (Fetch.ai + ASI:One): agents + LLM quality scoring

Agents (uAgents) we actually use (a minimal sketch follows at the end of this section):
• Orchestrator Agent: routes a dataset request through the pipeline and handles retries/timeouts.
• Validation Agent: runs the heavy checks (missingness, duplicates, type consistency, outliers, correlation/multicollinearity, baseline ML readiness).
• Legal Agent: scans for PII & compliance flags; outputs risk levels.

ASI:One Web3-native LLM:
• We send the tool outputs to ASI:One Extended for synthesis: an executive summary, a reasoned quality score (0–100), a readiness tier, and concrete remediation steps.
• The output JSON is embedded in the Lighthouse metadata and surfaced in the NFT attributes for transparency (an on-chain pointer to off-chain rich content).

Agentverse & Chat Protocol eligibility:
• The agents are packaged for Agentverse listing (discoverable), and we expose a Chat Protocol endpoint so judges can converse with the Orchestrator about a dataset's quality explanation.

MeTTa (structured knowledge):
• We serialize key validation facts (schema, anomalies, risk tuples) into a MeTTa-compatible fact set so multiple agents reason over the same ground truth.

D) FLUENCE (cloudless, CPU-only agent hosting)

What we deploy on Fluence VMs:
• The Validation + Legal Agents and the thin FastAPI broker (CPU-only; the scikit-learn, outlier, and PII scans are CPU-feasible).

Why Fluence helps:
• It avoids centralized cloud lock-in and gives us a reproducible, cost-efficient validation backend for the demo.

E) WHY THE TOKEN MODEL WORKS

• Pay with LSDC → the DAO splits revenue → the rewards slice funds DIPStaking.
• Stake DIP → earn LSDC from marketplace activity (the flywheel).
• Governance can gate voting via DIP balance/stake (configurable in the DAO).
• Creator upside: 80% of each primary sale, plus an optional approval reward (dataCoinMinter) when a dataset passes governance.
• Buyer trust: the AI-backed quality score is committed in the NFT metadata, with verifiable source proof pinned on Lighthouse.
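To ground the agent layer in (C), here is a minimal uAgents sketch of the Orchestrator handing a dataset off for validation. The message models, seed, and VALIDATION_AGENT_ADDR are illustrative placeholders; the production agents carry more fields plus retry/timeout handling:

```python
# Minimal uAgents sketch of the orchestration in (C). The message models,
# seed, and VALIDATION_AGENT_ADDR are placeholders; the real agents add
# retry/timeout handling and richer report fields.
from uagents import Agent, Context, Model

class ValidationRequest(Model):
    cid: str            # Lighthouse CID of the dataset to validate
    title: str

class ValidationReport(Model):
    quality_score: int  # 0-100, as synthesized by ASI:One
    summary: str

VALIDATION_AGENT_ADDR = "agent1q..."  # placeholder agent address

orchestrator = Agent(name="orchestrator", seed="demo-orchestrator-seed")

@orchestrator.on_message(model=ValidationRequest)
async def route_request(ctx: Context, sender: str, msg: ValidationRequest):
    """Forward an incoming dataset to the Validation Agent."""
    ctx.logger.info(f"routing {msg.cid} to validation")
    await ctx.send(VALIDATION_AGENT_ADDR, msg)

@orchestrator.on_message(model=ValidationReport)
async def collect_report(ctx: Context, sender: str, msg: ValidationReport):
    """Receive the synthesized report and log the headline score."""
    ctx.logger.info(f"quality score: {msg.quality_score}/100, {msg.summary}")

if __name__ == "__main__":
    orchestrator.run()
```

The same message models double as the shared ground truth between agents; in production the report is what gets embedded in the Lighthouse metadata and the NFT attributes.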

Hackathon

ETHGlobal New Delhi

2025
