An Iterative Approach to Record Deduplication

M. Roshini Karunya; S. Lalitha; B.Tech.; M.E.

Abstract

Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object despite misspellings, typos, different writing styles, or even different schema representations or data types [1]. The existing system provides an Unsupervised Duplication Detection (UDD) method that identifies and removes duplicate records drawn from different data sources. For a given query, UDD can effectively identify duplicates among the query result records of multiple web databases. Two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM), are used to iteratively separate the duplicate records from the non-duplicate records. We also present a Genetic Programming (GP) approach to record deduplication. Since record deduplication is a time-consuming task even for small repositories, our aim is to develop a method that finds a proper combination of the best pieces of evidence, thus yielding a deduplication function that maximizes performance while using only a small representative portion of the corresponding data for training. We propose two further algorithms, Particle Swarm Optimization (PSO) and the Bat Algorithm (BA), to improve the optimization.

Index Terms – Data mining, duplicate records, genetic algorithm
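The idea behind a weighted component similarity sum can be illustrated with a minimal sketch. The abstract does not say how field similarities, weights, or the decision threshold are obtained, so everything below is an assumption made for illustration: the field names, the weights, the threshold value, and the use of Python's `difflib.SequenceMatcher` as the per-field similarity measure. It is not the authors' WCSS classifier and omits the iterative cooperation with the SVM.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (assumed measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_similarity(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities, normalised by the total weight."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(rec1.get(field, ""), rec2.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def is_duplicate(rec1: dict, rec2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Flag a pair as duplicates when the weighted score reaches the threshold.

    The threshold of 0.85 is a hypothetical value, not taken from the paper.
    """
    return weighted_similarity(rec1, rec2, weights) >= threshold


# Hypothetical records and weights, for illustration only.
weights = {"name": 3, "city": 1}
a = {"name": "John Smith", "city": "Chennai"}
b = {"name": "Jon Smith", "city": "Chennai"}
c = {"name": "abc", "city": "pqr"}
d = {"name": "xyz", "city": "lmn"}

print(is_duplicate(a, b, weights))  # near-identical pair
print(is_duplicate(c, d, weights))  # unrelated pair
```

In a scheme like the one the abstract describes, the fixed weights used here would instead be adjusted across iterations, which is the part an optimizer such as GP, PSO, or BA would be expected to tune.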

International Journal of Innovative Research in Computer and Communication Engineering
