A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper

Umamageswari Kumaresan, Kalpana Ramanujam

Source Title: International Journal of Information Retrieval Research (IJIRR) 12(1)

ISSN: 2155-6377|EISSN: 2155-6385|EISBN13: 9781683182085|DOI: 10.4018/IJIRR.290830

MLA

Kumaresan, Umamageswari, and Kalpana Ramanujam. "A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper." IJIRR vol.12, no.1 2022: pp.1-18. http://doi.org/10.4018/IJIRR.290830

APA

Kumaresan, U. & Ramanujam, K. (2022). A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper. International Journal of Information Retrieval Research (IJIRR), 12(1), 1-18. http://doi.org/10.4018/IJIRR.290830

Chicago

Kumaresan, Umamageswari, and Kalpana Ramanujam. "A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper," International Journal of Information Retrieval Research (IJIRR) 12, no.1: 1-18. http://doi.org/10.4018/IJIRR.290830

Abstract

This research aims to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, followed by manual labeling of the extracted data records. The technique discussed in this paper departs from these state-of-the-art approaches by identifying the informative sections of a web page through repetition of informative content rather than repetition of syntactic structure. In the experiments, the system identified the data-rich region with 100% precision for web sites belonging to different domains. Experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.
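
To make the idea of "repetition of informative content rather than syntactic structure" concrete, the following is a minimal illustrative sketch and not the authors' algorithm, whose details are not given in the abstract. It assumes a hypothetical set of coarse semantic labels (PRICE, DATE, NUMBER, TEXT) assigned by simple regular expressions, ignores tag names entirely, and picks the element whose child subtrees most often repeat the same sequence of content labels. All names (`semantic_label`, `find_data_rich_region`, `SAMPLE_HTML`) and the scoring rule are assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical coarse label set; the paper's actual labels are not stated in the abstract.
LABEL_PATTERNS = [
    ("PRICE", re.compile(r"^[$€£]\s?\d[\d,]*(\.\d+)?$")),
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("NUMBER", re.compile(r"^\d+(\.\d+)?$")),
]
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def semantic_label(text):
    """Map a text fragment to a coarse content label (TEXT if nothing matches)."""
    for name, pattern in LABEL_PATTERNS:
        if pattern.match(text):
            return name
    return "TEXT"


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # child Nodes and label strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a small tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(semantic_label(text))


def signature(node):
    """Sequence of content labels in a subtree, ignoring the tags that wrap them."""
    labels = []
    for child in node.children:
        if isinstance(child, Node):
            labels.extend(signature(child))
        else:
            labels.append(child)
    return tuple(labels)


def find_data_rich_region(html):
    """Return the node whose children repeat the same label sequence the most."""
    builder = TreeBuilder()
    builder.feed(html)
    best, best_score = None, 0
    todo = [builder.root]
    while todo:
        node = todo.pop()
        kids = [c for c in node.children if isinstance(c, Node)]
        todo.extend(kids)
        counts = Counter(s for s in map(signature, kids) if s)
        if not counts:
            continue
        sig, repeats = counts.most_common(1)[0]
        # Assumed scoring rule: repeated records weighted by how much content each carries.
        score = repeats * len(sig) if repeats > 1 else 0
        if score > best_score:
            best, best_score = node, score
    return best


SAMPLE_HTML = """
<html><body>
<div id="nav"><a>Home</a><a>About</a><a>Contact</a></div>
<ul id="results">
  <li><span>Blue Widget</span><span>$9.99</span><span>2021-05-01</span></li>
  <li><span>Red Widget</span><span>$12.50</span><span>2021-06-15</span></li>
  <li><span>Green Widget</span><span>$7.00</span><span>2021-07-20</span></li>
</ul>
</body></html>
"""

region = find_data_rich_region(SAMPLE_HTML)
print(region.tag, region.attrs.get("id"))
```

On this sample, the navigation bar also repeats, but each of its children carries only a single TEXT label, whereas each list item repeats a richer TEXT, PRICE, DATE sequence, so the list scores higher. Because the signature ignores tag names, the same result would be obtained if the records were wrapped in `div` or `tr` elements instead of `li`.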