Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums


Weichselbraun, Albert; Brasoveanu, Adrian M. P.; Waldvogel, Roger; and Odoni, Fabian. (2020). Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia.


Web forums discuss topics of long-term, persistent interest in domains such as health, mobile software development, and online gaming, some of which are of high interest from a research and business perspective. In the medical domain, for example, forums contain information on symptoms, drug side effects, and patient discussions that is highly relevant for patient-focused healthcare and drug development. Automatic extraction of forum posts and metadata is a crucial but challenging task, since forums do not expose their content in a standardized structure. Content extraction methods therefore often need customization, such as adaptations to page templates and improvements to their extraction code, before they can be deployed to new forums. Most current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content, such as the identification of author metadata and information on the thread structure. This paper therefore presents a method that determines the XPath of forum posts, eliminating the incorrect mergers and splits of extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL, and structure are extracted. We evaluate our approach by creating a gold standard that contains 102 forum pages from 52 different Web forums, and by benchmarking against a baseline and competing tools.

Keywords: Content Extraction, Forum Extraction, Information Extraction, Gold Standard, WEB-FORUM-52
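
The abstract's central idea, finding one XPath that selects each forum post exactly once, can be illustrated with a small sketch. The code below is not Harvest's algorithm or API. It is a deliberately naive heuristic, written under the assumptions that the page is well-nested HTML, that every post shares the same tag and class path, and that the repeated path carrying the most text is the post container. The names PathCollector, guess_post_xpath, and SAMPLE_PAGE are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped entirely.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the text found under each class-annotated element path.

    Illustrative only: assumes well-nested HTML (every non-void start tag
    has a matching end tag).
    """

    def __init__(self):
        super().__init__()
        self.stack = []        # path steps of the currently open elements
        self.open_texts = []   # one text buffer per open element
        self.texts_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        step = f"{tag}[@class='{css_class}']" if css_class else tag
        self.stack.append(step)
        self.open_texts.append([])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        path = "/" + "/".join(self.stack)
        text = " ".join(self.open_texts.pop())
        self.stack.pop()
        self.texts_by_path[path].append(text)
        # An element's text also counts as text of its parent.
        if text and self.open_texts:
            self.open_texts[-1].append(text)

    def handle_data(self, data):
        if self.open_texts and data.strip():
            self.open_texts[-1].append(data.strip())


def guess_post_xpath(html, min_posts=2):
    """Return the repeated element path that carries the most text, or None."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    candidates = {
        path: texts
        for path, texts in collector.texts_by_path.items()
        if len(texts) >= min_posts
    }
    if not candidates:
        return None
    return max(candidates, key=lambda p: sum(len(t) for t in candidates[p]))


# Hypothetical forum page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<div class="header">My forum</div>
<div class="post"><span class="author">alice</span><p>I had headaches after the first dose.</p></div>
<div class="post"><span class="author">bob</span><p>Same here, it went away after a week.</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(guess_post_xpath(SAMPLE_PAGE))
```

On the sample page the heuristic selects the path of the two post containers rather than the nested author or paragraph elements, because the containers repeat and hold the most text. Real forum pages are far less regular, which is why a dedicated approach and a gold standard such as the one described in the abstract are needed.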

Downloads and Resources

  1. Reference (BibTeX)
  2. Full Article