Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT)

Citation

Weichselbraun, Albert; Scharl, Arno; and Lang, Heinz-Peter (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013), Banff, Canada.

Abstract

Knowledge capture approaches in the age of massive Web data require robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data, both unstructured and structured. This paper addresses this requirement by introducing the Extensible Web Retrieval Toolkit (eWRT), a modular Python API for retrieving social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia. eWRT has been released as an open source library under GNU GPLv3. It includes classes for caching and data management, and provides low-level text processing capabilities including language detection, phonetic string similarity measures, and string normalization.

Keywords: data acquisition, knowledge extraction, text mining, structured and unstructured information sources, social media

Downloads and Resources

  1. Reference (BibTeX)
  2. Full Article