Open Source Software

  • eWRT - extensible Web Retrieval Toolkit
  • bibTexSuite - tools for publishing and searching BibTeX files
  • Nilsimsa - locality-sensitive hashing algorithm
  • Inscriptis - converts HTML to text with support for Cascading Style Sheets (CSS)

eWRT (extensible Web Retrieval Toolkit)

The extensible Web Retrieval Toolkit (eWRT) provides a Python API for retrieving data from Web sources such as Delicious, Flickr, Yahoo, and Wikipedia, along with helper classes for caching and handling Web data.

Resources

Source code

Clone the eWRT code repository
git clone https://github.com/weblyzard/ewrt.git 


bibTexSuite

The bibTexSuite provides tools for efficient handling of BibTeX files. The suite currently comprises two tools, bibSearch.py and bibPublish.py.

Resources

Source code

Clone the bibTexSuite repository
git clone https://github.com/AlbertWeichselbraun/bibTexSuite.git


Nilsimsa

A Java implementation of the Nilsimsa locality-sensitive hash. The Nilsimsa algorithm computes a 256-bit hash value for a string; comparing two such hashes indicates how different the underlying strings are. The more similar the strings, the smaller the bitwise difference between their Nilsimsa hashes. Nilsimsa is therefore well suited to detecting texts of the same origin, such as slightly modified spam messages or updated newspaper articles.
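
The comparison step can be illustrated with a minimal Python sketch. It does not compute Nilsimsa hashes and does not use the API of the Java library; it only counts the differing bits between two 256-bit values given as hex strings. The function name bit_difference and the digest values are made up for illustration.

```python
def bit_difference(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    # XOR leaves a 1 bit wherever the two digests disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical 256-bit values (64 hex characters), not real Nilsimsa output.
digest_a = "00" * 32
digest_b = "00" * 31 + "0f"  # differs from digest_a in the lowest 4 bits
digest_c = "ff" * 32         # differs from digest_a in every bit

print(bit_difference(digest_a, digest_b))  # 4
print(bit_difference(digest_a, digest_c))  # 256
```

A small count suggests that the two inputs are near-duplicates, while a large count suggests unrelated texts. Where to draw the line is application specific and not specified here.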

Resources

Source code

Clone the nilsimsa repository
git clone https://github.com/weblyzard/nilsimsa.git


Inscriptis

A fast, Python-based HTML-to-text converter with initial support for Cascading Style Sheets (CSS). Inscriptis comprises a library for use in Python programs and a command line tool for converting HTML to text.

Resources

Source code

Clone the inscriptis repository
git clone https://github.com/weblyzard/inscriptis.git