Open Source Software
- eWRT - extensible Web Retrieval Toolkit
- bibTexSuite - tools for publishing and searching BibTeX files
- Nilsimsa - locality-sensitive hashing algorithm
- Inscriptis - converts HTML to text with support for Cascading Style Sheets (CSS)
eWRT (extensible Web Retrieval Toolkit)
The extensible Web Retrieval Toolkit provides a Python API for retrieving data from Web sources
such as Delicious, Flickr, Yahoo, and Wikipedia, as well as helper classes for caching and handling Web data.
Resources
Source code
Clone the eWRT code repository
git clone https://github.com/weblyzard/ewrt.git
bibTexSuite
The bibTexSuite provides tools for efficient handling of BibTeX files. Currently the suite comprises the bibSearch.py and bibPublish.py tools.
Resources
Source code
Clone the bibTexSuite repository
git clone https://github.com/AlbertWeichselbraun/bibTexSuite.git
Nilsimsa
A Java implementation of the Nilsimsa locality-sensitive hash. The Nilsimsa algorithm computes a 256-bit hash value for a string; comparing two such hashes
indicates how different the underlying strings are. The more similar the strings, the smaller the bitwise difference between their respective Nilsimsa hashes. Nilsimsa is therefore well suited to detecting texts of the same origin, such as slightly modified spam messages or updated newspaper articles.
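The comparison step described above can be illustrated with a minimal Python sketch. It does not compute Nilsimsa hashes and is not taken from the Java library; it only shows how the bitwise difference between two 256-bit digests could be counted. The function name `bit_difference` and the digest values are hypothetical and chosen purely for illustration.

```python
def bit_difference(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    # XOR leaves a 1 in every position where the two digests disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Made-up 256-bit digests (64 hex characters each), not real Nilsimsa output.
digest_a = "00" * 32
digest_b = "00" * 31 + "0f"  # differs from digest_a in 4 bits
digest_c = "ff" * 32         # differs from digest_a in all 256 bits

print(bit_difference(digest_a, digest_b))  # 4
print(bit_difference(digest_a, digest_c))  # 256
```

A small difference, as between `digest_a` and `digest_b`, would suggest near-duplicate texts, while a large one suggests unrelated texts. Where to draw the line is an application-specific choice that this sketch does not address.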
Resources
Source code
Clone the nilsimsa repository
git clone https://github.com/weblyzard/nilsimsa.git
Inscriptis
A fast, Python-based HTML-to-text converter with initial support for Cascading Style Sheets (CSS). Inscriptis comprises a library for use in Python programs as well as a command-line tool for converting HTML to text.
Resources
Source code
Clone the inscriptis repository
git clone https://github.com/weblyzard/inscriptis.git