On the analysis of big terrible data

  • Extracting dates from unstructured text

    In this post I discuss an idea on how we can extract dates from unstructured text. The main issue is that dates a written in all sorts of formats and I want to extract and normalize all of them. My previous approach consisted of generating regular expressions, mapping each pattern to a date format. However, this quickly proved to be way too slow. Though I could’ve tuned the performance by combining the patterns or use things like pyparsing, I decided to look for other alternatives. In summary I use Aho-Corasick to find parts of dates and generate another trie to detect valid dates. The technique has enabled me to extract dates in time linear to the size of the text plus the number of results. This means that the time required to search is pretty much the same regardless if we search for a single format or a hundred. My primary application for the technique is to generate semantic timelines from unstructured text.

  • Doing face recognition with JavaCV

    I couldn’t find any tutorial on how to perform face recognition using OpenCV and Java, so I decided to share a viable solution here. The solution is very inefficient in its current form as the training model is built at each run, however it shows what’s needed to make it work.