On the analysis of big terrible data


  • Context-independent named entity recognition




    Statistical approaches to Named Entity Recognition are trained on specific types of text and can perform poorly on others, because of differences in either language or formatting. Purely empirical approaches, like the one presented here, do not have this limitation and may thus be better suited to the messy data of digital investigations, as well as being easier to explain. The code and experimental corpus are available. Feel free to send me an email if you would like to get the model.



  • Extracting dates from unstructured text


    In this post I discuss an idea for extracting dates from unstructured text. The main issue is that dates are written in all sorts of formats, and I want to extract and normalize all of them. My previous approach consisted of generating regular expressions, mapping each pattern to a date format. However, this quickly proved to be far too slow. Though I could have improved performance by combining the patterns or using something like pyparsing, I decided to look for alternatives. In summary, I use Aho-Corasick to find parts of dates and a second trie to detect which sequences of parts form valid dates. The technique has enabled me to extract dates in time linear in the size of the text plus the number of results. This means that the time required to search is pretty much the same regardless of whether we search for a single format or a hundred. My primary application for the technique is generating semantic timelines from unstructured text.
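
    To make the two-stage idea concrete, here is a minimal Python sketch. It is not the code from the post, and it is simplified: a single regex scan stands in for the Aho-Corasick pass that finds date parts, and the part labels (`DAY`, `MNUM`, `MONTH`, `YEAR`), the `FORMATS` list and all function names are assumptions made up for illustration. Only the second stage, a trie over sequences of part types, follows the description above, and the sketch does not try to reproduce the linear-time bound.

```python
import re
from datetime import date

MONTHS = {name: i + 1 for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

# Stand-in for the Aho-Corasick pass: one scan that splits text into candidate parts.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[-/.,]")

# Hypothetical date formats, written as sequences of part types.
FORMATS = [
    ("DAY", "MONTH", "YEAR"),            # 12 March 2014
    ("MONTH", "DAY", ",", "YEAR"),       # March 12, 2014
    ("YEAR", "-", "MNUM", "-", "DAY"),   # 2014-03-12
    ("DAY", "/", "MNUM", "/", "YEAR"),   # 12/03/2014
]

def build_trie(formats):
    """Build a nested-dict trie; the key "$" marks the end of a complete format."""
    root = {}
    for fmt in formats:
        node = root
        for part in fmt:
            node = node.setdefault(part, {})
        node["$"] = True
    return root

def classify(tok):
    """Return every (part type, value) pair a raw token could stand for."""
    low = tok.lower()
    if low in MONTHS:
        return [("MONTH", MONTHS[low])]
    if tok.isdigit():
        n, kinds = int(tok), []
        if len(tok) == 4:
            kinds.append(("YEAR", n))
        if len(tok) <= 2 and 1 <= n <= 31:
            kinds.append(("DAY", n))
        if len(tok) <= 2 and 1 <= n <= 12:
            kinds.append(("MNUM", n))
        return kinds
    if tok in "-/.,":
        return [(tok, None)]  # separators carry no value
    return []

def match_from(tokens, i, node, fields):
    """Yield (end_index, fields) for each format completed by walking from token i."""
    if "$" in node:
        yield i, fields
    if i < len(tokens):
        for kind, value in classify(tokens[i]):
            if kind in node:
                extra = {kind: value} if value is not None else {}
                yield from match_from(tokens, i + 1, node[kind], {**fields, **extra})

def extract_dates(text, trie):
    """Return normalized datetime.date objects for every valid date found in text."""
    tokens = TOKEN_RE.findall(text)
    found, i = [], 0
    while i < len(tokens):
        # Prefer the longest format that matches at this position.
        best = max(match_from(tokens, i, trie, {}), key=lambda m: m[0], default=None)
        if best:
            end, f = best
            try:
                found.append(date(f["YEAR"], f.get("MONTH") or f["MNUM"], f["DAY"]))
                i = end
                continue
            except ValueError:
                pass  # parts matched a format but not a real calendar date
        i += 1
    return found
```

    Calling `extract_dates(text, build_trie(FORMATS))` returns one `datetime.date` per match. Adding a format only adds a path to the trie, which is why the number of formats has little effect on the matching step.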



  • Doing face recognition with JavaCV


    I couldn’t find any tutorial on how to perform face recognition using OpenCV and Java, so I decided to share a working solution here. It is very inefficient in its current form, because the model is retrained on every run, but it shows what is needed to make face recognition work.