Statistical approaches to Named Entity Recognition are trained on specific types of text and sometimes perform poorly on others, whether because of language or formatting. Purely empirical approaches, like the one presented here, do not have this limitation and may thus be better suited to the messy data of digital investigations, as well as being easier to explain. Code and experimental corpus are made available. Feel free to send me an email if you would like to get the model.
In this post I discuss an idea for extracting dates from unstructured text. The main issue is that dates are written in all sorts of formats, and I want to extract and normalize all of them. My previous approach consisted of generating regular expressions and mapping each pattern to a date format. However, this quickly proved to be way too slow. Though I could've tuned the performance by combining the patterns or using things like pyparsing, I decided to look for alternatives. In summary, I use Aho-Corasick to find the parts of dates (day numbers, month names, years, separators) and a second trie, built over sequences of those parts, to detect valid dates. The technique has enabled me to extract dates in time linear in the size of the text plus the number of results. This means that the search time is pretty much the same regardless of whether we search for a single format or a hundred. My primary application for the technique is generating semantic timelines from unstructured text.
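To make the two-stage idea concrete, here is a minimal sketch in Python. It is not the code from my implementation: a simple regex tokenizer stands in for the Aho-Corasick stage, and the names `classify`, `build_trie`, `extract_dates` and the three example formats are made up for illustration. What it does show is the second stage, where each date format is a path through a trie of token kinds, so adding more formats adds branches rather than extra passes over the text.

```python
import re
from datetime import date

# Assumption: English month names only; a real keyword set would be much larger.
MONTHS = {name: i + 1 for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}
SEPARATORS = {"/", "-", ".", ","}

# Stand-in for the Aho-Corasick stage: split into words, numbers, separators.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[/,.-]")


def classify(token):
    """Map a raw token to (kind, value), or None if it cannot be part of a date."""
    low = token.lower()
    if low in MONTHS:
        return ("MONTH", MONTHS[low])
    if token.isdigit():
        n = int(token)
        if len(token) == 4 and 1000 <= n <= 2999:
            return ("YEAR", n)
        if len(token) <= 2 and 1 <= n <= 31:
            return ("NUM", n)
    if token in SEPARATORS:
        return (token, None)
    return None


# Each format: a sequence of token kinds, plus the role of each token
# ("y", "m", "d", or None for separators). Illustrative formats only.
FORMATS = [
    (("YEAR", "-", "NUM", "-", "NUM"), ("y", None, "m", None, "d")),  # 2015-03-14
    (("NUM", "MONTH", "YEAR"), ("d", "m", "y")),                      # 2 April 2016
    (("MONTH", "NUM", ",", "YEAR"), ("m", "d", None, "y")),           # March 5, 2017
]


def build_trie(formats):
    """Build a nested-dict trie over token kinds; "$" marks a complete format."""
    root = {}
    for pattern, roles in formats:
        node = root
        for kind in pattern:
            node = node.setdefault(kind, {})
        node["$"] = roles
    return root


def extract_dates(text, trie):
    """Return normalized dates found in text, in order of appearance."""
    tokens = [classify(t) for t in TOKEN_RE.findall(text)]
    results = []
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Walk the trie as far as the token kinds allow, keeping the longest match.
        while j < len(tokens) and tokens[j] and tokens[j][0] in node:
            node = node[tokens[j][0]]
            j += 1
            if "$" in node:
                best = (j, node["$"])
        if best:
            end, roles = best
            parts = {r: tok[1] for r, tok in zip(roles, tokens[i:end]) if r}
            try:
                results.append(date(parts["y"], parts["m"], parts["d"]))
                i = end
                continue
            except ValueError:
                pass  # e.g. 2015-02-31: the shape matches, but it is not a real date
        i += 1
    return results


if __name__ == "__main__":
    sample = "Filed on 2015-03-14, amended 2 April 2016 and closed March 5, 2017."
    for found in extract_dates(sample, build_trie(FORMATS)):
        print(found.isoformat())
```

Because the walk from any starting token is bounded by the length of the longest format, the scan stays roughly linear in the number of tokens however many formats the trie holds. Ambiguous formats such as `03/04/2015` are deliberately left out of the sketch, since resolving day/month order needs a policy decision rather than more trie branches.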