Side Projects


Teaching at Carnegie Mellon University in Qatar

I am delighted to take on the challenge of teaching Information Retrieval to undergraduates at CMU Qatar.
It has been an exciting and challenging experience so far. All my lectures and the material I am using are publicly available at this GitHub link.


Python TRECTOOLs

I have been working on this set of Python tools for information retrieval evaluation. It provides an interface to common and repetitive tasks such as analyzing runs, running an IR framework like Indri or Terrier with different baselines, evaluating runs with trec_eval, analyzing the results, or even fusing ranked lists to create a more robust run.
For collection creators, it supports tasks such as document pool creation.

The project is under constant development. You can follow it on GitHub or install the latest version with pip.
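
For a flavor of the interface, here is a minimal sketch of an evaluation-plus-fusion workflow. The class names (TrecRun, TrecQrel, TrecEval) follow the library's documented interface, but the file names are placeholders and exact method signatures may vary between versions, so treat this as indicative rather than definitive:

    from trectools import TrecRun, TrecQrel, TrecEval, fusion

    # Load two runs and the relevance judgements (standard TREC formats)
    bm25 = TrecRun("runs/bm25.run")          # placeholder file name
    lm = TrecRun("runs/language_model.run")  # placeholder file name
    qrels = TrecQrel("qrels/qrels.txt")      # placeholder file name

    # Evaluate a single run with trec_eval-style metrics
    evaluation = TrecEval(bm25, qrels)
    print("MAP: %.4f" % evaluation.get_map())
    print("NDCG: %.4f" % evaluation.get_ndcg())

    # Fuse the two ranked lists into a (hopefully) more robust run
    fused = fusion.reciprocal_rank_fusion([bm25, lm])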


CLEF eHealth

I have been involved in CLEF eHealth since 2014 and, together with Dr. Guido Zuccon, have been one of the main organizers of its Information Retrieval task since 2015.

In 2014, I helped run Task 3, an ad-hoc information retrieval task focused on supporting laypeople in searching for and understanding health information. The challenge mimicked patients querying about key disorders that appeared in their discharge summaries. The document collection was provided by the Khresmoi project and contains more than one million web pages covering a broad range of health topics, targeted at both the general public and healthcare professionals.

In 2015, we changed the task to focus on symptoms rather than diseases, mimicking the queries of laypeople who are confronted with a sign, symptom, or condition and attempt to find out more about what they may have. Recent research has indicated that current web search engines fail to effectively support these queries (Zuccon et al., Stanton et al.). Another innovation in 2015 was our first set of experiments with understandability: in parallel to document relevance, we asked our medical assessors to judge whether they would recommend the document to their patients, taking into consideration how difficult the document is to read. To the best of our knowledge, this is the first time document readability has been assessed in an IR task, and we are going to investigate in detail what impact it has on the rankings created.

See more: the CLEF eHealth 2014 lab overview and the CLEF eHealth 2014 IR task overview papers.


TREC

I participated in the 2014 and 2015 editions of the TREC Clinical Decision Support track (TREC-CDS), representing the Vienna University of Technology. The focus of this task was on providing material to support physicians in making decisions about diagnoses, medical tests, and treatments. The document collection consisted of full-text articles from PubMed Central.

Although this TREC track also focuses on medical information retrieval, there are many significant differences between it and CLEF eHealth. For example, readability is not a concern in the context of TREC-CDS, as the IR systems are meant to be used by physicians rather than patients or the general public. Nevertheless, domain-specific medical resources, such as MetaMap annotations, MeSH, or UMLS, can be used in this task as well.
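
As a purely generic illustration of how such a vocabulary can feed query expansion (this is not our actual TREC-CDS system), the sketch below appends the synonyms of any concept found in the query; mesh_synonyms is a hypothetical lookup table standing in for a real MeSH/UMLS client:

    # Generic sketch of vocabulary-based query expansion.
    # mesh_synonyms is a hypothetical stand-in for real MeSH/UMLS access.
    mesh_synonyms = {
        "heart attack": ["myocardial infarction", "mi"],
        "high blood pressure": ["hypertension"],
    }

    def expand_query(query):
        """Append known synonyms of every concept found in the query."""
        expansions = []
        for concept, synonyms in mesh_synonyms.items():
            if concept in query.lower():
                expansions.extend(synonyms)
        return " ".join([query] + expansions)

    print(expand_query("58-year-old male with heart attack symptoms"))
    # -> 58-year-old male with heart attack symptoms myocardial infarction mi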

In 2015, our query expansion method took second place (out of 30). You can check the details here.


MediaEval

In 2014 and 2015, I participated in the Retrieving Diverse Social Images challenge at the MediaEval benchmark. The challenge consisted of re-ranking an initial Flickr result list to account for diversity. In this context, a diverse list of images is one that shows different perspectives of a specific point of interest: for example, a list that shows the Notre Dame cathedral from inside, from outside, from far away, and so on. We proposed an ensemble of clustering methods that worked relatively well (3rd place).
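
To give an idea of the general recipe (a simplified sketch, not the exact ensemble we submitted): cluster the top results by their visual and/or textual features, then pick images round-robin across the clusters so that each perspective is represented early in the list:

    # Simplified cluster-based diversification, assuming each image in the
    # original Flickr ranking has already been assigned a cluster id.
    def diversify(ranked, n_clusters):
        # Bucket images by cluster, preserving the original rank order
        buckets = [[] for _ in range(n_clusters)]
        for image_id, cluster_id in ranked:
            buckets[cluster_id].append(image_id)
        # Round-robin across clusters: one image per perspective per pass
        reranked = []
        while any(buckets):
            for bucket in buckets:
                if bucket:
                    reranked.append(bucket.pop(0))
        return reranked

    ranked = [("img1", 0), ("img2", 0), ("img3", 1), ("img4", 2), ("img5", 1)]
    print(diversify(ranked, 3))  # ['img1', 'img3', 'img4', 'img2', 'img5']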

For 2015, we incorporated more features, including a deep learning solution, and explored ways to combine the visual and textual features. This system took first place.


Readability

One of the subjects I started working on during my PhD is the readability of textual documents. The challenge here is to match a person's reading skill with the best possible material for him/her. There are many traditional metrics for measuring how hard a text is, mostly based on surface-level characteristics of the text, i.e., the length of its words and sentences. I implemented them all in the open-source Python package ReadabilityCalculator.
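
For instance, the classic Flesch Reading Ease score is built from exactly these surface features. Below is a self-contained sketch with a deliberately naive syllable counter (the package uses more careful counting); higher scores mean easier text:

    import re

    def count_syllables(word):
        # Naive heuristic: count groups of consecutive vowels
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        """Flesch Reading Ease:
        206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)"""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (206.835
                - 1.015 * (len(words) / sentences)
                - 84.6 * (syllables / len(words)))

    print(flesch_reading_ease("The cat sat on the mat. It was happy."))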

If you are interested in learning more about readability, I highly recommend this link, which covers much of the literature on readability.

See more: readability-calculator source code


Kaggle

In 2013-2014 I had a great time with Kaggle. As soon as I discovered the website, I became addicted to it and started participating in as many competitions as my schedule allowed. However, as everybody knows, time is a limited resource… so I am much less active now. Nevertheless, if there is an interesting competition going on and you want to team up with me, please let me know! 🙂

See more: my kaggle profile