SOSAD – Sourcing or Orchestrating a Sentiment Analysis Dataset



The brief

Sentiment analysis in Finnish is “the most asked after thing” (an actual recent quote from leading Natural Language Processing researchers), but a good enough Finnish dataset does not exist. Machine learning algorithms learn from examples and therefore require a set of labelled data. A labelled datapoint is a sentence or a paragraph along with a label characterizing its sentiment, such as “positive”, “negative”, etc. Can we find a way to orchestrate the collection of a sentiment analysis dataset?

The background

Futurice recently ran into a roadblock in one of our charity projects, which analyzes how people discuss pharmaceutical drugs and symptoms on a popular discussion forum. We wanted to do sentiment analysis, but lacked the tools. Hence the idea: we and others need the dataset, so let’s get it done and open source it. It will benefit researchers, some companies, and society overall. In this project, Futurice data scientists as well as many researchers will be ready to actively participate and help get you to the finish line.

The problem

Can we create a new way to collect data about the sentiments underlying language? What we would like to have is either a dataset of sufficient size or, better yet, a method for creating and expanding a dataset.

Examples of possible methods for creating the dataset are listed below. You can of course come up with new ones, or combine methods.

  • You manually label data.
  • You create a game which attracts people to label data for science.
  • You find ways to create a large labelled dataset without much manual work from anyone. For example, reviews contain stars, which often indicate the sentiment of the review itself. Many social media messages contain smileys, which similarly indicate the sentiment of the message to some extent. Students could invent ways to leverage data sources like these to create a labelled sentiment dataset.

It is also considered a good or even excellent result if the project produces a proven way to create and extend a dataset, such as the gamified approach mentioned above.

Rough examples of what the desired outcome (a labelled dataset) might look like:

Example 1 (simple format, this is a good start): sentences are labelled as positive/neutral/negative, and one person has labelled each sentence.

“Oli kyl paras matsi ikinä!”, positive
“Ootko kokeillu laittaa sitä pois päältä ja päälle?”, neutral

Example 2 (sentences labelled by multiple people, this would be a good result from the project):

“Oli kyl paras matsi ikinä!”, Matti: positive, Mikko: positive
“Ootko kokeillu laittaa sitä pois päältä ja päälle?”, Matti: neutral, Teemu: negative, Minna: neutral

Example 3 (multiple sentiments possible, object of sentiment also captured, this would be an exceptional result, but might be very challenging to achieve):

“Nordeas palvellaan asiakasta, sitä OP ei osaa”, positive: “Nordea”, negative: “OP”

It is an advantage if the amount of text in the dataset is large, even if it is not completely or even near-completely labelled. Partially labelled datasets can be used for semi-supervised learning, which tends to work better than training only on the smaller labelled part of the data.
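
To make two of the ideas above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a prescribed method: the 1 to 5 star scale, the star thresholds, the emoticon lists and all function names are hypothetical choices made for this example. It shows how star ratings and smileys could produce rough labels without manual work, and how the multi-annotator format of Example 2 could be reduced to one label per sentence by majority vote.

```python
from collections import Counter

# Assumed emoticon lists; a real project would need a much larger set.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "<3")
NEGATIVE_EMOTICONS = (":(", ":-(", ":'(")


def label_from_stars(stars):
    """Map a review's star rating (assumed 1 to 5) to a rough sentiment label."""
    if not 1 <= stars <= 5:
        raise ValueError("expected a rating between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return "neutral"


def label_from_emoticons(text):
    """Return a rough label based on emoticons, or None if there is no clear signal."""
    has_positive = any(e in text for e in POSITIVE_EMOTICONS)
    has_negative = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_positive == has_negative:
        # Either no emoticon at all, or conflicting ones: leave unlabelled.
        return None
    return "positive" if has_positive else "negative"


def majority_label(votes):
    """Combine labels from several people (name -> label) into one.

    Returns (label, agreement), where agreement is the share of people
    who chose the most common label. The label is None on a tie.
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes.values()).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_label, agreement


if __name__ == "__main__":
    print(label_from_stars(5))
    # The ":)" is appended here for illustration; it is not in the brief's example.
    print(label_from_emoticons("Oli kyl paras matsi ikinä! :)"))
    print(majority_label({"Matti": "neutral", "Teemu": "negative", "Minna": "neutral"}))
```

Sentences for which `label_from_emoticons` returns `None` do not have to be thrown away: they can stay in the dataset as unlabelled text, which fits the point above about partially labelled data. Keeping the agreement score next to the majority label also makes it possible to filter out sentences the annotators disagreed on.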