Supervised Machine Learning for Text Analysis in R

By Julia Silge in rstats

July 24, 2020

Today, Emil Hvitfeldt and I led a useR! 2020 online tutorial on predictive modeling with text using tidy data principles. This tutorial was hosted by R-Ladies en Argentina; huge thanks to the organizers for their leadership and effort in making this tutorial possible.

[Image: tutorial flyer]

Materials for this tutorial are available on GitHub; the repo contains two main resources.

If you get stuck while working through these materials, you can post on RStudio Community or open an issue on the repo. Our goal in designing this tutorial was to create resources for asynchronous learning.

The content for this tutorial is largely based on a new project that Emil and I are working on, which we are thrilled to publicly announce as of today: our book Supervised Machine Learning for Text Analysis in R, to be published in the Chapman & Hall/CRC Data Science Series!

That title is a bit of a mouthful, so we like to call our project SMLTAR, which is also the URL where you can find, now and always, the online version of this book. We invite you to take a look at the work we’ve done already and explore how unstructured text data can be used for supervised predictive models. The book is divided into three sections.

  • Natural language features: How do we transform text data into a representation useful for modeling? In these chapters, we explore the most common preprocessing steps for text, when they are helpful, and when they are not. This section is in good shape already!

  • Machine learning methods: We investigate the power of some of the simpler and more lightweight models in our toolbox. We drew from these chapters in our useR! tutorial.

  • Deep learning methods: Given more time and resources, we see what is possible once we turn to neural networks. This section is still to come.

Already, we have so many people to thank for their contributions and support, including our Chapman & Hall editor John Kimmel, the helpful technical reviewers, and Desirée De Leon for the design of the book’s website. We hope you get a chance to check out this project!
