Blog

Machine learning, text analysis, and more

New sports from random emoji

I love emoji ❤️ and I love xkcd, so this recent comic from Randall Munroe was quite a delight for me. I sat there, enjoying the thought of these new sports like horse hole and multiplayer avocado, and I thought, “I can make more of these in just the barest handful of lines of code.” This is largely thanks to the emo package by Hadley Wickham; if you haven’t installed and started using it yet, WHY NOT?
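
As a rough sketch of the idea, here is one way to pair up random emoji in base R. It deliberately does not use the emo package; the hand-picked `emoji_pool` vector and the `new_sport` name are assumptions made for illustration only.

```r
# A minimal sketch, assuming a small hand-picked pool of emoji
# (hypothetical; the emo package would offer a far larger set).
set.seed(2017)

emoji_pool <- c("🐴", "🕳️", "🥑", "⚽", "🏀", "🎾")

# Draw two different emoji at random and join them into one "sport"
new_sport <- paste(sample(emoji_pool, 2), collapse = " ")
new_sport
```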

November 25, 2017

Word Vectors with tidy data principles

Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data principles and sparse matrices. Word vectors, or word embeddings, are typically calculated using neural networks; that is what word2vec is.
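
To make the word-counts-plus-matrix-factorization idea concrete, here is a minimal base R sketch. It is not the code from the post: the toy corpus is made up, each document is treated as a single context window (a simplification), and the names `dtm`, `ppmi`, and `word_vectors` are hypothetical. It computes pointwise mutual information (PMI) from co-occurrence counts and then factorizes the result with a singular value decomposition.

```r
# Toy corpus (hypothetical); each document counts as one context window
docs <- c("the cat sat on the mat",
          "the dog sat on the rug",
          "a cat and a dog played",
          "the dog chased the cat")

tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Binary document-term matrix: 1 if the word occurs in the document
dtm <- t(sapply(tokens, function(x) as.integer(vocab %in% x)))
colnames(dtm) <- vocab

# Co-occurrence counts: number of documents that contain both words
cooc <- crossprod(dtm)

# Probabilities of single words and of word pairs
p_word <- diag(cooc) / nrow(dtm)
p_pair <- cooc / nrow(dtm)

# PMI; pairs that never co-occur give -Inf and are set to 0
pmi <- log(p_pair / outer(p_word, p_word))
pmi[!is.finite(pmi)] <- 0
ppmi <- pmax(pmi, 0)  # keep only positive associations

# Factorize and keep two dimensions as the word vectors
word_vectors <- svd(ppmi, nu = 2, nv = 0)$u
rownames(word_vectors) <- vocab
word_vectors
```

On a real corpus the matrices would be stored in sparse form, and the number of dimensions would be far larger than two.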

October 30, 2017

From Power Calculations to P-Values: A/B Testing at Stack Overflow

Note: cross-posted with the Stack Overflow blog. If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process.
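
As an illustration of the two ends of that process, the sketch below uses base R's `power.prop.test()` to estimate a sample size before a test and `prop.test()` to compute a p-value afterwards. All the numbers (conversion rates, visitor counts) are made up for the example and are not Stack Overflow data.

```r
# Before the test: how many users per group would be needed?
# Assumed (hypothetical) baseline rate of 10% and a hoped-for rate of 12%.
power_calc <- power.prop.test(p1 = 0.10, p2 = 0.12,
                              power = 0.80, sig.level = 0.05)
n_per_group <- ceiling(power_calc$n)
n_per_group

# After the test: compare the two observed proportions.
# These counts are invented for illustration.
conversions <- c(baseline = 400, new_version = 470)
visitors    <- c(baseline = 4000, new_version = 4000)

test_result <- prop.test(conversions, visitors)
test_result$p.value
```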

October 17, 2017

Mapping ecosystems of software development

I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations: how often technology tags at Stack Overflow appear together relative to how often they appear separately.
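
One common way to measure this kind of co-occurrence is the phi coefficient, which is the Pearson correlation between two binary indicators. The base R sketch below assumes that measure and uses a handful of invented questions; the post itself may compute correlations differently, and the names `questions` and `tag_matrix` are hypothetical.

```r
# Invented questions, each with its set of tags
questions <- list(
  c("r", "ggplot2"),
  c("r", "dplyr"),
  c("python", "pandas"),
  c("python", "numpy", "pandas"),
  c("r", "ggplot2", "dplyr"),
  c("javascript", "jquery")
)

tags <- sort(unique(unlist(questions)))

# Indicator matrix: one row per question, one column per tag
tag_matrix <- t(sapply(questions, function(q) as.integer(tags %in% q)))
colnames(tag_matrix) <- tags

# Pearson correlation of 0/1 indicators is the phi coefficient
tag_cor <- cor(tag_matrix)

tag_cor["python", "pandas"]  # always appear together in this toy data
tag_cor["r", "python"]       # never appear together in this toy data
```

Tags that always appear together get a correlation of 1, while tags that rarely share a question get a negative value.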

October 3, 2017

tidytext 0.1.4

I am pleased to announce that tidytext 0.1.4 is now on CRAN! This release of our package for text mining using tidy data principles has an excellent collection of delightfulness in it. First off, all the important functions in tidytext now support non-standard evaluation through the tidyeval framework.

    library(janeaustenr)
    library(tidytext)
    library(dplyr)

    input_var <- quo(text)
    output_var <- quo(word)

    data_frame(text = prideprejudice) %>%
      unnest_tokens(!! output_var, !! input_var)

    ## # A tibble: 122,204 x 1
    ##    word
    ##    <chr>
    ##  1 pride
    ##  2 and
    ##  3 prejudice
    ##  4 by
    ##  5 jane
    ##  6 austen
    ##  7 chapter
    ##  8 1
    ##  9 it
    ## 10 is
    ## # .

September 30, 2017