PREFACE: By Gilad: Last year I gave a tutorial at the PyData NYC conference on my work using Python’s NetworkX library and the open source graphing tool, Gephi. The tutorial covers some fundamental social network theory and then highlights a methodology which I commonly use to analyze communities on Twitter. Here I mapped out the embedded social network amongst all Twitter profiles that had the word ‘python’ in their user bios. I grabbed this data by identifying users who had been actively tweeting during the week before the conference. Then I generated a graph where each node represents a Twitter user and each edge represents a follower/following relationship. The larger a node, the more central it is within the community. (Go here to view images/results)
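To make the graph construction concrete, here is a minimal, dependency-free sketch of the idea: users become nodes, follow relationships become directed edges, and each node gets a centrality score that could drive its size. The handles and follow pairs are made up for illustration, and in-degree centrality is an assumption on my part here; the original mapping was built with NetworkX and Gephi, and the exact centrality measure used for node sizing isn't stated above.

```python
from collections import defaultdict


def build_follow_graph(follows):
    """Return {user: set of users they follow} from (follower, followed) pairs."""
    graph = defaultdict(set)
    for follower, followed in follows:
        if follower == followed:
            continue  # ignore self-follows
        graph[follower].add(followed)
        graph[followed]  # touching the key registers the followed user as a node
    return dict(graph)


def in_degree_centrality(graph):
    """Share of the other users in the graph who follow each user."""
    counts = {user: 0 for user in graph}
    for followed_users in graph.values():
        for followed in followed_users:
            counts[followed] += 1
    others = len(graph) - 1
    if others <= 0:
        return {user: 0.0 for user in graph}
    return {user: count / others for user, count in counts.items()}


# Hypothetical (follower, followed) pairs, not real data.
follows = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
]

graph = build_follow_graph(follows)
centrality = in_degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
```

In this toy example everyone else follows bob, so bob ends up as the most central (largest) node even though he follows only one account himself.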
Mapping Twitter’s Data Science Community
Here’s an example of a similar mapping I ran for the data science community. I used a very similar methodology to the one described above, only taking users who have one of the following phrases in their Twitter bios: Data Science, Data Scientist, Machine Learning, Data Strateg*
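The bio filter described above could look something like the following sketch. The phrase list comes straight from the paragraph; treating the match as case-insensitive, and reading the trailing * in "Data Strateg*" as "any word ending" (strategy, strategist, and so on), are my assumptions. The sample profiles are invented.

```python
import re

# Phrases from the post; "Data Strateg*" is read as a prefix wildcard (assumption).
BIO_PATTERN = re.compile(
    r"\b(?:data\s+science|data\s+scientist|machine\s+learning|data\s+strateg\w*)\b",
    re.IGNORECASE,
)


def matches_bio(bio):
    """Return True if a bio mentions one of the target phrases."""
    return bool(BIO_PATTERN.search(bio or ""))


# Hypothetical profiles, not real accounts.
profiles = {
    "user_a": "Data Scientist. Coffee drinker.",
    "user_b": "Head of data strategy at a small startup",
    "user_c": "Python developer and cyclist",
    "user_d": None,  # some profiles have no bio at all
}

selected = sorted(handle for handle, bio in profiles.items() if matches_bio(bio))
```

Here `selected` contains only `user_a` and `user_b`. A real run would apply the same check to every profile that tweeted during the observed week.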
This resulted in a set of 1053 users who posted 14k tweets during the observed period of a week. Amongst those who posted the most were @data_nerd (659 Tweets!), @Chantel_Esworth (562) and @Da5_12 (253). Yet these three VERY NOISY profiles aren’t necessarily the most important or interesting part of the data science community. Here’s what the network looks like:
There’s a hairball-esque tight cluster that represents the majority of the identified community on Twitter, with a few offshoots (BTW, the tight cluster at the bottom right is data-strategy students at Sweden’s Hyper Island). If we dive into the main section, we can get a better understanding of the different clusters that make up the community (zoomable embedded graph below):
Each color represents a modularity class: effectively, a region of the graph that is much more interconnected than the norm. The users within each modularity class tend to have one or more significant attributes in common. In the case of the Python mapping above, language was the clear differentiator. Here, that’s not the case. This gets very tricky.
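To give a feel for what "more interconnected than the norm" means, here is a small sketch that scores a given grouping using the standard modularity formula for an undirected graph: for each community, the fraction of edges that fall inside it minus the fraction you would expect by chance given its members' degrees. This only scores a partition; it is not the community detection step itself, which Gephi handles. The node names and edges are toy data, and treating follow relationships as undirected is a simplifying assumption.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)      # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridging edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
]

two_groups = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
one_group = {node: 1 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores 5/14 (about 0.36), while lumping everything into one group scores 0, which is why the split is the better description of the structure.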
With Hilary Mason’s immense help, we attempted to understand what each region of the graph means. Purple seems to be a mix of east coast folks and academics, while the dark blue is the west coast data drinking crew. Yellow looks like west coast social network folks, while green are people who have been doing it for a while. Although @BigDataBorat is identified within that segment… hmmm… The orange cluster is harder to nail down. Perhaps more academic, applied math and less tech-scene? @seanjtaylor seems to bridge between the two.
Remember, the clusters are based on embedded social interactions. The fact that more people are connected to each other in one portion of the graph is a significant signal. It just isn’t always easy to label it. Additionally, people move between jobs and cities all the time. The fact that someone may be highly connected to the west coast data science scene doesn’t necessarily mean that they are physically a part of it. Monica Rogati (@mrogati) is identified as more interconnected with the east coast group of dataists even though she’s out west, working at LinkedIn. This could be because she spent many years at CMU, or perhaps because she actively maintains connections to the data science community back east.
With these types of mappings, the community itself is often much better at understanding what the segments mean. Obviously, this doesn’t necessarily represent all the important people in the community, only those who are active on Twitter. There’s an inherent bias towards those who have been using Twitter for longer, as their networks tend to be more developed. Hoping to get a few friends to help me out with the classification here!
Link to Video: HERE (if embed doesn’t show)