This case study describes the experience and expectations of data scientists, engineers, and analysts at Twitter. The authors, Jimmy Lin and Dmitriy Ryaboy, share invaluable lessons from the many challenges they faced while scaling Twitter's big data infrastructure. They hope the case study helps other companies and interested parties avoid the same pitfalls as they scale their own infrastructure, and that it gives academics insight into what big data analysis actually looks like at the production level, so that they can structure their research objectives accordingly.
The Big Data Mining Cycle
Understanding the data:
In academia, we are frequently concerned with extracting features and training models in order to perform insightful analysis. This is frequently the case because the data, no matter how large, is available, even if not yet complete, but in a unified/singular format because only a small number of people are involved in the project. However, this is not the case with a global company.
First, before data analysts can do any work, they must collect data from the different teams developing different services and understand what that data means: the various logging methodologies, the naming conventions used for variables, and so on.

They then have to join data from these different sources to make sense of it, producing records from which features can be extracted.

The problem is that this consumes a great deal of the analysts' time, since they were not involved with the development teams, and if they treat this step lightly, the insights they generate will be biased.
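The joining step described above can be sketched in a few lines. This is a minimal pure-Python illustration with hypothetical services and field names (not Twitter's actual logs): two services log the same user under different naming conventions, so the records must be normalized before they can be joined.

```python
# Service A logs with camelCase keys; Service B uses snake_case.
# Both services and all field names here are hypothetical.
service_a_logs = [
    {"userId": 42, "clientEvent": "tweet_click"},
    {"userId": 7, "clientEvent": "profile_view"},
]
service_b_logs = [
    {"user_id": 42, "session_ms": 1834},
    {"user_id": 7, "session_ms": 210},
]

def normalize(record, key_map):
    """Rename keys to one shared convention."""
    return {key_map.get(k, k): v for k, v in record.items()}

a = [normalize(r, {"userId": "user_id", "clientEvent": "event"})
     for r in service_a_logs]

# Join on user_id to produce one record per user -- the raw
# material from which features can later be extracted.
by_user = {r["user_id"]: dict(r) for r in a}
for r in service_b_logs:
    by_user.setdefault(r["user_id"], {}).update(r)

joined = sorted(by_user.values(), key=lambda r: r["user_id"])
```

Even this toy version shows why skipping the step is dangerous: if the key mapping is wrong or a source is silently dropped from the join, every downstream feature inherits the bias.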
Storing the Data, Schemas, and Catalog:
Naturally, MySQL is the first and easiest option for storing data and logs, and it often works well for small projects. But it breaks down when it comes to storing truly large volumes of data. Fortunately, there is the Hadoop Distributed File System (HDFS) of a Hadoop cluster: data can be moved from the point where it is generated into a Hadoop warehouse, and this is where the tooling comes in. Twitter uses Scribe to achieve this. All the logs are then stored in HDFS in an orderly manner for the data scientists to work on.
The next problem in the queue is the structure of the logs. Again, JSON is the go-to and easiest option for many developers, but it is the messiest when it comes to structuring big data: nothing prevents different services from logging the same field under different names or types. Here another tool, Apache Thrift, is used by Twitter to serialize the logged messages and to define schemas and structures for the various stored logs.
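To make the Thrift idea concrete, a log schema is declared once in Thrift's interface definition language and then shared by every producer and consumer of that log. The struct below is purely illustrative (field names invented for this summary, not Twitter's actual schema):

```thrift
// Hypothetical Thrift definition for a client event log.
struct ClientEvent {
  1: required i64 userId;       // who triggered the event
  2: required i64 timestampMs;  // when, in epoch milliseconds
  3: required string eventName; // e.g. "tweet_click"
  4: optional string sessionId; // may be absent for some clients
}
```

Because the field names, types, and requiredness live in one place, the "is this field camelCase or snake_case, string or int?" guesswork from the previous section largely disappears.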
Lastly, the work doesn't stop at serialization: everybody in the organization who wants to work with the serialized data in the warehouse needs to know what those schemas represent. HCatalog is used to achieve this.
Once the data is ready, the next step is to apply ML algorithms to train a model. The downside is that most algorithms are designed for sequential data that fits into memory; to make them work with big data stored in a warehouse, one must sample a subset of the data, extract features, and then train, which defeats the whole purpose of big data analysis. This is where scalable ML comes into the picture.
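One common alternative to sampling is online learning: stream the records past the model and update it one example at a time, so the full dataset never has to fit in memory. Below is a toy pure-Python sketch of this idea (stochastic gradient descent for logistic regression on synthetic data); it illustrates the general technique, not any specific system from the paper.

```python
import math
import random

random.seed(0)

def record_stream(n):
    """Simulate records arriving from a warehouse scan.
    Label is 1 iff x0 + x1 > 1 (a known synthetic rule)."""
    for _ in range(n):
        x = [random.random(), random.random()]
        yield x, 1 if x[0] + x[1] > 1.0 else 0

w = [0.0, 0.0]  # model weights
b = 0.0         # bias
lr = 0.5        # learning rate

# One pass over the stream; the dataset is never materialized in memory.
for x, y in record_stream(20000):
    z = w[0] * x[0] + w[1] * x[1] + b
    p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
    g = p - y                        # gradient of the log loss w.r.t. z
    w[0] -= lr * g * x[0]
    w[1] -= lr * g * x[1]
    b -= lr * g

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

The same single-pass structure is what lets learners run over warehouse-scale data: memory use is constant in the number of records.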
The data mining cycle has to happen in a synchronized manner at various intervals. Achieving this is a big problem when a product, in this case Twitter, is made up of multiple complex systems, each with its own features. The set of challenges that occur while making this synchronization a reality is termed plumbing.
Integrated Solutions / Scalable ML:
To deal with plumbing issues, companies frequently integrate ML components into their existing tools. Twitter, for example, integrated its ML components into Pig, so ML is just another Pig script. As a result, ML fits easily into the existing infrastructure and has access to all of Pig's libraries.
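In Pig-like pseudocode, "ML as just another Pig script" looks roughly like the sketch below. The loader and UDF names here are invented for illustration; they are not Twitter's actual library:

```
-- Training: the model is produced by an ordinary STORE,
-- using a hypothetical storage function that trains a classifier.
training = LOAD '/tables/labeled_examples' USING ExampleLoader();
STORE training INTO '/models/spam_classifier' USING TrainClassifier();

-- Prediction: the trained model is wrapped as an ordinary UDF.
define Classify ClassifyWithModel('/models/spam_classifier');
statuses = LOAD '/tables/statuses' USING StatusLoader() AS (id, text);
scored   = FOREACH statuses GENERATE id, Classify(text) AS label;
```

The appeal is that training and scoring reuse the same dataflow language, scheduler, and data-access libraries as every other analytics job, which is exactly what removes most of the plumbing.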
Conclusions / To Explore
- Cleaning and making sense of data is important
- Active participation of data analysts during software development is a must.
- Unified coding practices across the organization are a must, but hard to achieve in an MNC.
- More sophisticated tools should be developed to minimize the cleaning/aggregating of data.
- More academic ML research needs to be conducted to design/re-vamp algorithms to work with big data.
- How to achieve real-time interactions with a big dataset?
- How to make a big dataset available at the browser level for developers/analysts?
P.S. I would highly recommend reading this paper. A very interesting read indeed. My summary does no justice to the amount of information in it.
- Publication : ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2
- Date : December 2012