I began blogging in 2005, and back then I managed to post something new almost every day. Now, ten years later, I hardly post anything. I was beginning to think I had nothing left to say, but I recently noticed I have quite a few posts in various states of “draft”. I guess I am spending too much time thinking about how to get a polished idea out there rather than just writing what’s on my mind. This post is an attempt to change that by putting out some thoughts I have (on big data, in this case) without worrying too much about how complete and polished they are.
Anyway, here we go:
- All data is time-series – By the time data is added to the big data store (Hadoop or otherwise) it is already historical, i.e. it was imported from a transactional system, even if it is being streamed into the platform. If you treat the data as historical and version it somehow (most simply, by adding a timestamp) before you store it, you can see how the data changes over time – and when you relate it to other data, you can see both how things looked at the time a particular piece of data (e.g. an event) was created and, by getting the latest version, its state now. Essentially, treating all data as slowly changing dimensions gives you enhanced capabilities when you want to analyze the data later.
- Enrich data with “foreign keys” before persisting it (or close to it) – Usually a data source does not stand alone and can be related to other data, either from the same source or otherwise. Resolving some of these correlations while the data is fresh and the context is known can save a lot of time later, both because otherwise you would probably redo these correlations multiple times instead of once, and because at ingestion time the context and relations are more obvious than when, say, a year later you try to make sense of the data and recall how it relates to other data.
- Land data in a queue – This ties nicely to the previous two points, as the kinds of enrichments mentioned above are well suited to handling in streaming. If you land all data in a queue you gain a unified ingestion pipeline for both batch and streaming data. Naturally, not all computations can be handled in streaming, but you can share at least some of the pipeline.
- Lineage is important (and doesn’t get enough attention) – Raw data is just that; to get insights you need to process, enrich and aggregate it, and a lot of the time this creates a disconnect between the end result and the original data. Understanding how insights were generated is important both for debugging problems and for ensuring compliance (for actions that demand it).
- Not everything is big data – Big data is all the rage today, but a lot of problems don’t require it. Moreover, when you move to a distributed system you both complicate the solution and, more importantly, slow the processing (until, of course, you hit the threshold where the data can’t be handled by a single machine). This is even truer for big data systems, where you have a lot of distributed nodes and the management (coordination etc.) is more costly (back at Nice we referred to the initialization time of Hadoop jobs as “waking the elephant”, and we wanted to make sure a job was worth the wait).
- Don’t underestimate scale-up – A point related to the one above. Machines today are quite powerful, and when the problem is only “biggish” it may be that a scale-up solution solves it better and cheaper. See, for example, “Scalability! But at what COST?” by Frank McSherry and “Command-line tools can be 235x faster than your Hadoop cluster” by Adam Drake as two examples supporting these last two points.
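To make the first point (all data is time-series) concrete, here is a minimal sketch of the “most simplistic” timestamp-based versioning in Python. The record shape, the `id` business key and the in-memory dict standing in for the store are all illustrative assumptions, not a real storage layer:

```python
import time

def version_record(record, store, ts=None):
    """Append-only versioning: never overwrite, just add a timestamped copy.
    `record` is any dict with a stable business key under "id"."""
    versioned = dict(record, ingested_at=time.time() if ts is None else ts)
    store.setdefault(record["id"], []).append(versioned)
    return versioned

def latest(store, key):
    """Current state: the most recently ingested version."""
    return max(store[key], key=lambda r: r["ingested_at"])

def as_of(store, key, ts):
    """State as it was at time `ts`: the newest version not after `ts`."""
    candidates = [r for r in store[key] if r["ingested_at"] <= ts]
    return max(candidates, key=lambda r: r["ingested_at"]) if candidates else None
```

With this in place, `latest` answers “what is the state now” and `as_of` answers “what did the system look like when that event happened”, which is exactly the slowly-changing-dimensions behavior described above.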
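The “foreign keys” point can be sketched the same way: resolve references while the ingest-time context is at hand, before persisting. The field names (`customer_email`, `sku`) and the plain-dict lookup tables are hypothetical stand-ins for whatever reference data the source actually relates to:

```python
def enrich(event, customer_index, product_index):
    """Resolve foreign keys at ingestion time, while context is known.
    The index dicts stand in for reference/lookup tables."""
    enriched = dict(event)
    # Unresolvable references become None rather than failing the pipeline.
    enriched["customer_id"] = customer_index.get(event["customer_email"])
    enriched["product_id"] = product_index.get(event["sku"])
    return enriched
```

Doing this once up front means later analyses join on stable IDs instead of each re-deriving the correlations from raw fields.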
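For the lineage point, the simplest remedy for the disconnect between results and raw data is to record, alongside every derived value, which inputs and which operation produced it. A minimal sketch, with an illustrative record shape (real systems track this in dedicated metadata stores):

```python
def derive(inputs, op_name, compute):
    """Compute a derived value and record its provenance.
    `inputs` is a list of (record_id, value) pairs; `compute` maps the
    raw values to the derived value."""
    value = compute([rec for _, rec in inputs])
    return {
        "value": value,
        "lineage": {"op": op_name, "sources": [rid for rid, _ in inputs]},
    }
```

When an insight looks wrong (or an auditor asks where a number came from), the `lineage` field lets you walk back from the aggregate to the exact raw records and operation that produced it.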
This concludes this batch of thoughts. Comments and questions are welcome.