Random thoughts on big data
I began blogging in 2005; back then I managed to post something new almost every day. Now, 10 years later, I hardly post anything. I was beginning to think I don't have anything left to say, but I recently noticed I have quite a few posts in various states of "draft". I guess I am spending too much time thinking about how to get a polished idea out there, rather than just going ahead and writing what's on my mind. This post is an attempt to change that by putting out some thoughts I have (on big data in this case) without worrying too much about how complete and polished they are.
Anyway, here we go:
- All data is time-series: When data is added to the big data store (Hadoop or otherwise) it is already historical, i.e. it is being imported from a transactional system, even if it is being streamed into the platform. If you treat the data as historical and somehow version it (most simply by adding a timestamp) before you store it, you will be able to see how the data changes over time. When you relate it to other data you'd be able to see both the way the system was at the time a particular piece of data (e.g. an event) was created, as well as get the latest version and see its state now. Essentially, treating all data as slowly changing dimensions gives you enhanced capabilities when you want to analyse the data later (see the first sketch after this list).
- Enrich data with "foreign keys" before persisting it (or close to it): Usually a data source does not stand alone and it can be related to other data, either from the same source or otherwise. Resolving some of these correlations when the data is fresh and the context is known can save a lot of time later, both because otherwise you'd probably redo the correlations multiple times instead of once, and because when the data is ingested the context and relations are more obvious than when, say, a year later, you try to make sense of the data and recall how it relates to other data (second sketch below).
- Land data in a queue: This ties nicely to the previous two points, as the kind of enrichments mentioned above are well suited to handling in streaming. If you land all data in a queue you gain a unified ingestion pipeline for both batch and streaming data. Naturally not all computations can be handled in streaming, but you'd be able to share at least some of the pipeline (third sketch below).
- Lineage is important (and doesn't get enough attention): Raw data is just that; to get insights you need to process it, enrich it and aggregate it, and a lot of the time this creates a disconnect between the end result and the original data. Understanding how insights were generated is important both for debugging problems and for ensuring compliance (for actions that demand it). A small sketch of this also follows the list.
- Not everything is big data: Big data is all the rage today, but a lot of problems don't require it. Moreover, when you move to a distributed system you both complicate the solution and, more importantly, slow down the processing (until, of course, you hit the threshold where the data can't be handled by a single machine). This is even truer for big data systems, where you have a lot of distributed nodes and the management (coordination etc.) is more costly (back at Nice we referred to the initialization time for Hadoop jobs as "waking the elephant", and we wanted to make sure we really needed to wait for it).
- Don't underestimate scale-up: A related point to the above. Machines today are quite powerful, and when the problem is only "biggish" it may well be that a scale-up solution would solve it better and cheaper. Read, for example, "Scalability! But at what COST?" by Frank McSherry and "Command-line tools can be 235x faster than your Hadoop cluster" by Adam Drake as two illustrations of these last two points.
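
To make the first point a bit more concrete, here is a minimal sketch in Python (all record and field names are made up for illustration) of stamping each incoming record with a version and an ingestion timestamp before persisting it, so that both point-in-time and latest-state views are possible later:

```python
import time
import uuid

def version_record(record, source):
    """Wrap an incoming record with versioning metadata so it can be treated
    as a slowly changing dimension rather than a mutable row."""
    return {
        "key": record["id"],              # natural key from the source system
        "version_id": str(uuid.uuid4()),  # unique id for this particular version
        "ingested_at": time.time(),       # when we saw this version
        "source": source,                 # where it came from
        "payload": record,                # the record itself, kept immutable
    }

# "Latest state" is then simply the version with the max ingested_at per key,
# while point-in-time queries filter on ingested_at <= t.
```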
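For the enrichment point, a similar sketch (the customer lookup and field names are hypothetical) of resolving a "foreign key" while the data is fresh and the context is still known:

```python
# Hypothetical reference data loaded from another source (e.g. a customer master).
customer_id_by_email = {"alice@example.com": "cust-42"}

def enrich(event):
    """Attach the related customer's id at ingestion time, while the relation
    between the event and the customer data is still obvious."""
    enriched = dict(event)
    enriched["customer_id"] = customer_id_by_email.get(event.get("email"))
    return enriched
```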
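For landing data in a queue, a rough sketch of a single ingestion entry point that enriches and versions a record and then publishes it to one topic that both streaming jobs and the batch loader consume. Kafka (via the kafka-python package) and the topic name are just assumptions for illustration, and the helpers come from the two sketches above:

```python
import json
from kafka import KafkaProducer  # assumes the kafka-python package is installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest(record, source):
    # Enrich while context is known, version it, and land it on a single topic;
    # streaming jobs consume it directly, the batch loader drains it to storage.
    producer.send("landing-zone", version_record(enrich(record), source))
```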
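Finally, lineage can start as simply as carrying a record of the processing steps along with the data. A sketch, reusing the versioned-record shape from the first snippet:

```python
def apply_step(record, step_name, fn):
    """Run a processing step and append it to the record's lineage, so a
    derived result can be traced back to the versions it was computed from."""
    lineage = record.get("lineage", []) + [
        {"step": step_name, "input_version": record["version_id"]}
    ]
    return {**record, "payload": fn(record["payload"]), "lineage": lineage}
```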
This concludes this batch of thoughts. Comments and questions are welcome.

