Today I attended the XSEDE Big Data Training. In short, this training is an introduction to Big Data and to what's going on in HPC (High Performance Computing), where many critical computations happen: some are national problems, some are billion-dollar problems.
Whenever I want to start learning Machine Learning, Big Data, or Deep Learning, I always get confused by the sheer number of tools. I guess you have already heard of tools such as Pandas, scikit-learn, PyTorch, TensorFlow, Hadoop, Spark, Databricks, or programming languages for data analysis such as R, etc.
And I feel lost in a mess. A big mess. Where do I start?
Every time I get stuck at something, there are always two cases:
- I may be an idiot, or
- I lack necessary information
It's hard to pretend to be an idiot, so it's usually the second case: I'm missing some information.
During the training, here are the key notes. I believe them, since the presenters work at an HPC center; in front of their eyes are very high-performance tools and valuable data sets, so I'm sure they know what they are doing.
There are two tools for big data, Hadoop and Spark:
- Hadoop: Old. They said it right away: "nowadays no one uses it"
- Spark: The one they recommend
The reason is performance. With Spark, an action performed on data can stretch out across multiple cores, multiple nodes, and multiple networks. What it does behind the scenes is extremely user-friendly: you don't have to dig into the source code and modify it. Why? Because that sucks.
Meanwhile, Hadoop can't even be compared with Spark in terms of performance.
Life is short, time is valuable. Don't use Hadoop.
In addition, one nice thing about Spark is that it has a Python API!!!
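To show what that API feels like, here is a toy, pure-Python sketch of the classic word count. The method names (`flatMap`, `map`, `reduceByKey`, `collect`) mirror PySpark's RDD API, but this `ToyRDD` class is my own single-machine illustration, not real Spark; in real Spark, each step would run in parallel across cores and nodes.

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD. Real Spark would partition
    `data` across many cores and nodes; this runs on one machine."""

    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the results.
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        # Combine all values that share a key using f.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return self.data


lines = ToyRDD(["big data is big", "spark is fast"])
counts = (lines
          .flatMap(str.split)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b)
          .collect())
print(dict(counts))  # {'big': 2, 'data': 1, 'is': 2, 'spark': 1, 'fast': 1}
```

In real PySpark the chain has exactly the same shape, except you would start from something like `sc.textFile(...)` instead of a hand-built list.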
Here comes the part where people invest money and time.
When it comes to Machine Learning, there are tutorials with Pandas, scikit-learn, etc.
You may ignore such tools when you talk about Machine Learning on Big Data, which Wikipedia defines as:

> Big data is a term used to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.

Yep, the scale of the data is too big; it doesn't even fit on your laptop, or on a single server. That is where Spark shines brightly.
So, stick to Spark and MLlib.
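To make the MLlib recommendation concrete, here is a minimal, single-machine sketch of what a distributed learner like MLlib's linear regression ultimately fits: a line y ≈ w·x + b, trained by gradient descent. The data and hyperparameters below are made up for illustration; MLlib's value is doing this same kind of fit when the data no longer fits on one machine.

```python
# Tiny made-up dataset, generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0
lr = 0.05  # learning rate (made up for this toy example)
n = len(xs)

for _ in range(2000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Spark's contribution is not the math, which is decades old, but computing those gradient sums in parallel across a cluster when `xs` has billions of rows.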
If you've read this far, you'll see that I separate data science into three (or more) categories, each with its own border.
In the training, I asked the professor a question and we had a discussion afterward:
We've seen people mention PyTorch for Deep Learning, and we've also seen people use TensorFlow. What are the differences between them?
Both work on GPUs. But the TensorFlow core (which is backed by Google) has higher performance and optimized matrix multiplication. Frankly, I don't recall how they evaluated PyTorch, but they didn't even prepare PyTorch material for the training; they went straight to TensorFlow.
What happens at HPC is what happens in industry. I'm sure they don't build a super-high-performance data center to run the wrong tool.
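Under the hood, what both frameworks share is automatic differentiation over a computation graph; the performance race is about how fast they run it on GPUs. Here is a toy scalar version of reverse-mode autodiff, the core mechanism both TensorFlow and PyTorch implement on tensors; this `Var` class is my own illustration, not either framework's API.

```python
class Var:
    """A toy scalar with reverse-mode autodiff: the core idea that
    TensorFlow and PyTorch implement on whole tensors, on GPUs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # (parent_var, local_gradient) pairs

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate gradients down the graph via the chain rule.
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)


x = Var(3.0)
y = Var(4.0)
z = x * y + x      # z = x*y + x, so dz/dx = y + 1, dz/dy = x
z.backward()
print(z.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

The heavy lifting in the real frameworks is doing exactly this bookkeeping over large tensor operations, fused and scheduled efficiently on GPU hardware.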
In the final talk, he mentioned the definition of AI in 2018. Here is the new definition list:
- Captcha: No longer AI. Computer vision has grown to the point where solving Captchas isn't AI anymore.
- Character recognition: No longer falls into the Machine Learning category.
- Chess: Only in the 80s. Now computers are strong enough to brute-force the search paths. Chess isn't AI.
- Go: Remember Google's AlphaGo? No, 2018 is so new that Go isn't AI anymore.
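The chess point above is about exhaustive game-tree search: once a machine can brute-force every line of play, the game stops looking like intelligence. Here is that idea on a deliberately tiny made-up game (a toy Nim: take 1 or 2 stones from a pile; whoever takes the last stone wins), searched exhaustively:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def current_player_wins(stones):
    """Brute-force the whole game tree: the player to move wins if
    any legal move leaves the opponent in a losing position."""
    if stones == 0:
        return False  # no stones left: the previous player took the last one
    return any(not current_player_wins(stones - take)
               for take in (1, 2) if take <= stones)

# Pile sizes where the player to move is doomed against perfect play.
print([n for n in range(1, 10) if not current_player_wins(n)])  # [3, 6, 9]
```

Chess is the same search with a vastly larger tree plus pruning and evaluation heuristics, which is why it took until modern hardware; the logic itself is not "learning" anything.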
Finally, Deep Learning & Neural Networks are the only things defined as AI. Big Data is the adrenaline that boosts them.
To conclude, it's the end of 2018, and I'm thrilled to see how far Data Science has leaped; even what I learned two years ago has turned out to be obsolete. (Disclosure: I got a coursework degree in MIT Big Data Learning.)
So if you go for Big Data or Machine Learning: Spark is the way to go.
If you go for Deep Learning: TensorFlow.
The confusion ends here. I'm out.