Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It is all about adding substantial enterprise value by learning from data. "Data Scientist" has become a popular occupation, with Harvard Business Review dubbing it "The Sexiest Job of the 21st Century" and McKinsey & Company projecting a global excess demand of 1.5 million new data scientists. The term "data science" (originally used interchangeably with "datalogy") has existed for decades; Peter Naur used it in 1960 as a substitute for computer science.
Every organization has to decide its own trade-offs when structuring its data science department or team. For some, the benefits of an autonomous stand-alone team outweigh the risk of that team being marginalized. For others, the organizational alignment of an integrated team doesn’t justify the challenges that model creates around hiring and culture. It’s up to you to pick the model that works best for your company.
Harvard Business Review's look at the sexiest job of the 21st century reflected a positive shift in analytics. Whereas companies once maintained tight control over data warehouses, they are now shifting toward more agile analytic environments, because the drive for data-driven decision-making has catalyzed the need for a different type of work. Today, data quality is no longer about a central truth but instead depends on the goal of the analytic task. Exploratory analysis and visualization require that analysts can fluidly access disparate sources of data in various formats.
Instead of solving these problems, organizations are often adding to the amount of data that requires a data scientist’s attention. Through activity and system logs, third-party APIs and vendors, and other publicly available data, companies have access to an increasingly large and diverse set of data sources. But without the right systems in place, the prohibitive cost of data manipulation leaves much of this data dormant in “data lakes.”
Here is a list of important things that data scientists do:
- Problems first, NOT tools / methods
- No one cares how you did it
- Measure carefully
- Write everything down
- Put questions / problems first
- Generate useful projects
- Have a well-understood role
Integrated data scientists lack the autonomy and visibility they would have in a stand-alone team, and the head of data science (if there is one) risks being a figurehead rather than a true leader. Indeed, the leader of an integrated team needs to be someone who can effectively manage both engineers and data scientists. In addition, integrating data scientists into established teams is a less flexible approach than embedding them on an as-needed basis. Finally, the lack of a core data science team in an organization can create challenges around hiring, knowledge sharing, and career development. Specifically, if data scientists are a minority within an organization dominated by engineers, there’s a risk that they’ll get the short end of the cultural stick.
Data science is an engineering discipline. Data scientists often deal in prototypes or proofs of concept. Prototype (n): Tomorrow's production, if it works. A few questions to ask:
- How is the training data sampled?
- How do I evaluate my work?
- Do we document thoroughly, including failures (models that you tried that did not predict well)?
- Do we balance short-term and long-term projects?
- Do you have a fully centralized data science team?
Let's think about the advantages and disadvantages of an independent data science team:
Pros: Freedom for data scientists, opportunities for research, long-term thinking, idea generation.
Cons: May lead to methods-first thinking and “butterfly chasing.”
A group of University of Pennsylvania researchers who analyzed Facebook status updates of 75,000 volunteers have found an entirely different way to analyze human personality, according to a new study published in PLOS One. The volunteers completed a common personality questionnaire through a Facebook application and made their Facebook status updates available so that researchers could find linguistic patterns in their posts. Drawing from more than 700 million words, phrases, and topics, the researchers built computer models that predicted the individuals’ age, gender, and their responses on the personality questionnaires with surprising accuracy.
Another example is how Intel is investing in big data. Machine learning is expected to become increasingly central to other technologies, and Intel is looking to capitalize on that early in order to serve the market and increase its profits.
Intel is also using big data to inform better sales and marketing decisions. The idea there is to collect historical data about Intel’s 140,000 customers so that sales reps can focus on the right ones (kind of like an internal version of Infer). One part of this process is a similarity analysis of sorts to find customers that have the same types of buying patterns or perhaps similar needs, kind of like how Amazon recommends products that are often purchased together or viewed by the same people.
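To make the idea of a similarity analysis concrete, here is a minimal sketch that ranks customers by how similar their buying patterns are. It uses cosine similarity, which is one common choice and not necessarily what Intel uses; the customer names, product categories, and purchase counts are all made up for illustration.

```python
import math

# Hypothetical purchase counts per product category for each customer.
# Names, categories, and numbers are invented for this example.
purchases = {
    "customer_a": [12, 0, 3, 1],
    "customer_b": [10, 1, 4, 0],
    "customer_c": [0, 8, 0, 7],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no purchase history, so treat as dissimilar
    return dot / (norm_u * norm_v)


def most_similar(target, table):
    """Rank every other customer by similarity to `target`, best first."""
    scores = [
        (other, cosine_similarity(table[target], vector))
        for other, vector in table.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("customer_a", purchases)
print(ranking)
```

Here customer_b ranks first for customer_a because both buy mostly from the same categories, while customer_c buys from different ones. A sales rep could use such a ranking to find accounts that resemble an existing good customer.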
ML algorithms can be grouped into families based on the type of question they answer. These families can help guide your thinking as you formulate your razor-sharp question.
Machine learning is the science of getting computers to act without being explicitly programmed. It has given us self-driving cars, practical speech recognition, and effective web search. The process of machine learning is similar to that of data mining: both search through data to look for patterns. Data mining is an analytical process designed to explore large amounts of data and extract patterns for people, such as data scientists, to interpret. Machine learning instead uses the data for the computer's own use: machine learning programs detect patterns in data and adjust program actions accordingly.
Machine learning is so powerful that even a very basic algorithm can do wonders in predicting and classifying information. At the same time, many people have a great amount of data on their hands that they want analyzed. These two facts complement each other nicely. If you are looking for a simple, hands-on project to start exploring your skills, search for one on Quora, Coursera, or any Stack Exchange site or mailing list. If you are an active blogger, try to predict where your audience traffic will come from, given the text of a blog post. If you are active on Twitter, try to predict how many people will retweet a tweet of yours. If you are active on Facebook, try to predict how many people will like a post.
Data mining is especially important for business managers because the data mined is usually marketing or business data. Data mining is mainly used to analyze user behavior by searching for patterns or systematic relationships between variables, and then validating the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction. The working assumption is that people who currently behave in the same way as other people did in the past will go on to take the same actions that the original group took. Take shopping cart abandonment as an example: say your average abandonment rate has been 60%, but in the past people who were associated with three specific variables had only a 40% abandonment rate. We can assume that other people who can be associated with those three variables today will probably show the same 40% abandonment rate. These variables could be demographic, like gender and age, or behavioral, like purchasing specific items or clicking on certain links.
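The shopping cart arithmetic can be sketched in a few lines of Python. The three variables chosen here (age group, returning visitor, coupon use) and the session records are hypothetical; they are constructed so that the overall rate is 60% and the segment rate is 40%, matching the numbers in the example.

```python
# Each record: (age_group, returning_visitor, used_coupon, abandoned_cart).
# All values are invented to reproduce the 60% / 40% example.
sessions = (
    [("25-34", True, True, False)] * 3      # in segment, purchased
    + [("25-34", True, True, True)] * 2     # in segment, abandoned
    + [("45-54", False, False, True)] * 4   # outside segment, abandoned
    + [("45-54", False, True, False)] * 1   # outside segment, purchased
)


def abandonment_rate(rows):
    """Fraction of sessions that ended with an abandoned cart."""
    rows = list(rows)
    if not rows:
        return None  # no data for this group
    return sum(1 for row in rows if row[3]) / len(rows)


def in_segment(row):
    """True when a session matches all three (hypothetical) variables."""
    age_group, returning_visitor, used_coupon, _ = row
    return age_group == "25-34" and returning_visitor and used_coupon


overall = abandonment_rate(sessions)
segment = abandonment_rate(row for row in sessions if in_segment(row))
print(f"overall: {overall:.0%}, segment: {segment:.0%}")
```

In practice the segment rate would be measured on historical sessions and then checked against a fresh subset of data before being used as a prediction for new visitors.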
Is this A or B? This family is formally known as two-class classification. It’s useful for any question that has just two possible answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like this or can be rephrased to fit this form. It’s the simplest and most commonly asked data science question. For example: Will this customer renew their subscription?
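As a minimal sketch of a two-class question, the renewal example can be answered with a k-nearest-neighbors vote. This is just one of many two-class algorithms; the features (months subscribed, logins per week) and the training records are assumptions made up for illustration.

```python
import math
from collections import Counter

# Hypothetical history: (months_subscribed, logins_per_week) -> renewed?
training = [
    ((24, 5.0), True),
    ((18, 4.0), True),
    ((30, 6.0), True),
    ((2, 0.5), False),
    ((3, 1.0), False),
    ((1, 0.0), False),
]


def predict_renewal(features, examples, k=3):
    """Vote among the k training examples closest to `features`."""
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], features))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


loyal = predict_renewal((20, 4.5), training)
at_risk = predict_renewal((2, 0.2), training)
print(loyal, at_risk)
```

The features are left unscaled to keep the sketch short. With real data you would normally put them on comparable scales first, so that one feature does not dominate the distance.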
Is this A or B or C or D? This algorithm family is called multi-class classification. As its name implies, it answers a question that has several (or even many) possible answers: which flavor, which person, which part, which company, which candidate. Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here is an example: Which animal is in this image? By contrast, a question such as “Are these voltages normal for this season and time of day?” has only two possible answers, so it is not a multi-class question.
How Much / How Many? When you are looking for a number instead of a class or category, the algorithm family to use is regression. For example: What will the temperature be next Tuesday?
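A minimal regression sketch for the temperature question follows. It fits a straight line by ordinary least squares and extrapolates to a later day. The observations are invented and perfectly linear to keep the arithmetic easy to follow; real temperature data would be noisy and would rarely follow a straight line for long.

```python
# Hypothetical observations: (day index, high temperature in degrees).
observations = [(1, 70.0), (2, 72.0), (3, 74.0), (4, 76.0), (5, 78.0)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(observations)
next_tuesday = 8  # assumed day index of the day being forecast
forecast = slope * next_tuesday + intercept
print(f"forecast for day {next_tuesday}: {forecast:.1f} degrees")
```

The answer is a number on a continuous scale, not a category, which is what distinguishes this family from classification.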
Multi-Class Classification as Regression. Sometimes questions that look like multi-value classification questions are actually better suited to regression. For instance, “Which news story is the most interesting to this reader?” appears to ask for a category, that is, a single item from the list of news stories. However, you can reformulate it as “How interesting is each story on this list to this reader?” and give each article a numerical score. Then it is a simple thing to identify the highest-scoring article. Questions of this type often occur as rankings or comparisons. For example: Which 5% of my customers will leave my business for a competitor in the next year?
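Once every item has a numerical score, both the "most interesting story" and the "top 5%" questions reduce to sorting. The sketch below assumes the scores already exist, as if produced by some regression model; the story names, customer ids, and score values are hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the ids with the highest scores, covering `fraction` of all ids."""
    count = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:count]


# Hypothetical interest scores for one reader, one score per story.
story_scores = {"story_1": 0.31, "story_2": 0.87, "story_3": 0.55}
best_story = max(story_scores, key=story_scores.get)

# Hypothetical churn scores for 40 customers (higher means more likely to leave).
churn_scores = {f"customer_{i}": i / 40 for i in range(40)}
most_likely_to_leave = top_fraction(churn_scores, 0.05)

print(best_story)
print(most_likely_to_leave)
```

The hard part is producing good scores. The ranking step itself is only a sort and a cut-off.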
Two-Class Classification as Regression. It may not come as a surprise that binary classification problems can also be reformulated as regression. (In fact, under the hood some algorithms reformulate every binary classification as regression.) This is especially helpful when an example can belong partly to A and partly to B, or has a chance of going either way. When an answer can be partly yes and partly no, or probably on but possibly off, regression can reflect that. Questions of this type often begin with “How likely…” or “What fraction…”. For example: How likely is this user to click on my ad?
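Logistic regression is a standard example of treating a yes/no question as a "how likely" question: it outputs a probability between 0 and 1 instead of a hard label. The sketch below trains a one-feature model with plain gradient descent. The feature (an ad relevance score), the click data, and the training settings are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical history: (ad relevance score in [0, 1], clicked? 1 or 0).
history = [
    (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
    (0.5, 0), (0.6, 1), (0.7, 1), (0.9, 1),
]


def fit_logistic(samples, learning_rate=0.5, epochs=2000):
    """Fit p(click) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in samples:
            error = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


w, b = fit_logistic(history)
p_low = sigmoid(w * 0.15 + b)   # estimated click probability, low relevance
p_high = sigmoid(w * 0.85 + b)  # estimated click probability, high relevance
print(f"p(click | 0.15) = {p_low:.2f}, p(click | 0.85) = {p_high:.2f}")
```

If a hard yes/no answer is needed later, it can be recovered by thresholding the probability, for example at 0.5.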
How is this Data Organized? Questions about how data is organized belong to unsupervised learning. There is a wide variety of techniques that try to tease out the structure of data. One family of these performs clustering, a.k.a. chunking, grouping, bunching, or segmentation. These techniques seek to separate a data set into intuitive chunks.
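As one illustration of clustering, here is a small k-means sketch (Lloyd's algorithm). To keep it deterministic it uses the first k points as starting centroids, which is a simplification; practical implementations usually choose starting points more carefully. The data points are hypothetical customers described by two made-up measurements.

```python
import math


def k_means(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignment


# Hypothetical customers: (visits per month, average basket size).
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied here: the algorithm only sees the points and discovers the two groups on its own, which is what makes this unsupervised.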
What Should I Do Now? A third extended family of ML algorithms focuses on taking actions. These are called reinforcement learning (RL) algorithms. They are a little different from the supervised and unsupervised learning algorithms. A regression algorithm might predict that the high temperature will be 98 degrees tomorrow, but it doesn’t decide what to do about it. An RL algorithm goes a step further and chooses an action.
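A very small example of learning which action to take is the epsilon-greedy strategy for a multi-armed bandit, one of the simplest settings studied in reinforcement learning. The sketch below is an illustration under stated assumptions: the three actions, their average rewards, and the noise model are invented, and the environment is simulated.

```python
import random

# Hypothetical average reward of each action on a hot day; unknown to the learner.
TRUE_MEAN_REWARD = {
    "open_windows": 0.2,
    "run_fan": 0.5,
    "run_air_conditioning": 0.8,
}


def pull(action, rng):
    """Simulated environment: the action's mean reward plus bounded noise."""
    return TRUE_MEAN_REWARD[action] + rng.uniform(-0.1, 0.1)


def epsilon_greedy(actions, steps=500, epsilon=0.1, seed=0):
    """Estimate each action's reward while mostly choosing the best so far."""
    rng = random.Random(seed)
    counts = {a: 0 for a in actions}
    estimates = {a: 0.0 for a in actions}
    for step in range(steps):
        if step < len(actions):
            action = actions[step]  # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)  # explore a random action
        else:
            action = max(actions, key=estimates.get)  # exploit the best estimate
        reward = pull(action, rng)
        counts[action] += 1
        # Incremental update of the running mean reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


actions = list(TRUE_MEAN_REWARD)
estimates, counts = epsilon_greedy(actions)
best_action = max(estimates, key=estimates.get)
print(best_action, counts)
```

Unlike the regression example, the output is a decision: the learner tries actions, observes rewards, and gradually favors the action that has paid off best.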
Problems that can be solved with data science:
- Customer segmentation
- Identify the tipping point to retain users for collaboration
- Personalize sharing suggestions and feature promotion or marketing campaign
- Customized, patient-specific medications and diets
- Predicting earthquakes
- Automated piloting (drones, cars without pilots)
- Sports betting
- Predicting oil demand, oil reserves, oil price, impact of coal usage
- Predicting volcano risk, to evacuate populations or cancel flights, while minimizing expenses caused by these decisions
- Predicting book sales, determining correct price, price elasticity and whether a specific book should be accepted or rejected by a publisher, based on projected ROI
- Predicting the duration of a road trip, doing much better than GPS systems not connected to the Internet
- Asteroid risks
- Road construction, HOV lanes, and traffic lights designed to optimize highway traffic
- Actuarial science: predicting death and health expenditures to compute premiums (based on which population segment you belong to)
Markets and sectors demanding data scientists
Because the job requires a special blend of experience and knowledge, finding professionals who meet the market’s challenges is complicated, so much so that the industry refers to them as “unicorns”. However, the demand for information and training in this field is increasing. A simple search on the jobs platform Indeed shows how interest in this discipline has come a long way since 2011.
One very important aspect of data science is predictive analytics. When faced with a business problem, you should be able to assess whether and how data can improve performance. Data scientists need to be part statistician, part hacker, part engineer, part data analyst, part business consultant, part artist, and part storyteller. The data scientist needs to be able to talk with multiple parts of the business, gather all the data, connect the dots, spot the most relevant insights, and then translate them into actionable suggestions. Understanding the fundamental concepts and having frameworks for organizing data-analytic thinking will help you envision opportunities for improving data-driven decision-making and see data-oriented competitive threats.
A few related TED talks:
- The best stats you’ve ever seen
- How I hacked online dating
- What makes a good life? Lessons from the longest study on happiness
The data scientist must have knowledge of applied science, extensive experience in their industry, and scientific training. The market is no longer what it used to be. Almost every industry is being affected by the sheer volume and ubiquity of Big Data, and no business is immune. A lack of data science knowledge on the business manager’s side is especially damaging because data science supports bottom-line decision making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.
Data science has enabled us to solve complex and diverse problems by using machine learning and statistical algorithms. Data science job descriptions are so varied that they are hard to compare. Clearly defined or not, the energy around data science is enormous: universities are launching training and research facilities, municipalities such as New York and Seattle are competing to become the center of the data science world, and vendors such as Cloudera have launched data science certification programs. Data science is an exciting area that promises many real benefits to organizations in nearly all industries. It should not be stuck in a frustrating pursuit of imaginary heroes of folklore.