Data Science for Business Analysis | Complete Guide | Courses | Basics of Statistics in Data Science

Introduction:

Before studying "Data Science with Statistics" in depth, let's first understand what Statistics means: "Statistics is a branch of applied mathematics that involves the collection, description, analysis, and inference of conclusions from quantitative data".


We can do a lot with data, such as analysing it and making predictions from it. The most important step is choosing a model that suits both the business problem and the data. Budding statisticians are often unsure which model suits which kind of data for a general analysis. Here, I am sharing some key points on when to use which model. Before getting into the modeling part, let's first cover some basic statistical terms.

These days everyone is talking about data. On hearing the word “Data”, the basic questions that arise in our minds are:



What is data?


How is data collected?


How can data be analyzed?


How is data interpreted?


To answer all these questions, we turn to “Statistics”. Statistics is the basic and most important tool for dealing with data. It covers the collection, description, and analysis of data, and the drawing of conclusions from it.


The basic terms I mean here are Descriptive, Predictive, and Prescriptive.


Descriptive, Predictive, and Prescriptive Statistics:

In the Descriptive model, the values in the data are converted to a diagrammatic representation. This helps the statistician or the audience understand the overall shape of the dataset. Once we have an idea of the data, we can proceed to the appropriate main model. We do not change the data at all; we only represent it as diagrams, and the data stays as it is.



Predictive, as the name suggests, means predicting something from the data we hold. The keyword that should come to mind whenever you see something predictive is “How much / How many”. Prediction is about what we expect to happen. In a business problem, we can frame the question as: how much profit will we make next week? Or how many products can we expect to sell next month? We should analyse the business situation and bring out solutions. Unlike Descriptive, Predictive requires building a model rather than only summarising the data.


Prescriptive is the next phase after Predictive. Rather than stopping at “How much?” or “How many?”, we go further and ask, “How do we make something happen?”. For instance: what can we do to increase our sales? What can we do to innovate the product? What are the possible ways to meet the customer’s needs? This prescriptive type of model falls under the suggestion category. Giving suggestions is not as easy as we might think; a lot of work and analysis has to go into a worthwhile suggestion. Prescriptive is the most important part.


Model or Analysis:

Different types of data call for different types of statistical analysis. In this article, let’s take some notes on the major models only and on when to use them.

We'll use simple statistics to analyse and interpret the results.


Correlation:



Almost everything in the data world is related to something else. Correlation is one way of measuring how strongly one variable is related to another. Correlation analysis helps in finding the relationship between two or more variables, taken a pair at a time. Variables are nothing but factors. The usual (Pearson) correlation applies to quantitative data; for ranked (ordinal) data, a rank-based correlation such as Spearman's can be used instead. The correlation value lies between -1 and +1. The closer the value is to +1, the stronger the positive relationship between the variables, and the closer it is to -1, the stronger the negative relationship. A value of 0 means no linear correlation between the variables/factors.


When one variable increases as the other increases, the correlation is positive (e.g., as sales increase, profit also increases). When one variable increases while the other tends to decrease, the correlation is negative (e.g., as employee idle time increases, production decreases). Correlation also sometimes helps us reduce the number of variables: we can pick the variables that have a strong relationship for further research, and so work only with the variables we really need.
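
To make the -1 to +1 scale concrete, here is a minimal sketch that computes the Pearson correlation for one positive and one negative relationship. All figures and names (sales, profit, idle_hours, units_made, pearson_r) are hypothetical and invented only for illustration.

```python
import numpy as np

# Hypothetical weekly figures, invented purely for illustration.
sales = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
profit = np.array([2.0, 2.5, 3.1, 4.0, 5.2])
idle_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
units_made = np.array([50.0, 46.0, 41.0, 37.0, 30.0])


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    x_dev = x - x.mean()  # deviations from the mean of x
    y_dev = y - y.mean()  # deviations from the mean of y
    denominator = np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
    return float((x_dev * y_dev).sum() / denominator)


r_positive = pearson_r(sales, profit)          # both rise together
r_negative = pearson_r(idle_hours, units_made) # one rises, the other falls

print(f"sales vs profit:          r = {r_positive:.3f}")
print(f"idle hours vs production: r = {r_negative:.3f}")
```

The same numbers can be obtained with `np.corrcoef(x, y)[0, 1]`; the helper is written out only to show what the coefficient measures.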



Regression:


Regression analysis is useful when we need to find how one variable depends on another. The independent variables may be quantitative or qualitative (qualitative ones are encoded as numeric indicators first). How well the model fits is usually summarised by R-squared (the coefficient of determination), which for a least-squares fit with an intercept lies between 0 and 1: 1 means a perfect fit and 0 means the model explains none of the variation. This tells us how well the variation in one variable is explained by the other. Using regression analysis, we can perform predictive modeling.


For example, the quality of an educational institution depends on qualified staff, a good environment, the latest technology, and so on. For such data, Regression analysis is an appropriate model. It tells us how much a dependent variable depends on the independent variables. In our example, the dependent variable is the quality of the institute and the independent variables are employees, environment, technology, etc. So, if you want to know how one variable depends on others, one of the best options is Regression.
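
As a rough sketch of the idea, the snippet below fits a straight line with a single independent variable and reports the 0-to-1 fit score (R-squared). The data and the names ad_spend and sales are hypothetical; a real problem like the institution example would have several independent variables, which this one-variable sketch does not cover.

```python
import numpy as np

# Hypothetical data: advertising spend (independent) and weekly sales (dependent).
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Least-squares straight line: sales ~ slope * ad_spend + intercept
slope, intercept = np.polyfit(ad_spend, sales, deg=1)
predicted = slope * ad_spend + intercept

# R-squared: share of the variation in sales explained by the fitted line
ss_residual = ((sales - predicted) ** 2).sum()
ss_total = ((sales - sales.mean()) ** 2).sum()
r_squared = 1 - ss_residual / ss_total

# Predictive use: expected sales at a spend level not yet observed
forecast_at_7 = slope * 7.0 + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R-squared = {r_squared:.4f}")
print(f"forecast at spend 7.0 = {forecast_at_7:.2f}")
```

With one independent variable, R-squared is simply the square of the correlation coefficient between the two variables, which ties this section back to Correlation.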


Survival Analysis:

If the data contains a time duration and the occurrence of some event, we can go for survival analysis. Survival analysis is most useful for clinical trial data and for product life expectancy (e.g., how long will it take for an expected event to happen?). The most common term in survival analysis is censoring, which is also one of its major problems. Censoring means that the survival time is only partly known (i.e., we know how long a subject was observed, but not when, or whether, the event eventually happened).


Survival analysis needs a binary event variable alongside the time. Considering industrial data, we will have the time period for which a machine has worked. The dataset will contain the time and whether the machine failed or not as an event, denoted as 1 (failed) or 0 (not failed, i.e., censored). In this case, prefer Survival Analysis. The output is a survival probability, which lies between 0 and 1 and is plotted against time. Using the plotted curve, we can interpret the data.
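
One common way to get such a curve is the Kaplan-Meier estimator. The sketch below is my own minimal version, not the only approach, and the machine records and the function name kaplan_meier are hypothetical. It assumes each record is (hours observed, event) with event 1 = failed and 0 = censored.

```python
# Hypothetical machine records: (hours observed, event).
# event 1 = the machine failed; 0 = censored (still running when we stopped watching).
records = [(5, 1), (8, 0), (12, 1), (12, 1), (15, 0), (20, 1)]


def kaplan_meier(records):
    """Return [(time, survival_probability)] at each distinct failure time."""
    curve = []
    survival = 1.0
    failure_times = sorted({t for t, event in records if event == 1})
    for t in failure_times:
        # Machines still under observation just before time t
        at_risk = sum(1 for duration, _ in records if duration >= t)
        # Machines that failed exactly at time t
        failures = sum(1 for duration, event in records if duration == t and event == 1)
        survival *= 1 - failures / at_risk
        curve.append((t, survival))
    return curve


km_curve = kaplan_meier(records)
for t, s in km_curve:
    print(f"after {t:>2} hours: estimated survival probability = {s:.3f}")
```

Censored machines never count as failures, but they still count as "at risk" up to the time they were last observed, which is how the estimator uses their partial information.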


The choice of analysis varies with the data and the business problem, and there are many ways to do an analysis. Statistics makes the analysis easier, but drawing the inference from the output is the most important step and the one that deserves the most attention. I hope this gave you some idea of when and where to use each model.


There are two types of Statistics: Descriptive and Inferential Statistics. 

In Descriptive Statistics, the data is summarised from the given observations. The summary is computed on a sample drawn from the population, using measures such as the mean or the standard deviation.

There are four different categories in Descriptive Statistics. They are:

Measure of frequency

Measure of central tendency

Measure of dispersion

Measure of position

The measure of frequency is based on the number of times a particular value occurs. The mean, median, and mode of the data are measures of central tendency. Skewness is often reported alongside them, but it describes the asymmetry of the distribution rather than its centre.

The measure of dispersion is based on the range, standard deviation, variance, etc. Finally, position is measured using percentiles and quartiles.
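
The four categories can be sketched with Python's standard library alone. The order counts below are hypothetical, and the quartiles use the default method of `statistics.quantiles`; other tools may give slightly different quartile values.

```python
import statistics
from collections import Counter

# Hypothetical daily order counts, invented for illustration.
orders = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

# Measure of frequency: how many times each value occurs
frequency = Counter(orders)

# Measures of central tendency
mean = statistics.mean(orders)
median = statistics.median(orders)
mode = statistics.mode(orders)

# Measures of dispersion (variance and stdev are the sample versions)
data_range = max(orders) - min(orders)
variance = statistics.variance(orders)
std_dev = statistics.stdev(orders)

# Measures of position: the three quartile cut points
q1, q2, q3 = statistics.quantiles(orders, n=4)

print("frequency of 22:", frequency[22])
print("mean, median, mode:", mean, median, mode)
print("range, variance, std dev:", data_range, round(variance, 2), round(std_dev, 2))
print("quartiles:", q1, q2, q3)
```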


Turning to Inferential Statistics: once the data is collected, tabulated, and analyzed, the summary or inference is derived using inferential statistics. The inferences take sampling variation and observational error into account.

Based on the information and conclusions derived from the sample, inferential statistics helps us predict and estimate results for the population.
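
As one small illustration of estimating a population value from a sample, the sketch below builds an approximate 95% confidence interval for a population mean. The sample values are hypothetical, and the interval uses a normal approximation; for a sample this small, a t-based critical value would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample of 10 measurements drawn from a larger population.
sample = [48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 50.1, 49.4, 51.0, 48.8]

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error: how much the sample mean is expected to vary between samples
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Critical value for a two-sided 95% interval under the normal approximation
z = statistics.NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"sample mean = {sample_mean:.2f}")
print(f"approximate 95% interval for the population mean: {ci_low:.2f} to {ci_high:.2f}")
```

The width of the interval reflects the sampling variation mentioned above: a larger sample or less spread in the data gives a narrower interval.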


Finally, I wish to quote the words of Seth Godin – “Data is not useful until it becomes information”

The courses I recommend for Statistical Analysis in Data Science are:


1. Advanced Statistics for Data Science.

2. Statistics for data science and business analysis.

3. Master Data Science with R & Python.

4. Mathematics for machine learning.

5. Become Data Scientist.

You can also learn more about business analysis and Data Science with R & Python from another blog post of mine, "Business Analytics 2021 - Masterclass". Please go through it to get a better idea of Business Analysis.

I hope you all picked up some basic ideas about statistics and statistical analysis for data science. Statistics brings the most out of the data, turning it into useful information for various purposes.

Thank You
