Thursday 29 November 2012

What is Data?



Data are measurements or observations that are collected as a source of information. They can be numbers, words, measurements, observations, or just descriptions of things. Examples: 36 degrees, cold, number of hospital beds, height, weight, age, level of severity of disease.

A variable is a characteristic or attribute that describes a person, place, thing, or idea and can assume different values. The value of the variable can "vary" from one entity to another.
Example - temperature of a room, a person's hair color is a potential variable, which could have the value of "blond" for one person and "brunette" for another




Based on the level of measurement, we can distinguish between two kinds of variable:
Quantitative Variable
A quantitative variable is one in which the variates differ in magnitude, e.g. income, age, etc.
Qualitative/Categorical Variable
A qualitative variable is one in which the variates differ in kind rather than in magnitude, e.g. marital status, gender, nationality, etc.
Qualitative Data
Overview:
  • Deals with descriptions.
  • The variates differ in kind rather than magnitude.
  • Data can be observed but not measured.
  • Colors, textures, smells, tastes, appearance, type, etc.
  • Qualitative → Quality

Quantitative Data
Overview:
  • Deals with numbers.
  • The variates differ in magnitude.
  • Data which can be measured.
  • Length, height, area, volume, weight, speed, time, temperature, humidity, sound levels, cost, members, ages, etc.
  • Quantitative → Quantity

Example :
Latte
 Qualitative data:
  • robust aroma
  • frothy appearance
  • strong taste
  • burgundy cup

Example :
Latte
Quantitative data:
  • 12 ounces of latte
  • serving temperature 150 °F
  • serving cup 7 inches in height
  • cost $4.95
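
The latte example can also be sketched in code. The snippet below uses Python purely for illustration (the post itself does not use it), and the names `latte` and `split_by_kind` are hypothetical. It assumes that a value stored as a number is quantitative and anything else is qualitative, which is a simplification.

```python
# Observations about one latte, taken from the example above.
latte = {
    "aroma": "robust",
    "appearance": "frothy",
    "taste": "strong",
    "cup_color": "burgundy",
    "volume_oz": 12,
    "serving_temp_f": 150,
    "cup_height_in": 7,
    "cost_usd": 4.95,
}


def split_by_kind(observations):
    """Return (qualitative, quantitative) dicts based on each value's type."""
    qualitative, quantitative = {}, {}
    for name, value in observations.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            quantitative[name] = value
        else:
            qualitative[name] = value
    return qualitative, quantitative


qualitative, quantitative = split_by_kind(latte)
print("Qualitative:", sorted(qualitative))
print("Quantitative:", sorted(quantitative))
```

The type check is only a rough guide: a numeric code such as a postal code is stored as a number but still differs in kind rather than magnitude.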
 
 
Based on their role in a statistical model, there are two kinds of variable:
 
Response Variable
The outcome of a study. A variable you would be interested in predicting or forecasting. Often called a dependent variable, because its value depends on the predictor variable.
Explanatory Variable
Any variable that explains the response variable. Often called an independent variable or predictor variable.

Based on the number of variables in a study, we have the following types of data:

Univariate Data
Involves a single variable and does not deal with causes or relationships. The major purpose of univariate analysis is to describe the data.

  • Central tendency - mean, mode, median
  • Dispersion - range, variance, max, min, quartiles, standard deviation
  • Frequency distribution
  • Bar graph, histogram, pie chart, line graph, box-and-whiskers plot
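
The univariate measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch: the `ages` values are made up for illustration, and `statistics.quantiles` assumes Python 3.8 or later.

```python
import statistics

# Hypothetical ages of seven people (a single variable).
ages = [23, 25, 25, 29, 31, 34, 40]

# Central tendency
mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)

# Dispersion
data_range = max(ages) - min(ages)
variance = statistics.variance(ages)         # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)             # square root of the sample variance
quartiles = statistics.quantiles(ages, n=4)  # three cut points: Q1, Q2, Q3

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
print("quartiles:", quartiles)
```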

Bivariate Data

Involves two variables and deals with causes and relationships. The major purpose of bivariate analysis is to explain.

  • Analysis of two variables simultaneously
  • Correlation, comparisons, causes, relationships, explanations
  • Tables where one variable is contingent on the values of the other variable
  • Independent and dependent variables
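
As an illustration of analysing two variables together, the sketch below computes the Pearson correlation coefficient by hand in Python. The function name `pearson_r` and the study-hours data are hypothetical, chosen only to show the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data: explanatory variable and response variable.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 70, 78]

r = pearson_r(hours_studied, exam_score)
print("correlation:", round(r, 3))
```

A value of r close to 1 or -1 indicates a strong linear relationship. Correlation on its own does not establish that one variable causes the other.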
 

Data Types 
A data type, or simply type, is a classification identifying one of various types of data. It determines the possible values for that type, the operations that can be done on values of that type, the meaning of the data, and the way values of that type can be stored.

Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology. Common data types may include:
  • Integers
  • Booleans
  • Characters
  • Floating-point numbers
  • Alphanumeric strings
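
To make the list above concrete, here is how those common types look in Python, used here only as an example language. The labels and sample values are made up. Note that Python has no separate character type: a single character is simply a string of length one.

```python
# One sample value for each of the common data types listed above.
values = {
    "integer": 42,
    "boolean": True,
    "character": "A",              # Python represents this as a 1-character str
    "floating-point number": 3.14,
    "alphanumeric string": "Room 101",
}

# Ask Python which built-in type it assigned to each value.
type_names = {label: type(value).__name__ for label, value in values.items()}

for label, name in type_names.items():
    print(f"{label}: {name}")
```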
Basic Data Types in R

Every individual data value has a data type that tells us what sort of value it is. The most common data types are NUMBERS, which R calls numeric values, and TEXT, which R calls character values. R also has LOGICAL values.



SAS Data Types

Internally, SAS supports two data types for storing data:
  • CHAR (fixed-length character data, 32,767-character maximum)
  • NUM (double-precision floating-point number)
Note: If the data field is longer than 254 characters, the SAS ODBC Driver processes it as the ODBC data type SQL_VARCHAR.
By using SAS format information, the SAS ODBC Driver is able to represent other ODBC data types, both when responding to queries and in CREATE TABLE statements.

CREATE TABLE Data Type Name    ODBC Data Type    SAS Data Type
char(w)                        SQL_CHAR          CHAR(w)
num(w, d)                      SQL_DOUBLE        NUM
num(w, d)                      SQL_FLOAT         NUM
integer                        SQL_INTEGER       NUM FORMAT=11.0
date9x                         SQL_DATE          NUM FORMAT=DATE9X.
datetime19x                    SQL_TIMESTAMP     NUM FORMAT=DATETIME19X.
time8x                         SQL_TIME          NUM FORMAT=TIME8X.


JMP Data Types

Every field in a JMP file has a name and a data type. The data type indicates how much physical storage to set aside for the field and the format in which the data is stored.
CHARACTER
specifies a field for character string data. The maximum length is 255 characters. Characters can be letters, digits, spaces, or special characters.
META
specifies how metadata contained in the specified data set is processed.
NUMERIC
specifies an 8-byte floating point number. This is also called a double precision number. When you are reading data, this maps directly to the SAS double precision number. When you are writing data, all SAS numeric variables (regardless of length) become JMP numeric variables.
ROWSTATE
specifies an integer variable that takes on the value of 1 or missing. When you are reading data, this maps to a SAS double precision number.
DATE
specifies the date format. When you are reading data, the date values are mapped to a SAS number and scaled to the base date. The JMP date display format maps to the appropriate SAS date display format. When you are writing data, the SAS output format for the numeric variable is checked to determine whether it is a date format. If so, the SAS numeric value is scaled to a JMP date value with the appropriate date display format.
DATETIME
specifies the datetime format. When you are reading data, the datetime values are mapped to a SAS number and scaled to the base datetime. The JMP datetime display format maps to the appropriate SAS datetime display format. When you are writing data, the SAS output format for the numeric variable is checked to determine whether it is a datetime format. If so, the SAS numeric value is scaled to a JMP datetime value with the appropriate datetime display format.
TIME
specifies the time format. When you are reading data, the time values are mapped to a SAS number and scaled to the base time. The JMP time display format maps to the appropriate SAS time display format. When you are writing data, the SAS output format for the numeric variable is checked to determine whether it is a time format. If so, the SAS numeric value is scaled to a JMP time value with the appropriate time display format.


Data Types in STATA

Each element of data is said to be either type string or numeric. Associated with each data type is a storage type.
STRING: Strings are stored as str1, str2, str3, and so on, where the number gives the maximum length of the string.
NUMBER: Numbers are stored as byte, int, long, float, or double, with the default being float. byte, int, and long are said to be of integer type in that they can hold only integers.

 

Data Structures

A data structure is an actual implementation of a particular abstract data type. Different languages approach the construction of data structures differently.

Data structure refers to the way data is organized and manipulated. It seeks to find ways to make data access more efficient. When dealing with data structures, we focus not only on one piece of data, but on different sets of data and how they can relate to one another in an organized manner.

Examples: Arrays, Lists, Iterators, Stacks & Queues, Binary trees, Min & Max Heaps, Graphs, Hash Tables, Sets and Tradeoffs.
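
Three of the structures named above (a stack, a queue, and a hash table) can be sketched with Python's built-in types. This is only one possible illustration, and the item names are made up.

```python
from collections import deque

# Stack: last in, first out. A plain list works well for this.
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
last_in = stack.pop()        # removes the most recently added item

# Queue: first in, first out. deque gives fast removal from the left end.
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first_in = queue.popleft()   # removes the oldest item

# Hash table: Python's dict maps keys to values with fast lookup by key.
stock = {"apple": 3, "pear": 5}
stock["plum"] = 7

print(last_in, first_in, stock["pear"])
```

The same data ("a", "b", "c") comes back in a different order depending on the structure, which is the kind of tradeoff a data structure choice involves.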

Basic data structures in R

Vectors 
A collection of values that all have the same data type. The elements of a vector are all numbers, giving a numeric vector, or all character values, giving a character vector.
A vector can be used to represent a single variable in a data set.


Factors  
A collection of values that all come from a fixed set of possible values. A factor is similar to a vector, except that the values within a factor are limited to a fixed set of possible values.
A factor can be used to represent a categorical variable in a data set.
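
The idea behind a factor, values restricted to a fixed set of levels, can be sketched outside R as well. The Python snippet below is only an analogy and not how R implements factors; `make_factor` is a hypothetical helper, and its codes start at 0 whereas R's internal codes start at 1.

```python
def make_factor(values, levels):
    """Encode values as integer codes into a fixed list of allowed levels."""
    index = {level: code for code, level in enumerate(levels)}
    unknown = [v for v in values if v not in index]
    if unknown:
        # A factor only accepts values from its fixed set of levels.
        raise ValueError(f"values not in levels: {unknown}")
    return {"levels": list(levels), "codes": [index[v] for v in values]}


marital_status = make_factor(
    ["single", "married", "single", "widowed"],
    levels=["single", "married", "widowed"],
)
print(marital_status["codes"])   # position of each value in the levels list
```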
Matrices  
A two-dimensional collection of values that all have the same type. The values are arranged in rows and columns.
There is also an array data structure that extends this idea to more than two dimensions.


Data frames
 
A collection of vectors that all have the same length. This is like a matrix, except that each column can contain a different data type.
A data frame can be used to represent an entire data set.
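
As a rough analogy in Python (not R's actual implementation), a data frame can be pictured as a dictionary of equal-length columns, where each column has its own type. The helpers `make_table` and `row`, and the patient data, are hypothetical.

```python
def make_table(columns):
    """Build a column-oriented table, checking all columns have equal length."""
    lengths = {len(column) for column in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return dict(columns)


def row(table, i):
    """Return row i as a dict of column name -> value."""
    return {name: column[i] for name, column in table.items()}


# Each column holds one variable; the types differ between columns.
patients = make_table({
    "name": ["Ann", "Bob", "Cy"],     # character values
    "age": [34, 51, 29],              # numeric values
    "smoker": [False, True, False],   # logical values
})

second_patient = row(patients, 1)
print(second_patient)
```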


Lists
 
A collection of data structures. The components of a list can be simply vectors--similar to a data frame, but with each column allowed to have a different length. However, a list can also be a much more complicated structure.
This is a very flexible data structure. Lists can be used to store any combination of data values together.






Friday 23 November 2012

Statistical Methods - An Overview





Statistics is the branch of mathematics concerned with the collection, classification, analysis, and interpretation of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability). Statistics can interpret aggregates of data too large to be intelligible by ordinary observation because such data, unlike individual quantities, tend to behave in a regular, predictable manner. It is subdivided into descriptive statistics and inferential statistics. Statistical procedures can also be divided into two major categories: applied statistics and theoretical statistics.

Applied statistics comprises both descriptive statistics and the application of inferential statistics (a.k.a. predictive statistics).
Theoretical statistics concerns the logical arguments underlying the justification of approaches to statistical inference, as well as encompassing mathematical statistics.

Before going into the details we must be familiar with two important concepts: population and sample. A population is the total set of individuals, groups, objects, or events that the researcher is studying. A sample is a relatively small subset of people, objects, groups, or events that is selected from the population. In short, a subset of the population is called a sample. It is a portion of the population, a slice of it. A sample that is drawn randomly and is large enough tends to reflect the characteristics of the population it comes from.
Example: Suppose you are cooking a pot of soup (the population) and you take a spoonful (the sample) to see how it tastes. Although you didn't eat the entire pot of soup, you have a general idea of how it tastes.
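
The soup analogy can be tried out numerically. The Python sketch below builds a made-up population of 10,000 heights, draws a random sample of 200, and compares the two means. All numbers, including the seed, are arbitrary assumptions for illustration.

```python
import random
import statistics

# A fixed seed makes the run repeatable.
rng = random.Random(2012)

# Hypothetical population: 10,000 heights in cm, centred on 170.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# The "spoonful": a simple random sample drawn without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
```

The sample mean will usually land close to the population mean, but not exactly on it. That gap is the random variation that inferential statistics has to account for.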

Descriptive Statistics

Descriptive statistics includes statistical procedures that we use to describe the population we are studying. The data could be collected from either a sample or a population, but the results help us organize and describe data. Descriptive statistics can only be used to describe the group that is being studied. That is, the results cannot be generalized to any larger group. This is the statistical method that is used for summarizing or describing a collection of data.
Examples: Frequency distributions, measures of central tendency (mean, median, mode), and graphs such as pie charts and bar charts that describe the data.
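
A frequency distribution, the first example above, simply counts how often each value occurs. Here is a small Python sketch using `collections.Counter`; the blood group data are made up.

```python
from collections import Counter

# Hypothetical blood groups recorded for eight people.
blood_groups = ["A", "O", "B", "O", "A", "O", "AB", "O"]

# Absolute frequency: how many times each group occurs.
frequency = Counter(blood_groups)

# Relative frequency: share of the total for each group.
relative = {group: count / len(blood_groups) for group, count in frequency.items()}

for group, count in frequency.most_common():
    print(f"{group}: {count} ({relative[group]:.0%})")
```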
Inferential Statistics
This is the branch of statistics that is used to make inferences or predictions about the characteristics of a population based on analysis and observation of sample data. That is, we can take the results of an analysis of a sample and can generalize it to the larger population that the sample represents. In order to do this, however, it is imperative that the sample is representative of the group to which it is being generalized.

To address this issue of generalization, we have tests of significance. A chi-square test or t-test, for example, helps us judge how likely it is that a result seen in the sample arose from random variation alone, rather than reflecting a real effect in the population that the sample represents.
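
As a small illustration of a test of significance, the Python sketch below computes a one-sample t statistic, t = (sample mean - hypothesized mean) / (s / sqrt(n)). The function name `one_sample_t` and the fill-volume data are hypothetical. Turning t into a probability requires the t distribution, which is not in Python's standard library, so the sketch stops at the statistic.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """t statistic for testing whether the population mean equals a given value."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - hypothesized_mean) / standard_error


# Hypothetical measured volumes (ounces) of five lattes sold as 12 oz.
fill_volumes_oz = [12.1, 11.9, 12.0, 12.2, 11.8]

t_stat = one_sample_t(fill_volumes_oz, 12.0)
print("t statistic:", round(t_stat, 3))
```

A t statistic near zero means the sample mean is close to the hypothesized value relative to the sampling noise. A large absolute value suggests the difference is unlikely to be due to chance alone.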

Examples: Linear Regression Analysis, Logistic Regression Analysis, ANOVA, Correlation Analysis, Structural Equation Modelling and Survival Analysis to name a few.

Inference is a vital element of scientific advance, since it provides a means for drawing conclusions from data that are subject to random variation. To prove the propositions being investigated further, the conclusions are tested as well, as part of the scientific method.

Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments


Wednesday 21 November 2012

Application of Statistics in Everyday Life


Statistics are sets of mathematical equations that are used to analyze what is happening in the world around us. You've heard that today we live in the Information Age where we understand a great deal about the world around us. Much of this information was determined mathematically by using statistics. When used correctly, statistics tell us any trends in what happened in the past and can be useful in predicting what may happen in the future.

1. Weather Forecasts
Do you watch the weather forecast sometime during the day? How do you use the information? Have you ever heard the forecaster talk about weather models? These computer models are built using statistics that compare prior weather conditions with current weather to predict future weather.

2. Emergency Preparedness
What happens if the forecast indicates that a hurricane is imminent or that tornadoes are likely to occur? Emergency management agencies move into high gear to be ready to rescue people. Emergency teams rely on statistics to tell them when danger may occur.
3. Predicting Disease
News reports often include statistics about a disease. If the reporter simply reports the number of people who either have the disease or who have died from it, it's an interesting fact but it might not mean much to your life. But when statistics become involved, you have a better idea of how that disease may affect you.
For example, studies have shown that 85 to 95 percent of lung cancers are smoking related. The statistic should tell you that almost all lung cancers are related to smoking and that if you want to have a good chance of avoiding lung cancer, you shouldn't smoke.

4. Medical Studies
Scientists must show a statistically valid rate of effectiveness before any drug can be prescribed. Statistics are behind every medical study you hear about.
5. Genetics
Many people are afflicted with diseases that come from their genetic make-up and these diseases can potentially be passed on to their children. Statistics are critical in determining the chances of a new baby being affected by the disease.

6. Political Campaigns
Whenever there's an election, the news organizations consult their models when they try to predict who the winner is. Candidates consult voter polls to determine where and how they campaign. Statistics play a part in who your elected government officials will be.
7. Insurance
You know that in order to drive your car you are required by law to have car insurance. If you have a mortgage on your house, you must have it insured as well. The rate that an insurance company charges you is based upon statistics from all drivers or homeowners in your area.
8. Consumer Goods
Wal-Mart, a worldwide leading retailer, keeps track of everything it sells and uses statistics to calculate what to ship to each store and when. From analyzing its vast store of information, for example, Wal-Mart decided that people buy strawberry Pop Tarts when a hurricane is predicted in Florida! So it ships this product to Florida stores based upon the weather forecast.
9. Quality Testing
Companies make thousands of products every day and each company must make sure that a good quality item is sold. But a company can't test each and every item that they ship to you, the consumer. So the company uses statistics to test just a few, called a sample, of what they make. If the sample passes quality tests, then the company assumes that all the items made in the group, called a batch, are good.
10. Stock Market
Another topic that you hear a lot about in the news is the stock market. Stock analysts also use statistical computer models to forecast what is happening in the economy.

Bibliography:
http://www.mathworksheetscenter.com/mathtips/statsareimportant.html