Thursday 7 March 2013

Population and Samples




Statistics revolves around the study of two kinds of datasets: populations and samples.

A population is any complete group with at least one characteristic in common. Populations are not just people: they may consist of, but are not limited to, people, animals, businesses, buildings, motor vehicles, farms, objects or events. The population needs to be clearly identified at the beginning of a study. The study should be based on a clear understanding of who or what is of interest, as well as the type of information required from that population.

A population is a group of phenomena that have something in common. The term often refers to a group of people, as in the following examples:
  • All registered voters in Crawford County
  • All members of the International Machinists Union
  • All Americans who played golf at least once in the past year
But populations can refer to things as well as people:
  • All widgets produced last Tuesday by the Acme Widget Company
  • All daily maximum temperatures in July for major U.S. cities
  • All basal ganglia cells from a particular rhesus monkey
Often, researchers want to know things about populations but do not have data for every person or thing in the population. If a company's customer service division wanted to learn whether its customers were satisfied, it would not be practical (or perhaps even possible) to contact every individual who purchased a product. Instead, the company might select a sample of the population. A sample is a smaller group of members of a population selected to represent the population.

In order to use statistics to learn things about the population, the sample must be random. A random sample is one in which every member of the population has an equal chance of being selected. The most commonly used kind is the simple random sample, which makes a stronger requirement: every possible sample of the chosen size must have an equal chance of being the one selected.

A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Inferential statistics enables you to make an educated guess about a population parameter based on a statistic computed from a sample randomly drawn from that population (see Figure 1).

Figure 1. Illustration of the relationship between samples and populations.

For example, say you want to know the mean income of the subscribers to a particular magazine—a parameter of a population. You draw a random sample of 100 subscribers and determine that their mean income is $27,500 (a statistic). You conclude that the population mean income μ is likely to be close to $27,500 as well. This example is one of statistical inference.
Different symbols are used to denote statistics and parameters, as Table 1 shows.

Table 1. Comparison of Sample Statistics and Population Parameters

                      Sample Statistic    Population Parameter
Mean                  x̄                   μ
Standard deviation    s                   σ
Variance              s²                  σ²
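To make the distinction concrete, here is a minimal Python sketch (not SAS) that computes each quantity from the table for a made-up population and for a random sample of 100 drawn from it. The income figures are simulated purely for illustration and are not real subscriber data. It assumes the usual convention that the sample variance s² divides by n - 1 while the population variance σ² divides by N.

```python
import random
import statistics

# Hypothetical population: 5,000 simulated incomes (illustration only).
rng = random.Random(12345)
population = [rng.gauss(27500, 6000) for _ in range(5000)]

# Parameters: computed from every member of the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)      # divides by N
sigma2 = statistics.pvariance(population)  # divides by N

# Statistics: computed from a random sample of 100 members.
sample = rng.sample(population, 100)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)               # divides by n - 1
s2 = statistics.variance(sample)           # divides by n - 1

print(f"mean      sample {x_bar:14.1f}   population {mu:14.1f}")
print(f"std dev   sample {s:14.1f}   population {sigma:14.1f}")
print(f"variance  sample {s2:14.1f}   population {sigma2:14.1f}")
```

The sample values should land near the population values without matching them exactly, which is the gap that inferential statistics tries to quantify.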

Probability Sampling Techniques

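Probability sampling is a sampling technique in which the sample is gathered by a random process that gives every individual in the population a known, non-zero chance of being selected. In the simplest designs, such as the simple random sample, those chances are equal for all individuals.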

Simple Random Sample

The simple random sample is the basic sampling method assumed in statistical methods and computations. To collect a simple random sample, each unit of the target population is assigned a number. A set of random numbers is then generated and the units having those numbers are included in the sample.
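The numbering procedure described above can be sketched in a few lines of Python (not SAS). The function name `simple_random_sample` and the list of 25 student labels are hypothetical, chosen only to illustrate the steps.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Select n distinct units by drawing n distinct random unit numbers."""
    rng = random.Random(seed)
    # Step 1: each unit is numbered by its position, 0 .. N-1.
    # Step 2: generate n distinct random numbers in that range.
    chosen_numbers = rng.sample(range(len(units)), n)
    # Step 3: include the units that carry those numbers.
    return [units[i] for i in sorted(chosen_numbers)]

# Hypothetical target population of 25 units.
students = [f"student_{i:02d}" for i in range(1, 26)]
sample = simple_random_sample(students, 10, seed=12345)
print(sample)
```

Fixing the seed makes the selection reproducible, which plays the same role as the seed option in the SAS examples in this post.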

Random Sampling with Replacement

Sampling is called with replacement when a unit selected at random from the population is returned to the population before the next unit is selected at random. Whenever a unit is selected, the population contains all the same units, so a unit may be selected more than once. The size of the population does not change at any stage. As a result, a sample of any size can be drawn from a population of any size; the sample size n may even exceed the population size N.

The number of possible ordered samples of size n drawn with replacement from a population of size N is N^n (N to the power n).
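As a quick check of that count, the following Python sketch (not SAS) enumerates every ordered sample for a tiny hypothetical population of N = 3 units and n = 2 draws, then draws one sample with replacement. The unit labels are made up for illustration.

```python
import itertools
import random

population = ["A", "B", "C"]   # hypothetical population, N = 3
n = 2

# Every ordered sample of size n drawn with replacement.
all_samples = list(itertools.product(population, repeat=n))
print(len(all_samples), len(population) ** n)   # both are 9

# One random draw with replacement: units can repeat, and the
# number of draws (5 here) may exceed the population size.
rng = random.Random(12345)
draw = rng.choices(population, k=5)
print(draw)
```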

proc surveyselect data = hsb25 method = URS rep = 1
                         sampsize = 10 seed = 12345 out = hsbs1;
  id _all_; /* includes all columns in the random sample */
run;

where,
URS - unrestricted random sampling (selection with replacement)
sampsize - size of random sample

Random Sampling without Replacement

Sampling is called without replacement when a unit is selected at random from the population and is not returned to the main lot. The first unit is selected out of a population of size N, the second unit out of the remaining N-1 units, and so on, so the pool of available units shrinks by one with each draw. The sample size n cannot exceed the population size N. A unit once selected cannot be repeated in the same sample, so all the units of the sample are distinct from one another. The number of possible samples can be counted using permutations or combinations: if the order of selection matters there are N!/(N-n)! samples, and if it does not there are N!/(n!(N-n)!) samples.
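A small Python sketch (not SAS) can illustrate both ways of counting. It uses a hypothetical population of N = 4 labelled units and samples of size n = 2, and then draws one sample without replacement.

```python
import itertools
import math
import random

population = ["A", "B", "C", "D"]   # hypothetical population, N = 4
n = 2

# Ordered samples without replacement (permutations): N! / (N - n)!
ordered = list(itertools.permutations(population, n))
print(len(ordered), math.perm(len(population), n))     # 12 12

# Unordered samples without replacement (combinations): N! / (n! (N - n)!)
unordered = list(itertools.combinations(population, n))
print(len(unordered), math.comb(len(population), n))   # 6 6

# One random draw without replacement: all selected units are distinct.
rng = random.Random(12345)
draw = rng.sample(population, n)
print(draw)

# Asking for more units than the population holds is an error.
try:
    rng.sample(population, len(population) + 1)
except ValueError as err:
    print("cannot draw n > N:", err)
```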

proc surveyselect data = hsb25 method = SRS rep = 1
                         sampsize = 10 seed = 12345 out = hsbs1;
  id col1 col2 col3 ...col x; /* includes only selected columns in the sample */
run;

where,
SRS - simple random sampling (selection without replacement)


Systematic Sample

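In a systematic sample, the elements of the population are put into a list and then every kth element in the list is chosen (systematically) for inclusion in the sample. The interval k is usually taken as the population size divided by the desired sample size, and the starting point is chosen at random from the first k elements.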
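A minimal Python sketch (not SAS) of this idea follows. It assumes the interval is k = N // n and that the starting position is picked at random within the first k elements; the function name `systematic_sample` and the list of 25 numbered units are hypothetical.

```python
import random

def systematic_sample(units, n, seed=None):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(units) // n                        # sampling interval
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    start = random.Random(seed).randrange(k)   # random start in 0 .. k-1
    return units[start::k][:n]

# Hypothetical list of 25 numbered units; a sample of 5 gives k = 5.
units = list(range(1, 26))
sample = systematic_sample(units, 5, seed=12345)
print(sample)
```

Because only the starting point is random, the selected units are always exactly k positions apart in the list.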

Stratified Sample

A stratified sample is a sampling technique in which the researcher divides the entire target population into different subgroups, or strata, and then randomly selects the final subjects proportionally from the different strata.
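The proportional selection described above can be sketched in Python (not SAS) as follows. This assumes proportional allocation, meaning the same sampling fraction is applied within every stratum. The function name `stratified_sample`, the stratum labels and the stratum sizes are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the same fraction from each stratum."""
    rng = random.Random(seed)
    # Divide the population into strata.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Sample within each stratum, without replacement.
    sample = []
    for name in sorted(strata):
        members = strata[name]
        n_h = round(len(members) * fraction)   # proportional allocation
        sample.extend(rng.sample(members, n_h))
    return sample

# Hypothetical population: 60 "low", 30 "middle" and 10 "high" units.
units = ([("low", i) for i in range(60)]
         + [("middle", i) for i in range(30)]
         + [("high", i) for i in range(10)])

sample = stratified_sample(units, lambda u: u[0], 0.2, seed=12345)
counts = {name: sum(1 for u in sample if u[0] == name)
          for name in ("low", "middle", "high")}
print(counts)   # {'low': 12, 'middle': 6, 'high': 2}
```

Each stratum contributes to the sample in the same 60:30:10 proportion that it holds in the population.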

References:

http://www.cliffsnotes.com/study_guide/Populations-Samples-Parameters-and-Statistics.topicArticleId-267532,articleId-267478.html