Published on Nov 17. Categorical data analysis using SAS, third edition (PDF).

DeannBertrand. Published in: Science.

There are two categories under the sex variable: male and female. The frequencies associated with each category, that is, for males and 80 for females, are categorical data.

Examples of categorical data or categorical variables abound around you: the enrollment figure of students in an art class over several semesters, tabulation of students' academic standings, the proportion of calories you consume daily in each food category, the presence or absence of HIV in teenagers of various ethnic backgrounds, and so on. In general, categorical data convey frequency information, proportion information, or the presence or absence of a particular outcome in an observation.

According to Stevens' theory of measurement, categorical variables are measured primarily at the nominal level. Occasionally, variables measured at the ordinal level can also be regarded as categorical variables. For example, in a statistics class, students' grades are assigned according to the normal curve.

The grading scale, from A (highest) to F (lowest), definitely conveys an ordinal type of information. It can also be treated as a nominal-level categorical variable for which the frequency of students receiving one of these five grades is tabulated and analyzed.

PROC FREQ is useful for tabulating frequencies of occurrences in each category, while simultaneously converting frequencies into proportions. A one-way table refers to a display of frequencies based on a single categorical variable. A two-way table is a tabulation of joint frequencies of two variables. Usually, a two-way table uses one dimension, such as columns, to represent one variable and another dimension, such as rows, to represent the second variable.

Likewise, a multi-way table involves three or more categorical variables. A multi-way table is presented on the output as a two-way table for which columns and rows are categories of the last two variables, while the first variable (or the second, …, and so on) is fixed at a level or a category.

For two-way and multi-way tables, PROC FREQ can be asked to compute a series of indices measuring the degree of association between two or more categorical variables. It can also test these indices to determine whether the null hypothesis of no association is tenable. The kappa index of agreement and McNemar's test are also demonstrated for 2x2 tables.

The data set analyzed in the examples was retrieved for a fixed range of years.

Ethnic background of students was also included.


Raw data were grouped data; hence, each record is a summary of information on a course by semester, year, grade, ethnic group, and its frequency. A series of analyses will be performed on the data to reveal facets of the grade distributions that had caused concern; more will be said about this concern later.

The first attempt in understanding this data set is to tabulate the grade distribution across all courses over the whole period. The accompanying goodness-of-fit test statistic is called the Pearson chi-square. Values of the freq variable are the frequency of occurrences of each data record.

These frequencies act as weights in determining the total sample size and the size of each category. The resulting output reveals six categories of grades, as expected. The majority of students received a B. Let's suppose that, in addition to grade, you are also interested in the distribution of race and enrollments in different courses.
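The role of the freq variable as a weight can be illustrated outside SAS. The following is a minimal Python sketch, not the book's program: the records and counts are invented for illustration, and the helper name `one_way_table` is hypothetical. Each grouped record contributes its frequency, rather than 1, to its category and to the total sample size.

```python
from collections import defaultdict

# Hypothetical grouped records: (grade, freq). These counts are invented
# for illustration and are NOT taken from the data set in the text.
records = [("A", 15), ("B", 40), ("A", 5), ("C", 30), ("B", 10)]


def one_way_table(rows):
    """Sum the freq weights per category and convert them to percentages."""
    counts = defaultdict(int)
    for category, freq in rows:
        counts[category] += freq  # each record counts freq times, not once
    total = sum(counts.values())  # weighted total sample size
    return {cat: (n, 100.0 * n / total) for cat, n in sorted(counts.items())}


table = one_way_table(records)
for grade, (n, pct) in table.items():
    print(f"{grade}: frequency={n}, percent={pct:.1f}")
```

Without the weights, the five records above would be counted as five observations instead of 100.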

You may modify the previous example accordingly. Do you want to know the reason for analyzing these data? The Affirmative Action Office of this state university was concerned with grades awarded to students of various racial backgrounds, especially minorities. To shed light on this concern, it was decided that a joint distribution based on both grade and race needed to be examined carefully.

This two-way table could be constructed by specifying both variables on the TABLES statement, with an asterisk connecting both (grade*race). What if grade distributions were also needed across different courses? This calls for a second two-way table of grade by course. Because the variable grade is specified in both two-way tables, you may combine two TABLES statements into one. The resulting output spreads across two pages, each corresponding to one two-way table.

The first two-way table is based on the grade distribution by race categories. In this table, six categories of grade are rows, whereas four categories of race are columns. The second two-way table is based on grade by course. Inside each cell, four numbers are shown. The first number is the cell frequency, for example, 15 for African American students who received an A. However, this number 15 does not necessarily represent 15 distinct individuals because some students took more than one course during the period of this study and could be awarded an A more than once.

The second number is the cell percentage, that is, the cell frequency divided by the total sample size. The total sample size is printed at the lower right corner of the table. The third number is the row percentage, and the fourth number is the column percentage. The column percentage describes the proportion of African Americans receiving an A out of the total number of African Americans. The numbers printed outside the table are totals: row totals appear to the right of the table and column totals beneath it. The second number directly beneath each of these totals is its percentage.
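The four numbers in a cell are simple ratios of the cell frequency to different totals. Here is a small Python sketch of that arithmetic, assuming an invented 2x2 table; the group labels, the counts, and the helper `cell_summary` are all hypothetical and do not reproduce the output discussed in the text.

```python
# Hypothetical counts keyed by (grade, race); invented for illustration only.
counts = {
    ("A", "group1"): 15, ("A", "group2"): 35,
    ("B", "group1"): 25, ("B", "group2"): 25,
}


def cell_summary(counts, row, col):
    """Return the four numbers shown in one cell of a two-way table."""
    total = sum(counts.values())
    row_total = sum(n for (r, _), n in counts.items() if r == row)
    col_total = sum(n for (_, c), n in counts.items() if c == col)
    n = counts[(row, col)]
    return {
        "frequency": n,
        "percent": 100.0 * n / total,      # share of the whole sample
        "row_pct": 100.0 * n / row_total,  # share within the row
        "col_pct": 100.0 * n / col_total,  # share within the column
    }


summary = cell_summary(counts, "A", "group1")
print(summary)
```

In this made-up table, the column percentage answers the question "of all students in group1, what share received an A?", which is the reading the text gives to the fourth number.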

Now let's get into the nitty-gritty of the data in light of the concern expressed by the Affirmative Action Office. To what extent is the grade distribution of each racial group the same across different courses?


To answer this question, we need to further investigate the breakdown of grade distributions based on the variables grade, race, and course, that is, a three-way frequency table. The syntax of a two-way table is easily extended to a three-way table by inserting one more asterisk to link these three variables together.
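A three-way table is, in effect, a stack of two-way tables, one for each level of the first variable. The Python sketch below illustrates that idea only; the course labels, group labels, counts, and the helper `stratified_tables` are hypothetical and are not drawn from the data set in the text.

```python
from collections import defaultdict

# Hypothetical grouped records: (course, grade, race, freq). Invented values.
records = [
    ("course1", "A", "group1", 4), ("course1", "B", "group2", 6),
    ("course2", "A", "group1", 3), ("course2", "A", "group2", 7),
    ("course1", "A", "group1", 1),
]


def stratified_tables(rows):
    """Build one weighted grade-by-race table per course."""
    tables = defaultdict(lambda: defaultdict(int))
    for course, grade, race, freq in rows:
        tables[course][(grade, race)] += freq
    return {course: dict(cells) for course, cells in tables.items()}


tables = stratified_tables(records)
for course in sorted(tables):
    print(course, tables[course])
```

Each inner dictionary is a subset of the original data restricted to one course, so everything that applies to a single two-way table applies to each of them.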


Since only two-dimensional tables are printed on each page, PROC FREQ prepares the three-way table by a series of two-way tables, each corresponding to one category of the first variable (i.e., course). For each course, a two-way table based on grade and race is presented. Consequently, a total of 3 two-way tables are displayed, one on each page.

Let's now turn our attention to the second two-way table, based on the course X. You would probably have noticed that the layout is strikingly similar to the first two-way table obtained in the earlier output. Both two-way tables display grade profiles for four ethnic groups of students.

The only difference is that the current two-way table draws data only from students enrolled in X, whereas the earlier table draws data from all courses. You can apply what you already know about the two-way table to this one and the other two also. Remember, each two-way table on this output is associated with a particular course; thus, it constitutes a subset of the original data. This presentation format does not reveal the complex relationship among variables. To succinctly reveal the relationship among grade, race, and course in a 3-D display, we simplify the original data set.

The dichotomous coding of race2 and grade2 made sense to the Affirmative Action Office because it was concerned with a seemingly higher rate of failing X and X courses by African American and Hispanic students, but not by Asian American or White students. The SPARSE option requests that all possible combinations of year, course, race2, and grade2 be presented even if a combination does not correspond to an occurrence. Aren't you impressed by the visual effect of these six block charts?

As a matter of fact, they portray the reason why the Affirmative Action Office was contacted in the first place. It was suspected that in X and X, African American and Hispanic students were unfairly graded and their chances of passing these courses were not as good as those of Asian Americans or Whites. According to the pf data set, one African American student enrolled in X and two in X in a single year. The suspected trend is not observed in X, contrary to the suspicion of the Affirmative Action Office of this university. These phenomena are investigated further in later examples.

The next example takes the previous six block charts to a higher ground: the data will be subject to statistical tests to determine if the grade distribution is similar across racial groups.

To carry out the tests, data are simplified as before into the two dichotomous variables race2 and grade2. The statistical analyses performed on the two new variables include (1) a chi-square test of independent relationship between these two variables and (2) descriptions of any relationship between them. The chi-square test is particularly suitable for examining whether the grades assigned to the students were dependent on their race. This test is requested by the option CHISQ and is defined identically as the chi-square test of goodness-of-fit discussed earlier. The test statistic follows a chi-square distribution only approximately; the approximation gets better with increasing sample sizes.

Exactly how large is large enough remains a debatable issue. We will deal with the issue of small samples and small cell sizes in a later example. The description of any relationship, if it exists, is given by the option MEASURES which, as the name implies, provides multiple indices of the strength of the relationship between race2 and grade2.

The output comes in several parts. Part B presents statistical test results to help you determine if there is sufficient evidence to refute the null hypothesis of statistical independence between race2 and grade2. As it turns out, there are more results than you bargained for.

The first test in Part B is the Pearson chi-square test, with 1 degree of freedom. The heading Prob means p level or significance level, as they are often referred to in statistics textbooks. The second test provided in Part B is called Likelihood Ratio Chi-Square, and it also has 1 degree of freedom and a significant p level. The likelihood ratio chi-square test is based on the natural logarithm of observed frequency over expected frequency. The small p level of this test once again provides evidence to reject the null hypothesis of an independent relationship.
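Both chi-square statistics compare observed cell counts with the counts expected under independence. The following Python sketch works through that arithmetic for a 2x2 table, assuming invented counts rather than the values in the output; the function names are hypothetical. The p level uses the fact that a chi-square variate with 1 degree of freedom is the square of a standard normal variate.

```python
import math


def chi_square_2x2(table):
    """Pearson and likelihood ratio chi-square for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    likelihood_ratio = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to this sum
                likelihood_ratio += 2.0 * observed * math.log(observed / expected)
    return pearson, likelihood_ratio


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts, invented for illustration only.
table = [[30, 10], [20, 40]]
pearson, likelihood_ratio = chi_square_2x2(table)
print(pearson, p_value_df1(pearson))
print(likelihood_ratio, p_value_df1(likelihood_ratio))
```

The two statistics usually land close to each other in large samples, which is why the text treats the second as confirming the first.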

The alternative hypothesis asserts that there is an association between grades assigned and students' racial background. You can ignore the fourth chi-square, or the Mantel-Haenszel statistic, because this test requires that both column and row variables be on the ordinal scale.

And as the dichotomized race2 in these data is a nominal-level measure, you can discard this result. The rest of the information in Part B describes the strength of the relationship between race2 and grade2. The Phi, Contingency, and Cramer's V coefficients are all derived from the Pearson chi-square statistic. When most data points are classified into the diagonal cells, the phi coefficient is positive. When most data points are in the off-diagonal cells, the phi coefficient is negative.

Hence, the current phi coefficient indicates a negative, but moderate, degree of association between race2 and grade2. Cramer's V coefficient is also called a rescaled phi coefficient because it is intended to correct for the theoretical upper limit of the squared phi coefficient.


Because the upper limit of the squared phi coefficient equals 1 in this case, these two values should be equal. The second index of association is the Contingency Coefficient. This index may be discarded because it could not achieve a maximum value of 1. So who needs this?
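The three chi-square-based indices discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the counts are invented, the function name is hypothetical, and the formulas are the usual textbook definitions for a 2x2 table (signed phi from the cross-product difference, contingency coefficient from chi-square and the sample size).

```python
import math


def association_2x2(table):
    """Phi, contingency coefficient, and Cramer's V for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    # Positive when the diagonal cells dominate, negative otherwise.
    phi = (a * d - b * c) / math.sqrt(r1 * r2 * c1 * c2)
    chi_square = n * phi ** 2  # Pearson chi-square recovered from phi
    contingency = math.sqrt(chi_square / (chi_square + n))
    cramers_v = phi  # in a 2x2 table, min(rows - 1, cols - 1) = 1, so V equals phi
    return phi, contingency, cramers_v


# Hypothetical counts with most observations in the off-diagonal cells.
table = [[10, 30], [40, 20]]
phi, contingency, cramers_v = association_2x2(table)
print(phi, contingency, cramers_v)
```

Because the contingency coefficient divides by chi-square plus n, it stays strictly below 1 however strong the association, which is the limitation the text points to.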


Fisher's exact test is suitable for small samples and small cell sizes. A detailed discussion of this exact test is provided in a later example. Part D can be ignored because these indices are suitable only for ordered categories or ordinal variables. It would be far-fetched to assume that the current row and column variables were measured on an ordinal scale.

An odds ratio of this magnitude indicates a strong association between the row variable race2 and the column variable grade2 because it deviates noticeably from 1, which indicates no association between these two variables. The other two measures in Part E, the relative risks, are suitable for cohort studies in which the two groups are identified on the basis of an explanatory variable (presence or absence of a defect gene, for example). The binary outcome (color blindness or normal vision) is observed for both groups, and their relative risks, as conditioned on both group sizes, are computed for each outcome.
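The odds ratio and the two relative risks are ratios built from the same four cell counts. Below is a minimal Python sketch with invented counts; the function name and the row and column labels in the comments are hypothetical, and the formulas are the standard definitions for a 2x2 table with groups in rows and outcomes in columns.

```python
def odds_ratio_and_risks(table):
    """Odds ratio and column-wise relative risks for a 2x2 table.

    Rows are the two groups; columns are the two outcomes.
    """
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    # Risk of outcome 1 in row 1 relative to row 2.
    risk_col1 = (a / (a + b)) / (c / (c + d))
    # Risk of outcome 2 in row 1 relative to row 2.
    risk_col2 = (b / (a + b)) / (d / (c + d))
    return odds_ratio, risk_col1, risk_col2


# Hypothetical counts, invented for illustration only.
table = [[10, 30], [40, 20]]
odds_ratio, risk_col1, risk_col2 = odds_ratio_and_risks(table)
print(odds_ratio, risk_col1, risk_col2)
```

A value of 1 for any of the three means the two groups do not differ on that measure; the further the value is from 1 in either direction, the stronger the association.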

Obviously, all three measures in Part E are relevant and appropriate for epidemiological or public health studies. Thus, no further explanation is given here. So what exactly is the relationship? We can say that the association is beyond chance alone, yet it is not strong.

In an earlier example, the emphasis was on the description of a three-way relationship between the race of students and the students' grades in several courses. We now revisit this three-way table with the purpose of testing if a three-way relationship exists among these three variables.

The statistical test suitable for this purpose is the Cochran-Mantel-Haenszel test. The test is requested by the keyword CMH, an abbreviation of the names of the three statisticians who invented this test. The data are simplified as before. The rest of the program is self-explanatory (I hope!).
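To make the idea behind the test concrete, here is a minimal Python sketch of the Cochran-Mantel-Haenszel statistic for a set of 2x2 tables, one per stratum (here, one per course). It is an illustration of the standard formula, not a reproduction of the SAS output: the counts are invented and the function name is hypothetical. Within each stratum, the observed count in the first cell is compared with its expectation under no association, and the differences are pooled across strata.

```python
import math


def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel statistic for a list of 2x2 tables.

    Each table is [[a, b], [c, d]]; the result has 1 degree of freedom.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        r1, r2, c1, c2 = a + b, c + d, a + c, b + d
        diff_sum += a - r1 * c1 / n  # observed minus expected first cell
        var_sum += r1 * r2 * c1 * c2 / (n * n * (n - 1))
    return diff_sum ** 2 / var_sum


# Hypothetical strata (for example, two courses); invented counts.
strata = [[[10, 30], [40, 20]], [[5, 15], [20, 10]]]
cmh = cmh_statistic(strata)
p_level = math.erfc(math.sqrt(cmh / 2.0))  # upper tail, 1 degree of freedom
print(cmh, p_level)
```

With a single stratum the statistic reduces to the Pearson chi-square multiplied by (n - 1)/n, which is a handy sanity check on the formula.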

This output is divided into three parts. Part A presents 3 two-way tables. Each two-way table is a grade2 distribution of students classified by race2 in a particular course. Therefore, 3 two-way tables are constructed. Part B presents a series of tests designed to uncover any relationship among race2, grade2, and course. The first CMH statistic requires both race2 and grade2 to be measured at the ordinal level or higher.

Because it is difficult to justify either race2 or grade2 to be an ordinal-level variable, such a test is waived from further interpretation. The second CMH statistic tests whether the average grade2 score is the same across all race2 categories in all courses.


The alternative hypothesis states that, for at least one course, the mean score of grade2 differs across race2 categories. Hence, the significant result of the second CMH is worth paying attention to. The significance level (Prob) on the printout is very small. To better understand this rather general conclusion, let's go back to the data and investigate further.