Link Search Menu Expand Document

Project 04 - Exploratory Data Analysis

Due Date: Monday Oct 14 @ 6am Uploaded to Canvas

Overview

In a data science project, getting to know your data is usually one of the first steps performed. Sometimes this is referred to as Making Sense of the Data. In this project, you’ll create a program that will calculate some descriptive statistics and other analysis for various data sets.

A lot of data sets that you will encounter will be “rectangular”. Think of a spreadsheet for example. In a spreadsheet, we often think of each row as being data about a particular thing… a person, a review on a website, etc. Each column often represents the value of a characteristic of that thing. If each row represents a person, then there might be a column holding age, one holding phone number, one holding email address, etc. In Data Science, another word used for each row would be an observation. Another word used for each column is variable or feature.

For this project, you’ll only be working with a single variable/feature.

Descriptive Statistics

Descriptive statistics do just that… they describe a variable. If you’ve taken a stats course, you may have heard the phrase “measures of central tendency”. This includes the mean, median, and mode. You may have also be familiar with various ways of describing the distribution of the data. One traditional way is calculating the variance or standard deviation for the variable. Another way of communicating the distribution is to use a histogram.

Your Task

  1. Ask the user to enter a datafile name. You may assume that the data file is in the same folder as the python script. Therefore, you don’t need a full path. You can also assume that all values fall between 1 and 100 (inclusive).
  2. Open the file whose name the user provided in step 1. The file will contain a single column of numbers representing some variable. Read all of the numbers into a list (readlines()) and iterate over them to remove the trailing newline characters (strip()).
  3. Calculate the various descriptive statistics for the data set. Note: there are Python functions that will do this stuff for you. However, the purpose of this project is to get comfortable with looping and repetition. So, implement the functionality from scratch.
    • arithmetic mean
    • standard deviation
    • count the number of observations that fall into buckets of size 10 to create a histogram.
  4. Display the size of the data set, the mean, the standard deviation, and histogram to the user. Normally, histograms are displayed with the bars rising from the x axis. To make this a little easier, you may display the histogram with the bars starting from the left side of the graph. See the example below.
    • the size of a data set can be abbreviated with n
    • In the histogram, use an asterisk to display one observation. In the example below, the histogram is an example and doesn’t necessarily represent the same data as the count, mean, and std.dev above it.

Example Output

Description of data in input_file.txt
  n        = 86
  mean     = 73.5
  st. dev. = 8.3
  Histogram:
  ---------------------------------------------------------------------------
   01-10: **
   11-20: **
   21-30: *****
   31-40: ***********
   41-50: *************
   51-60: *******************
   61-70: ***************************
   71-80: ************************************
   81-90: ***********************
  91-100: ****************
  ----------------------------------------------------------------------------

Important Note Your project should work with any input file that is formatted according to the rules listed above. You can assume that any file used for grading with follow those stipulations. The histogram drawing imposes a practical limit on the max size of the data set. So, you don’t need to worry about scaling the sizes of the buckets.

Submission and Grading

Your project (single Python file) should be submitted to Canvas in the appropriate PA04 location.

It will be graded according to the following rubric:

  • Source Code Quality: 40%
    • Proper use of vertical white space
    • Proper use of horizontal white space
    • Understandable variable names
    • File header comments with your name, SMU ID, and a brief explanation of the project.
    • Proper use of loops to calculate the mean and the standard deviation.
  • Correctness: 30%
    • Does your properly calculate from the data the mean and standard deviation?
    • Does it properly count the items that belong to each bucket?
  • User Interface: 30%
    • Is the prompt for the file name reasonable??
    • Is the output nicely formatted?
    • Are the floating point values limited to a reasonable number of decimal places of precision?