Comment from the Stata technical group

The Workflow of Data Analysis Using Stata, by J. Scott Long, is a productivity tool for data analysts. Long guides you toward streamlining your workflow, because a good workflow is essential for replicating your work, and replication is essential for good science.

A workflow of data analysis is a process for managing all aspects of data analysis. Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow. Long shows how to design and implement efficient workflows for both one-person projects and team projects. An efficient workflow reduces the time you spend doing data management and lets you produce datasets that are easier to analyze. When you methodically clean your data and carefully choose names and effective labels for your variables, the time you spend doing statistical and graphical analyses will be more productive and more enjoyable.

After introducing workflows and explaining how a better workflow can make it easier to work with data, Long describes planning, organizing, and documenting your work. He then introduces how to write and debug Stata do-files and how to use local and global macros. Long presents conventions that greatly simplify data analysis: conventions for naming, labeling, documenting, and verifying variables. He also covers cleaning, analyzing, and protecting your data.

While describing effective workflows, Long also introduces the concepts of basic data management using Stata and writing Stata do-files. Using real-world examples, Stata commands, and Stata scripts, Long illustrates effective techniques for managing your data and analyses. If you analyze data, this book is recommended for you.

Table of contents
List of tables
List of figures
List of examples
Preface
A word about fonts, files, commands, and examples
1 Introduction
1.1 Replication: The guiding principle for workflow
1.2 Steps in the workflow
1.2.1 Cleaning data
1.2.2 Running analysis
1.2.3 Presenting results
1.2.4 Protecting files
1.3 Tasks within each step
1.3.1 Planning
1.3.2 Organization
1.3.3 Documentation
1.3.4 Execution
1.4 Criteria for choosing a workflow
1.4.1 Accuracy
1.4.2 Efficiency
1.4.3 Simplicity
1.4.4 Standardization
1.4.5 Automation
1.4.6 Usability
1.4.7 Scalability
1.5 Changing your workflow
1.6 How the book is organized
2 Planning, organizing, and documenting
2.1 The cycle of data analysis
2.2 Planning
2.3 Organization
2.3.1 Principles for organization
2.3.2 Organizing files and directories
2.3.3 Creating your directory structure
A directory structure for a small project
A directory structure for a large, one-person project
Directories for collaborative projects
Special-purpose directories
Remembering what directories contain
Planning your directory structure
Naming files
Batch files
2.3.4 Moving into a new directory structure (advanced topic)
Example of moving into a new directory structure
2.4 Documentation
2.4.1 What should you document?
2.4.2 Levels of documentation
2.4.3 Suggestions for writing documentation
Evaluating your documentation
2.4.4 The research log
A sample page from a research log
A template for research logs
2.4.5 Codebooks
A codebook based on the survey instrument
2.4.6 Dataset documentation
2.5 Conclusions
3 Writing and debugging do-files
3.1 Three ways to execute commands
3.1.1 The Command window
3.1.2 Dialog boxes
3.1.3 Do-files
3.2 Writing effective do-files
3.2.1 Making do-files robust
Make do-files self-contained
Use version control
Exclude directory information
Include seeds for random numbers
3.2.2 Making do-files legible
Use lots of comments
Use alignment and indentation
Use short lines
Limit your abbreviations
Be consistent
3.2.3 Templates for do-files
Commands that belong in every do-file
A template for simple do-files
A more complex do-file template
3.3 Debugging do-files
3.3.1 Simple errors and how to fix them
Log file is open
Log file already exists
Incorrect command name
Incorrect variable name
Incorrect option
Missing comma before options
3.3.2 Steps for resolving errors
Step 1: Update Stata and user-written programs
Step 2: Start with a clean slate
Step 3: Try other data
Step 4: Assume everything could be wrong
Step 5: Run the program in steps
Step 6: Exclude parts of the do-file
Step 7: Starting over
Step 8: Sometimes it is not your mistake
3.3.3 Example 1: Debugging a subtle syntax error
3.3.4 Example 2: Debugging unanticipated results
3.3.5 Advanced methods for debugging
3.4 How to get help
3.5 Conclusions
4 Automating your work
4.1 Macros
4.1.1 Local and global macros
Local macros
Global macros
Using double quotes when defining macros
Creating long strings
4.1.2 Specifying groups of variables and nested models
4.1.3 Setting options with locals
4.2 Information returned by Stata commands
Using returned results with local macros
The foreach command
The forvalues command
4.3.1 Ways to use loops
Loop example 1: Listing variable and value labels
Loop example 2: Creating interaction variables
Loop example 3: Fitting models with alternative measures of education
Loop example 4: Recoding multiple variables the same way
Loop example 5: Creating a macro that holds accumulated information
Loop example 6: Retrieving information returned by Stata
4.3.2 Counters in loops
Using loops to save results to a matrix
4.3.3 Nested loops
4.3.4 Debugging loops
4.4 The include command
4.4.1 Specifying the analysis sample with an include file
4.4.2 Recoding data using include files
4.4.3 Caution when using include files
4.5 Ado-files
4.5.1 A simple program to change directories
4.5.2 Loading and deleting ado-files
4.5.3 Listing variable names and labels
4.5.4 A general program to change your working directory
4.5.5 Words of caution
4.6 Help files
4.6.1 nmlabel.hlp
4.6.2 help me
4.7 Conclusions
5 Names, notes, and labels
5.1 Posting files
5.2 The dual workflow of data management and statistical analysis
5.3 Names, notes, and labels
5.4 Naming do-files
5.4.1 Naming do-files to re-create datasets
5.4.2 Naming do-files to reproduce statistical analysis
5.4.3 Using master do-files
Master log files
5.4.4 A template for naming do-files
Using subdirectories for complex analysis
5.5 Naming and internally documenting datasets
Never name it final!
5.5.1 One time only and temporary datasets
5.5.2 Datasets for larger projects
5.5.3 Labels and notes for datasets
5.5.4 The datasignature command
A workflow using the datasignature command
Changes datasignature does not detect
5.6 Naming variables
5.6.1 The fundamental principle for creating and naming variables
5.6.2 Systems for naming variables
Sequential naming systems
Source naming systems
Mnemonic naming systems
5.6.3 Planning names
5.6.4 Principles for selecting names
Anticipate looking for variables
Use simple, unambiguous names
Try names before you decide
5.7 Labeling variables
5.7.1 Listing variable labels and other information
Changing the order of variables in your dataset
5.7.2 Syntax for label variable
5.7.3 Principles for variable labels
Beware of truncation
Test labels before you post the file
5.7.4 Temporarily changing variable labels
5.7.5 Creating variable labels that include the variable name
5.8 Adding notes to variables
5.8.1 Commands for working with notes
Listing notes
Removing notes
Searching notes
5.8.2 Using macros and loops with notes
5.9 Value labels
5.9.1 Creating value labels is a two-step process
Step 1: Defining labels
Step 2: Assigning labels
Why a two-step system?
Removing labels
5.9.2 Principles for constructing value labels
1) Keep labels short
2) Include the category number
3) Avoid special characters
4) Keeping track of where labels are used
5.9.3 Cleaning value labels
5.9.4 Consistent value labels for missing values
5.9.5 Using loops when assigning value labels
5.10 Using multiple languages
5.10.1 Using label language for different written languages
5.10.2 Using label language for short and long labels
5.11 A workflow for names and labels
Step 1: Plan the changes
Step 2: Archive, clone, and rename
Step 3: Revise variable labels
Step 4: Revise value labels
Step 5: Verify the changes
5.11.1 Step 1: Check the source data
Step 1a: List the current names and labels
Step 1b: Try the current names and labels
5.11.2 Step 2: Create clones and rename variables
Step 2a: Create clones
Step 2b: Create rename commands
Step 2c: Rename variables
5.11.3 Step 3: Revise variable labels
Step 3a: Create variable-label commands
Step 3b: Revise variable labels
5.11.4 Step 4: Revise value labels
Step 4a: List the current labels
Step 4b: Create label define commands to edit
Step 4c: Revise labels and add them to dataset
5.11.5 Step 5: Check the new names and labels
5.12 Conclusions
6 Cleaning your data
6.1 Importing data
6.1.1 Data formats
ASCII data formats
Binary-data formats
6.1.2 Ways to import data
Stata commands to import data
Using other statistical packages to export data
Using a data conversion program
6.1.3 Verifying data conversion
Converting the ISSP 2002 data from Russia
6.2 Verifying variables
6.2.1 Values review
Values review of data about the scientific career
Values review of data on family values
6.2.2 Substantive review
What does time to degree measure?
Examining high-frequency values
Links among variables
Changes in survey questions
6.2.3 Missing-data review
Comparisons and missing values
Creating indicators of whether cases are missing
Using extended missing values
Verifying and expanding missing-data codes
Using include files
6.2.4 Internal consistency review
Consistency in data on the scientific career
6.2.5 Principles for fixing data inconsistencies
6.3 Creating variables for analysis
6.3.1 Principles for creating new variables
New variables get new names
Verify that new variables are correct
Document new variables
Keep the source variables
6.3.2 Core commands for creating variables
The generate command
The clonevar command
The replace command
6.3.3 Creating variables with missing values
6.3.4 Additional commands for creating variables
The recode command
The egen command
The tabulate, generate() command
6.3.5 Labeling variables created by Stata
6.3.6 Verifying that variables are correct
Checking the code
Listing variables
Plotting continuous variables
Tabulating variables
Constructing variables multiple ways
6.4 Saving datasets
6.4.1 Selecting observations
Deleting cases versus creating selection variables
6.4.2 Dropping variables
Selecting variables for the ISSP 2002 Russian data
6.4.3 Ordering variables
6.4.4 Internal documentation
6.4.5 Compressing variables
6.4.6 Running diagnostics
The codebook, problem command
Checking for unique ID variables
6.4.7 Adding a data signature
6.4.8 Saving the file
6.4.9 After a file is saved
6.5 Extended example of preparing data for analysis
Creating control variables
Creating binary indicators of positive attitudes
Creating four-category scales of positive attitudes
6.6.1 Match-merging
Sorting the ID variable
6.6.2 One-to-one merging
Combining unrelated datasets
6.6.3 Forgetting to match-merge
6.7 Conclusions
7 Analyzing data and presenting results
7.1 Planning and organizing statistical analysis
7.1.1 Planning in the large
7.1.2 Planning in the middle
7.1.3 Planning in the small
7.2 Organizing do-files
7.2.1 Using master do-files
7.2.2 What belongs in your do-file?
7.3 Documentation for statistical analysis
7.3.1 The research log and comments in do-files
7.3.2 Documenting the provenance of results
Captions on graphs
7.4 Analyzing data using automation
7.4.1 Locals to define sets of variables
7.4.2 Loops for repeated analyses
Computing t tests using loops
Loops for alternative model specifications
7.4.3 Matrices to collect and print results
Collecting results of t tests
Saving results from nested regressions
Saving results from different transformations of articles
7.4.4 Creating a graph from a matrix
7.4.5 Include files to load data and select your sample
7.5 Baseline statistics
7.6 Replication
7.6.1 Lost or forgotten files
7.6.2 Software and version control
7.6.3 Unknown seed for random numbers
Bootstrap standard errors
Letting Stata set the seed
Training and confirmation samples
7.6.4 Using a global that is not in your do-file
7.7 Presenting results
7.7.1 Creating tables
Using spreadsheets
Regression tables with esttab
7.7.2 Creating graphs
Colors, black, and white
Font size
7.7.3 Tips for papers and presentations
Papers
Presentations
7.8 A project checklist
7.9 Conclusions
8 Protecting your files
8.1 Levels of protection and types of files
8.2 Causes of data loss and issues in recovering a file
8.3 Murphy’s law and rules for copying files
8.4 A workflow for file protection
Part 1: Mirroring active storage
Part 2: Offline backups
8.6 Conclusions
9 Conclusions
A How Stata works
A.1 How Stata works
Stata directories
The working directory
A.3 Customizing Stata
A.3.1 Fonts and window locations
A.3.2 Commands to change preferences
Options that can be set permanently
Options that need to be set each session
A.3.3 profile.do
Function keys
A.4 Additional resources
References
Author index
Subject index