STATA 101

Alosias A, Behavioral Development Lab


What is Stata?

Stata is a general-purpose statistical package that is widely used in the social sciences. Its most important capabilities include data management, statistical analysis, simulation, graphics, and custom programming.

The Basics: What does Stata look like?

As you will immediately notice when Stata starts up, the screen is divided into several windows. The main windows and tools are as follows:
COMMAND: You can tell Stata what to do by typing in commands. Click inside the Command window, type display "Hello!" and press Enter.
RESULTS: Here Stata displays each command followed by the output it has produced. Note what appeared as the result of the command you just typed in.
VARIABLES: Lists all the variables in the dataset. The Variables window can act as a shortcut for creating commands. Try clicking on one of the variables. It should appear in the Command window, eliminating the need for you to write it out.
REVIEW: Lists all your prior commands. Notice that display "Hello!" now appears there. You can click on it and it will appear in the Command window.
Useful tip: When you are in the Command window, you can scroll through your previous commands using the Page Up and Page Down keys on your keyboard.
DATA EDITOR: Now let's look at what a dataset stored in Stata actually looks like (the way Stata sees it and reads it). Type browse in the Command window.
DO-FILE: Although we can type commands directly in the Command window, it is better to write all commands in a single file so we can save and reproduce them in the future. The set of commands we want Stata to execute can be saved in what Stata calls "do-files". To open the Do-file Editor, click its icon in the toolbar at the top of the screen.
We can start the do-file with two commands, each on its own line: clear all and set more off. The first clears our workspace. The second tells Stata not to pause when the output fills the screen (otherwise Stata waits until we click "more" before continuing).
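As a minimal sketch, the top of a do-file might look like the lines below. Loading the shipped auto dataset is only an assumption made here so there is something in memory to check; your own do-file would load your own data.

```stata
* Start of a do-file: reset the session before doing anything else
clear all          // remove data and stored results from memory
set more off       // do not pause when output fills the Results window

* Load a practice dataset and confirm it is in memory
sysuse auto, clear
display "Observations loaded: " _N
```

Each command sits on its own line. Text after * or // is a comment and is ignored by Stata.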


Some Basic Commands

1. Select directory
The first step is to select the directory where we are going to work (i.e., the folder that contains the dataset and where Stata will save the files resulting from our work). To do so, we use the change directory command cd:

cd "c:/path" for Windows, or
cd "/Users/user_name/path" for Mac OS
Useful tip: When you are in the Command window, you can type pwd to see your current directory. On Windows, typing cd with no path does the same; on Mac OS, cd with no path moves you to your home directory instead.
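The examples later in this handout refer to a folder through $dir, a global macro. A small sketch of how such a macro could be set up is shown below. To keep the sketch runnable anywhere, it assumes the project folder is simply the folder Stata is already in; in practice you would put your own path between the quotes.

```stata
* Show the folder Stata is currently working in
display "`c(pwd)'"

* Store a project folder in the global macro dir.
* Assumption: we reuse the current folder; replace it with your own
* path, e.g. global dir "c:/path"
global dir "`c(pwd)'"

* Change to that folder and confirm the result
cd "$dir"
pwd
```

Once dir is defined, a path such as "$dir/file.dta" expands to the full file location.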
2. Importing Datasets into Stata
(i) To load one of the example datasets installed with Stata - use the command sysuse
E.g. sysuse auto.dta, clear - "auto" is one of the example datasets shipped with Stata.

(ii) To load one of Stata's online example datasets (requires an internet connection) - use the command webuse
E.g. webuse auto.dta, clear

(iii) To load a Stata dataset from a local directory - use the command use
E.g. use "$dir/file.dta", clear - "$dir" refers to a global macro named dir, which must already hold the folder path.

(iv) To import a .csv file from a local directory - use the command import delimited
E.g. import delimited "$dir/csv_file.csv", clear

(v) To import an .xls or .xlsx file from a local directory - use the command import excel
E.g. import excel "$dir/xlsx_file.xlsx", firstrow sheet(sheet_name) clear - "firstrow" uses the first row of the file as variable names, "sheet()" selects the worksheet to import.
Useful tip: Add the clear option when you load or import a dataset so that Stata discards the data currently in memory.
3. Explore and summarize data
Please use the Stata example datasets for this exercise. The sysuse command loads a specified Stata-format dataset that was shipped with Stata. Here we will use the auto data file. E.g. sysuse auto.dta, clear

(i) describe
The describe command shows you basic information about a Stata data file. As you can see, it tells us the number of observations in the file, the number of variables, the names of the variables, and more.

describe

 Contains data from auto.dta
  obs:            74                          
 vars:            12                          17 Feb 1999 10:49
 size:         3,108 (99.6% of memory free)
-------------------------------------------------------------------------------
   1. make      str17  %17s                   
   2. price     int    %9.0g                  
   3. mpg       byte   %9.0g                  
   4. rep78     byte   %9.0g                  
   5. hdroom    float  %9.0g                  
   6. trunk     byte   %9.0g                  
   7. weight    int    %9.0g                  
   8. length    int    %9.0g                  
   9. turn      byte   %9.0g                  
  10. displ     int    %9.0g                  
  11. gratio    float  %9.0g                  
  12. foreign   byte   %9.0g                  
-------------------------------------------------------------------------------
Sorted by:  
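describe also accepts a list of variables and a few options. The sketch below shows two common variations and reads the counts that describe leaves behind in r(). It assumes the auto dataset shipped with a current Stata installation, so the listing may look slightly different from the older output above.

```stata
sysuse auto, clear

describe make price mpg     // describe only the selected variables
describe, short             // header information only, no variable list

* describe stores the number of observations and variables in r()
display "Observations: " r(N) "   Variables: " r(k)
```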


(ii) codebook
The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file. Have a look at what it produces below.

codebook make price

 make -------------------------------------------------------------- (unlabeled)
                  type:  string (str17)

         unique values:  74                   coded missing:  0 / 74

              examples:  "Cad. Deville"
                         "Dodge Magnum"
                         "Merc. XR-7"
                         "Pont. Catalina"

               warning:  variable has embedded blanks

price ------------------------------------------------------------- (unlabeled)
                  type:  numeric (int)

                 range:  [3291,15906]                 units:  1
         unique values:  74                   coded missing:  0 / 74

                  mean:   6165.26
              std. dev:    2949.5

           percentiles:        10%       25%       50%       75%       90%
                              3895      4195    5006.5      6342     11385

 


(iii) inspect
Another useful command for getting a quick overview of a data file is the inspect command. Here is what the inspect command produces for the auto data file.

inspect mpg

mpg:                                           Number of Observations
------                                                             Non-
                                               Total   Integers    Integers
|      #                        Negative           -         -          -
|      #                        Zero               -         -          -
|      #                        Positive          74        74          -
|  #   #                                       -----     -----      -----
|  #   #   #                    Total             74        74          -
|  #   #   #   #   .            Missing            -
+----------------------                        -----
12                   41                           74
  (21 unique values)


(iv) list
The list command is useful for viewing all or a range of observations. Here we look at make, price, mpg, rep78 and foreign for the first 10 observations.

list make price mpg rep78 foreign in 1/10

                   make      price        mpg      rep78    foreign 
  1.      Dodge Magnum       5886         16          2          0  
  2.        Datsun 510       5079         24          4          1  
  3.      Ford Mustang       4187         21          3          0  
  4.  Linc. Versailles      13466         14          3          0  
  5.     Plym. Sapporo       6486         26          .          0  
  6.       Plym. Arrow       4647         28          3          0  
  7.     Cad. Eldorado      14500         14          2          0  
  8.        AMC Spirit       3799         22          .          0  
  9.    Pont. Catalina       5798         18          4          0  
 10.        Chev. Nova       3955         19          3          0   
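list can also be combined with an if condition or with a range counted from the end of the data. The following is a short sketch using the auto dataset; the price threshold of 10000 is an arbitrary value chosen for illustration.

```stata
sysuse auto, clear

* List selected variables for cars priced above 10,000
list make price mpg if price > 10000

* List the last five observations (l stands for the last observation)
list make price in -5/l

* count reports how many observations satisfy a condition
count if price > 10000
```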


(v) tabulate
(a) The tabulate command is useful for obtaining frequency tables. Below, we make a table for rep78 and a table for foreign. The command can also be shortened to tab.

tabulate rep78
tab rep78

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00 

tabulate foreign or tab foreign

    foreign |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00 


(b) The tab1 command can be used as a shortcut to request tables for a series of variables (instead of typing the tabulate command over and over again for each variable of interest).

tab1 rep78 foreign

 -> tabulation of rep78  

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

-> tabulation of foreign  

    foreign |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00 


(c) We can use the plot option to make a plot to visually show the tabulated values.

tabulate rep78, plot

 
      rep78 |      Freq.
------------+------------+-----------------------------------------------------
          1 |          2 |**
          2 |          8 |********
          3 |         30 |******************************
          4 |         18 |******************
          5 |         11 |***********
------------+------------+-----------------------------------------------------
      Total |         69  


(d) We can also make crosstabs using tabulate. Let’s look at the repair history broken down by foreign and domestic cars.

tabulate rep78 foreign

           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69  
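Notice that the tables for rep78 total 69 rather than 74, because tabulate leaves out missing values by default. As a brief sketch, the missing option brings them back as their own category, and the row option adds row percentages to a crosstab.

```stata
sysuse auto, clear

* Include missing values of rep78 as a separate category
tabulate rep78, missing

* Crosstab with row percentages shown under each frequency
tabulate rep78 foreign, row
```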


(vi) summarize
(a) For summary statistics, we can use the summarize command. Let’s generate some summary statistics on mpg.

summarize mpg

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
     mpg |      74     21.2973   5.785503         12         41   


(b) We can use the detail option of the summarize command to get more detailed summary statistics.

summarize mpg, detail

                              mpg
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. Dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005 
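After summarize runs, its results remain available in r() and can be reused in later commands. The sketch below shows one possible use, centering mpg on its mean; the variable name mpg_centered is made up for this example.

```stata
sysuse auto, clear

* quietly runs the command without printing its output
quietly summarize mpg
display "Mean of mpg: " r(mean)
display "SD of mpg:   " r(sd)

* Use a stored result in a calculation: center mpg on its mean
generate mpg_centered = mpg - r(mean)
summarize mpg_centered
```

Typing return list right after summarize shows every stored result that is available.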


(vii) mean
Let's generate the mean, standard error, and 95% confidence interval for price and mpg.

mean price mpg

Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
         mpg |    21.2973   .6725511       19.9569    22.63769
--------------------------------------------------------------
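Two options of mean are often handy: over() for group means and level() for a different confidence level. A minimal sketch with the auto dataset follows; the 90% level is chosen only for illustration.

```stata
sysuse auto, clear

* Mean price separately for domestic and foreign cars
mean price, over(foreign)

* Same estimates as above, but with 90% confidence intervals
mean price mpg, level(90)
```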


(viii) sort
The sort command arranges the observations of the current data in ascending order based on the values of the variables in varlist. Below, the first five prices are shown before and after sorting.

sort price

     +-------+        +-------+
     | price |        | price |
     |-------|        |-------| 
  1. | 4,099 |     1. | 3,291 |
  2. | 4,749 |     2. | 3,299 |
  3. | 3,799 | --> 3. | 3,667 |
  4. | 4,816 |     4. | 3,748 |
  5. | 7,827 |     5. | 3,798 |
     +-------+        +-------+
  First 5 observations shown


(ix) gsort
The gsort command arranges observations in ascending or descending order of the specified variables. It differs from sort in that sort produces ascending-order arrangements only. A minus sign in front of a variable name requests descending order.

gsort -price

     +-------+        +--------+
     | price |        | price  |
     |-------|        |--------| 
  1. | 4,099 |     1. | 15,906 |
  2. | 4,749 |     2. | 14,500 |
  3. | 3,799 | --> 3. | 13,594 |
  4. | 4,816 |     4. | 13,466 |
  5. | 7,827 |     5. | 12,990 |
     +-------+        +--------+
  First 5 observations shown
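Both commands accept more than one variable. As a sketch, the lines below first sort by origin and price in ascending order, then use gsort to keep the origin groups in ascending order while putting the most expensive car first within each group.

```stata
sysuse auto, clear

* sort: ascending on foreign, then ascending on price within foreign
sort foreign price
list make foreign price in 1/5

* gsort: ascending on foreign, descending on price within foreign
gsort foreign -price
list make foreign price in 1/5
```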


(x) order
The order command is used to reorder the variables in a dataset. Here we move foreign so that it becomes the first variable.

order foreign

     +---------------------------------+          +---------------------------------------+
     | make                price   mpg |          | foreign     make                price |
     |---------------------------------|          |---------------------------------------|
  1. | Cad. Seville       15,906    21 |       1. | Domestic    Cad. Seville       15,906 |
  2. | Cad. Eldorado      14,500    14 |       2. | Domestic    Cad. Eldorado      14,500 |
  3. | Linc. Mark V       13,594    12 |  -->  3. | Domestic    Linc. Mark V       13,594 |
  4. | Linc. Versailles   13,466    14 |       4. | Domestic    Linc. Versailles   13,466 |
  5. | Peugeot 604        12,990    14 |       5. | Foreign     Peugeot 604        12,990 |
     +---------------------------------+          +---------------------------------------+        
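order also has options for placing variables at a specific position. The following sketch assumes a reasonably recent version of Stata, where the first, last, and after() options are available.

```stata
sysuse auto, clear

order mpg price, after(make)   // place mpg and price right after make
order foreign, first           // same effect as: order foreign
order rep78, last              // move rep78 to the end

describe, simple               // print the variable names in their new order
```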


(xi) browse
The browse command opens the Data Editor in read-only mode, so the data can be viewed but not changed. It is a convenient alternative to list.

browse or br

(xii) generate
(a) The generate command creates a new variable. The values of the variable are specified by =exp. The command can be shortened to gen or g.
generate or gen or g

E.g. generate new_var = 1

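Operators are important when we use an if condition. The last two entries below are built-in system variables rather than operators, but they are often used in the same expressions.

Operator    Meaning
&           and
|           or
!=          not equal to (~= also works)
==          equal to (used in conditions; a single = assigns a value)
>           greater than (>= greater than or equal to)
<           less than (<= less than or equal to)
_n          number of the current observation (row number)
_N          total number of observations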
  E.g. gen new_var1 = 1 if foreign == 1
       gen new_var2 = 1 if foreign == 1 & rep78 == 3
       gen new_var3 = 1 if foreign == 1 & rep78 != 3 
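A few points about generate are easy to miss: observations that fail the if condition are set to missing, a logical expression on its own yields 0 or 1, and missing values are treated as larger than any number. The sketch below illustrates each point with the auto dataset; all new variable names are made up for this example.

```stata
sysuse auto, clear

* A 0/1 indicator: the expression evaluates to 1 (true) or 0 (false)
generate byte expensive = price > 5000

* With "if", observations that fail the condition are set to missing (.)
generate byte foreign_flag = 1 if foreign == 1

* Missing values count as larger than any number, so exclude them explicitly
generate byte good_repair = rep78 >= 4 if !missing(rep78)

* _n is the current observation number, _N the total number of observations
generate id = _n
generate share_done = _n / _N

list make rep78 good_repair id in 1/5
```

Without the !missing(rep78) condition, cars with a missing repair record would be coded 1 in good_repair.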


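(b) The egen command (extensions to generate) creates a new variable using a function fcn. Depending on the function, its arguments are an expression, a list of variables, or a list of numbers. Here we store the overall mean of price in a new variable.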

egen new_mean_var = mean(price)
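egen is most useful together with the bysort prefix, which computes the function separately within each group. A minimal sketch follows; the new variable names are illustrative only.

```stata
sysuse auto, clear

* Overall mean of price, repeated on every row
egen mean_price = mean(price)

* Group mean: one value for domestic cars, another for foreign cars
bysort foreign: egen mean_price_grp = mean(price)

* Number of non-missing rep78 values within each group
bysort foreign: egen n_rep78 = count(rep78)

* Show the group means side by side
tabulate foreign, summarize(mean_price_grp)
```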


(xiii) drop
Drop variables or observations. The command drop eliminates variables or observations from the data in memory. To drop variables, list their names; to drop observations, use an if condition without a variable list.
drop price - to eliminate the price variable from the dataset
drop if price < 5000 - to eliminate the observations (rows) whose price is less than 5000

(xiv) keep
keep works the same way as drop, except that you specify the variables or observations to be kept rather than the variables or observations to be deleted.
keep price - to keep the price variable only and drop all other variables from the dataset
keep if price < 5000 - to keep only the observations (rows) whose price is less than 5000
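Because drop and keep change the data in memory, it can help to wrap experiments in preserve and restore. The sketch below removes rows with an if condition and columns with a variable list, then returns to the original data before trying keep.

```stata
sysuse auto, clear
count                          // number of observations before any change

preserve                       // take a snapshot of the data in memory
drop if price < 5000           // remove observations (rows)
drop rep78                     // remove a variable (column)
count
restore                        // go back to the snapshot

keep make price                // keep only these two variables
keep if price < 5000           // keep only the cheaper cars
count
```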

4. Exporting or saving datasets
(i) To save the dataset in Stata format to the local directory - use the command save
E.g. save "$dir/file.dta", replace

(ii) To export .csv file to the local directory - use the command export delimited
E.g. export delimited "$dir/csv_file.csv", replace

(iii) To export .xls or .xlsx file to the local directory - use the command export excel
E.g. export excel "$dir/xlsx_file.xlsx", replace
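One way to check that a file was written correctly is to read it back. The sketch below does such a round trip. It uses tempfile, which stores a temporary file name in a local macro, as a stand-in for a real path such as "$dir/file.dta", so nothing is left behind in your working folder.

```stata
sysuse auto, clear

* mydata is a local macro holding a temporary file name
tempfile mydata
save "`mydata'", replace
export delimited using "`mydata'.csv", replace

* Read the Stata file back and check its size
use "`mydata'", clear
describe, short

* Read the .csv file back and check its size
import delimited "`mydata'.csv", clear
describe, short

* Remove the .csv file, which is not deleted automatically
erase "`mydata'.csv"
```

Both describe calls should report the same number of observations and variables as the original data.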

Sources:
(1) UCLA: https://stats.idre.ucla.edu/stata/ 
(2) IPA: https://www.poverty-action.org/researchers/research-resources/stata-programs