Stata Guide for Intermediate Users
Author: Alicia Liu '24
Paths and Directories
One of the first commands you should run in Stata, and in almost any other program, is one that sets your working directory. Think of the working directory as the folder your program assumes it is working in and where it looks for everything by default (your code files do not have to be stored in that folder).
In Stata, this is done by using the command
cd "path/name"
Once this command runs, "path/name" is where Stata will save files and look for files, unless you specify a different location.
A path is a string that indicates where a folder or file is located. On a Mac, you can copy it by right-clicking (Control + click) the item and then, with the menu still open, holding the Option key, which reveals an option called "Copy as Pathname". Sometimes you can also select an item, use the standard copy command (Cmd + C), and paste to get the pathname.
Paths should always be enclosed in straight quotation marks (""). Stata only requires them when the path contains spaces, but quoting every path avoids that error entirely.
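As a minimal sketch of how this looks in practice, the snippet below sets a working directory and then saves and reloads a file by name alone. The folder path and the file name auto_copy.dta are hypothetical placeholders, so replace them with your own; auto is one of the example datasets shipped with Stata.

```stata
* Hypothetical path: replace it with a folder that exists on your computer
cd "/Users/yourname/Documents/stata_project"

* Print the current working directory to confirm the change
pwd

* Load one of Stata's built-in example datasets
sysuse auto, clear

* No folder is given, so the file is saved in the working directory
save "auto_copy.dta", replace

* Likewise, Stata looks for the file in the working directory
use "auto_copy.dta", clear
```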
Preserve/Restore
Let’s say you need to do some data cleaning that alters the original data, perhaps collapsing county-level data to the state level for another analysis, but you still want to keep the original data.
Preserve/restore is a method that allows you to alter data and then revert it to its original state.
IMPORTANT NOTE: You must run the entire block, from preserve to restore, in one go for preserve/restore to work. If you run only part of the block from a do-file, Stata automatically restores the data as soon as that part finishes running.
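A minimal sketch of a preserve/restore block is shown here. It assumes Stata's built-in census example dataset (state-level data with a region variable) as a stand-in for the county-to-state case described above; the choice of dataset and variables is only for illustration.

```stata
* Load a built-in example dataset with one row per state
sysuse census, clear

* Take a snapshot of the data as it is right now
preserve

    * Alter the data: collapse state-level population to one row per region
    collapse (sum) pop, by(region)
    list

* Throw away the collapsed version and bring back the snapshot
restore

* The original state-level data are in memory again
describe, short
```

Select and run all of these lines together, from preserve through restore.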
Another method that allows you to alter data and then return to the original is frames. For many users, frames are likely much more useful, although preserve/restore does have an advantage when used in loops.
Frames
Frames (available in Stata 16 and later) allow you to hold multiple datasets in memory at the same time. One of their biggest advantages is that you can move back and forth between frames. This fixes one of the issues with preserve/restore, where all of your code has to sit between preserve and restore.
Frames are useful in many scenarios, but these are the most common:
- Multi-task
- Work on separate but related datasets at the same time
- Test some analyses without changing your data
Basic commands
frame create newframe
- This creates a new, empty frame. You can load a completely different dataset into it.
frame change framename1
- By default, the frame you start in is named default. This command lets you switch between frames.
frame copy framename1 framename2
- This lets you copy the data from one frame to another frame.
frame drop framename
- This lets you drop a frame if you don’t need it anymore.
- NOTE: Frames stay in memory until you drop them with this command, clear them (for example, with clear all or frames reset), or close Stata.
frame rename oldname newname
- Rename frames.
frames dir
- This shows you all of the frames you have open.
- An asterisk marks any frame whose data have not been saved.
- If the data came from a .dta file, the name of the file will be shown in the listing.
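The commands above can be strung together as in the following sketch. It assumes Stata 16 or later and uses the built-in auto example dataset; the frame names scratch, backup, and notes are made up for illustration.

```stata
* WARNING: clear all discards everything in memory, including all frames.
* It is used here only so the example starts from a clean session.
clear all

* Load example data into the starting frame, which is named "default"
sysuse auto, clear

* Create a new, empty frame
frame create scratch

* Copy the data in "default" into a new frame called "backup"
frame copy default backup

* Switch to the copy and alter it
frame change backup
keep if foreign == 1
count

* Switch back: the data in "default" are untouched
frame change default
count

* Rename a frame, then list every frame in memory
frame rename scratch notes
frames dir

* Drop the frames that are no longer needed
frame drop backup
frame drop notes
```

Because the changes were made in backup, nothing has to be restored afterwards, and you can switch back to default at any point.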
Where to Find Help
Some helpful resources for troubleshooting in Stata include:
- Help guides/official documentation (they are hard to read at first but can be immensely helpful)
  - These have the most in-depth information on most, if not all, Stata commands
  - They include some examples that show how the command is used
  - You can either run "help commandname" in Stata's Command window or just Google the command name + "help guide"
- UCLA Office of Advanced Research Computing Stata Guides
- Princeton University Data Analysis Library Guides
- Other than the ones listed here, many research universities have coding guides that are generally easier to understand than official documentation and forums.
- Statalist
- Stack Exchange for Stata
- If you have a question, there’s a very good chance someone else has already run into the same situation
- Us! Your TAs, professors, etc.!
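For the official documentation mentioned in the list above, the built-in help can be opened from inside Stata. A short sketch, using commands covered earlier in this guide as the examples:

```stata
* Open the help file for a specific command
help cd
help preserve

* Help files also exist for broader topics, such as frames (Stata 16 or later)
help frames
```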
Using ChatGPT and other LLMs
ChatGPT and other generative large language models (LLMs) work by consuming large amounts of text and using it to produce output. Unfortunately, relative to other languages, there isn’t that much good Stata code out there for GPT to learn from. This means that when you ask GPT (or Bard or …) a Stata question, the output has a higher chance of being incorrect. Sometimes, LLMs make things up, which is called “hallucinating.”
The issue is that you can only verify this output (other than by running it and receiving an error) if you know the language well enough. So, if you’re learning Stata, I’d highly recommend not using ChatGPT or similar LLMs, or using them as little as possible.
Note: this issue of frequent incorrect output is much less of a concern for widely used languages such as R and Python. GitHub provides the Copilot extension for VS Code and RStudio, and you can get it for free if you sign up for the GitHub Student Developer Pack (see the guide on using GPT integrations for details).