Facebook Pixel Code
Banner image
Biostatistics Design and Analysis

Entering data in Excel

So you want to enter your data in Excel?

Statisticians have a love-hate relationship with Excel. The software is easy to use, most people have a version of it on their computer, and Excel files can be imported into most common statistical packages. That’s the love. The hate?  Excel files typically require considerable editing before they can be used for statistical analyses. The good news is that with proper planning, most of these issues can be dealt with upfront, saving you time and money that can be better spent elsewhere.

1. One sheet per file.
Because your data file will be imported into a statistical software package, please use only one sheet per file with no additional figures or comments. Each data sheet should be rectangular:  one observation per row, one variable per column.

2. Variable names
Avoid using spaces and special characters like {,@,!, #, $, %,^,&,*,(,),[,],}  when defining the variable name. Variable names should always begin with a letter.

Keep variable names short to reduce the amount of typing needed when writing programs. Variable names should be listed in row 1 of the spreadsheet and not across multiple rows.

Keep a ‘dictionary’ of variable names in a separate file, along with a more detailed description of what they represent (eg: variable name, description, and for categorical variables, define the individual categories). Variable names can be transposed when they are pasted so that, rather than appearing along a single row, they appear in a single column. You can then write the variable details in the subsequent columns, creating a document that can easily be printed out (it will be consulted often). To transpose the variable names, copy the variables in your spreadsheet, place your cursor in the upper left corner of a new workbook and choose “Paste Special” from the EDIT menu. Select the “Transpose” box and click on OK.

3. Dates
Dates in Excel can cause many problems. The most common error is the reversal of day and month, where January 12 becomes December 1. Such errors are difficult to pick up, unless the day is a number greater than 12. This type of error can happen in any data entry package, but with Excel different versions appear to have different defaults for date format, so that the data file that you’ve entered as dd/mm/yy and have carefully double checked is full of errors when read by someone whose Excel software uses mm/dd/yy as the default (Note:  This may not be the case with the more recent versions of Excel). The statistical packages can complicate this even further, trying to convert Excel dates in all sorts of strange and unusable ways. The solution, while a bit clumsier to enter, will save you substantial time and data checking in the long run:  enter the day, month, and year for all dates in separate columns. The variable BDATE for birth date then becomes the 3 variables: BDAY, BMONTH, and BYEAR. The separate variables can always be combined into one date variable in any statistical package, including in Excel, by using the DATE() function.

Note to SPSS users:  Not to worry, you can enter dates in the normal way if you are entering data in SPSS!

4. Variable type
Do not include text comments in cells that should only include numeric data. When statistical packages identify the non-numeric characters, the entire variable is classified as a character variable and as such, cannot be used by functions that calculate summary statistics like means, variances, etc.

Even a single space preceding a number or alone in a cell can cause the entire variable to be read in as a character variable.

5. Age and duration
Always ensure that the date variables required to calculate age or duration (time since last visit, duration of treatment, etc.) are available. These values can easily be calculated using formulas based on the dates; data entry errors are often picked up in this way. Similarly, when including values that were calculated from several variables, include the original variables if the data entry required is not too onerous (e.g.: If the outcome of interest is ‘total score’ which is the total of 5 subtests, include the individual subtest scores in the spreadsheet). Further data entry errors can often be identified by recalculating the final variables. Alternatively, you can enter a function in Excel to calculate the total score for you even as you are entering the data into the spreadsheet. You can then check the total score in the report/clinic file against the one calculated by Excel, and immediately correct any errors identified.

  A B C D E  
1 Test 1 Test 1 Test 3 Total TotalEQ  
2 35 42 54   =sum(A2:C2) Compare the value entered in cell D2 to the calculated value that appears in cell E2 to check for data entry errors

6. Missing values
Do not include any text in cells with missing values. If you need to know why information was missing, create another variable in which to list the different causes for missing data. You should communicate with the statistician as to how they want missing values to be coded.

7. MIxeD CaPiTalS
When entering categorical data, each category must be represented by a single, unique identifier. For example, when entering data for the variable Gender, you might choose to use the two values “Male” and “Female.” If, along the way, you drop the capitals, then your variable will have four distinct categories:  “Male,” “male,” ”Female,”  and “female.” Attempts to run a t-test to compare data across males and females will fail because there are more than two categories. Frequency tables will show four rows. To correct this problem after the fact, choose the “Replace” option from the EDIT menu and expand the menu by selecting the “options” button.  Check off the “Match case” option. You can then replace Male with male, or vice versa across the entire sheet or in single column by selecting that column only.

8. Setting up your data:  categorical variables
You will often know ahead of time all possible values that a categorical variable can take on. For the purpose of most research studies, sex will take on one of only two possible values. However, in some studies data are collected on variables where the number of possible values are unknown, or where subjects can have more than one value, e.g.: medication. If the complete list of medications taken by the subject is typed in as a single variable, this information will not be usable in any analyses. If the information is important, then each possible medication or class of medication (although information is lost by grouping, sometimes this is inevitable if there are too many possible outcomes for the information to be usable otherwise) must be listed as a separate variable, and the value ‘0’ entered when the subject is not taking this medication and ‘1’ when they are. In some cases, new variables may need to be created over the course of the study if a medication comes up that was not included in the original list.

ID Painmed   ID Acetamin Ibuprof ASA
1 0   1 0 0 0
2 3   2 0 0 1
3 2   3 0 1 0
4 1   4 1 0 0
      5 1 1 0

In the table on the left, the single variable PAINMED is used to record the type of pain medication taken by each subject, where each subject has used only one medication type over the period of interest. The value ‘0’ can be assigned to ‘no medication’, the value 1 can then be assigned to acetaminophen, 2 to ibuprofen, 3 to acetylsalicylic acid, etc.  Be sure to include the values in your ‘dictionary’ of variables. In this way, variables can be broken down or regrouped as necessary.

The table on the right should be used when patients may have used more than one type of pain medication over the study period of interest. The value ‘0’ represents ‘no’ and the value ‘1’ represents ‘yes’. Using numbers rather than the words ‘yes’ and ‘no’ reduces typos:  remember, statistical software will treat various versions of the word ‘yes’ as unique different categories (‘Y’, ‘y’, ‘yes’, ‘YES’, ‘Yes’ are 5 different categories). A nice feature of indicator variables (variables with 0 and 1 values) is that their mean corresponds to the proportion of ‘yes’ responses.

When information is obtained from a questionnaire, understanding how the data will ultimately be entered can assist in the design of the questionnaire, streamlining it to simplify data entry. It can be worthwhile to invest in a data management consultation before starting the study to improve the efficiency of the questionnaire and data entry layout.

9. Setting up your data: the LONG versus the WIDE dataset
Imagine a typical dataset, with subjects on each row and variables along the columns. How should data be entered when subjects are seen more than once?  Rather than making your dataset increasingly wide, adding new columns (and more obscure variable names) each time the subject is seen, create a single additional variable called VISIT. This variable records the visit number for each subject, so that a subject who is seen multiple times will have multiple rows in the dataset. For example, subject 23 is seen 3 times. The ID variable on each row is the same:  23. The VISIT variable is recorded as 1 on the first visit, 2 on the second, and so on. The variable HT then records the value of height at each of the 3 visits. It’s easy to extract information on only the first visit for all subjects (if VISIT=1), or to pull out all the data for a single subject (if ID=23) to look at change over time.

ID Visit Vday Vmonth Vyear Ht
23 1 12 2 2004 103
23 2 15 8 2004 105
23 3 7 3 2005 108
24 1 25 5 2003 97

Note to SPSS users: You will need to create a dummy ID variable for SPSS because SPSS wants every dataset to include an identifier for each row. You can create a variable, ‘fakeID’, and simply give it the row number as its value. This variable will keep SPSS happy, and you can proceed to create the real ID variable. Caveat: This will make the statistician happy, but if you later decide to analyze the data in SPSS, it won’t be set up properly for you. If you plan on analyzing your own data at some point, then you’ll have to enter the data using the wide format that SPSS prefers, creating repeat variables like Ht1, Ht2, Ht3. However, get in touch with the statistician who will be analyzing your data to get advice on naming your variables. Repeating the same pattern will allow the statistician to create a macro that will easily create the long dataset if needed.

Similarly, in a study that includes cases and controls, include a single variable, say GROUP, with the value ‘0’ for controls and ‘1’ for cases. If the subjects are matched, an additional variable needs to be included to identify the matched pairs.
 

ID Group Match
12 0 1
13 1 1
14 0 2
15 1 2

There may be times when case control studies are better set up in a ‘wide’ dataset. Check with the statistician who will be analyzing your data to see which format is preferable for your study.

10. More on longitudinal studies
Data from longitudinal studies may be entered more efficiently using multiple datasets. Use one dataset to record demographic information that will not change over the course of the study:  birthdate, date of diagnosis, gender, genetic mutations, etc. When different tests are performed on different schedules, create a separate dataset for each group of tests. You may have one dataset with height and weight measurements collected at every clinic visit (say, every three months), and another dataset with laboratory results that are only obtained once a year. By creating multiple datasets for each of these results, you will avoid creating artificial missing values. This also reduces data entry errors that may arise when birthdates and other demographic variables are entered repeatedly at each clinic visit. All of the datasets MUST include a unique identifier for each patient so that the data can be merged as needed, and longitudinal data MUST include a visit date so that test results and events can be merged in the right order.

11. Double data entry
Whether you are hiring someone to enter the data or doing it yourself, consider doing double data entry.

Double data entry is exactly what it sounds like:  enter all the data, breathe a sigh of relief, and then do it all over again. Even better, find a different person to enter the data the second time!

Both data files can then be imported and compared, to identify all the discrepancies. Using this approach almost guarantees that all data entry errors will be identified. This won’t avoid errors at source (data that are copied incorrectly on the original patient file, or misread because of poor handwriting), but is nevertheless an invaluable tool for limiting the number of errors in the dataset. The investment of time and money pays for itself, particularly when dealing with ongoing databases that will be used in multiple studies.