Extracting Metadata from Data Files
This post describes how to extract metadata from data files created by Stata, SAS, and SPSS, using features that are built into each of these statistical software applications.
Background
The data files from the most commonly used statistical software applications (Stata, SAS, and SPSS) contain embedded metadata. The metadata elements include the variable name, type, format, and description. If a variable’s values are codes, the corresponding code list is also stored in the data file. In addition, a code list may contain codes for missing values.
Missing values occur when no data value is stored for the variable in an observation. There are three types of missing values: numeric, special numeric, and character.
- Numeric missing values are represented as a single period (.)
- Special numeric missing values allow for different types of missing data, represented by a single period plus a letter or underscore (.a – .z or ._)
- Character missing values are usually represented as a blank ( )
This applies to all three applications, with one exception: SPSS has no special numeric missing values. Instead, any numeric value can be declared as a missing value (this should be done only for categorical variables). A code that represents a missing value in this way is called a user-defined missing value.
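The three representations above can be told apart with a few string checks. The following is a minimal Python sketch, not part of the syntax files; the function name `classify_missing` and the `is_numeric` flag are hypothetical, and it assumes values arrive as raw text tokens written in lower case.

```python
import re

# Special numeric missing: a period followed by one letter a-z or an underscore.
_SPECIAL = re.compile(r"^\.[a-z_]$")


def classify_missing(token: str, is_numeric: bool) -> str:
    """Return 'numeric', 'special numeric', 'character', or 'not missing'."""
    if is_numeric:
        if token == ".":
            return "numeric"
        if _SPECIAL.match(token):
            return "special numeric"
        return "not missing"
    # Character variables: a blank (empty or whitespace-only) value is missing.
    return "character" if token.strip() == "" else "not missing"


for tok, numeric in [(".", True), (".a", True), ("._", True), ("3.5", True), (" ", False)]:
    print(repr(tok), "->", classify_missing(tok, numeric))
```

User-defined missing values in SPSS cannot be detected this way, because they are ordinary numeric codes that are only marked as missing in the metadata.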
The metadata elements described above are more or less universal for tabular data files. The challenge is getting the metadata out of the proprietary formats. I was therefore asked to write a program in each of the statistical programming languages that extracts the metadata (as well as the data) from the data files. In collaboration with my colleagues, I have designed a simple metadata file format to hold the metadata. The format is defined using Extended Backus–Naur form (EBNF).
The Metadata File
The metadata file consists of nine labels, which are written in capital letters. Each of these labels contains metadata as follows:
- SYSTEMNAVN: The name of the statistical application used to create the data file
- DATAFILNAVN: The name of the data file
- DATAFILBESKRIVELSE: An accurate description of the data file’s content and usage
- NØGLEVARIABEL: The name of the key variable(s)
- REFERENCE: Reference(s) to other data file(s)
- VARIABEL: Variable(s) in the data file. Each variable has a name and format
- VARIABELBESKRIVELSE: A description of the variable’s content and usage
- KODELISTE: Code lists. Each code list must reference a format
- BRUGERKODE: User-defined missing values (SPSS only). The values must appear in a corresponding code list
The following shows an example of a metadata file:
SYSTEMNAVN
SAS

DATAFILNAVN
data_file

DATAFILBESKRIVELSE
This data file contains data from a questionnaire

NØGLEVARIABEL
var_a var_b

REFERENCE
'referenced_data_file "var_d var_e" "var_f var_g"'

VARIABEL
Vid 4.
VkodeNum vkodenuml.
VkodeStr $vkodestrl.
Vdecimal 9.6
Vdato yymmdd10.
Vtid time8.
Vdatotid e8601dt19.
Vtekst $20.

VARIABELBESKRIVELSE
Vid 'Heltal (løbenummer)'
VkodeNum 'Kategorisk variabel (nummerisk)'
VkodeStr 'Kategorisk variabel (streng)'
Vdecimal 'Decimaltal'
Vdato 'En dato'
Vtid 'En tid'
Vdatotid 'En datotid'
Vtekst 'Vilkårlig tekst'

KODELISTE
vkodenuml
'0' 'Mand'
'1' 'Kvinde'
'2' 'Kode ikke anvendt i data'
'U' 'Uoplyst'
'I' 'Irrelevant'
vkodestrl
'AB101' 'Mand'
'AB102' 'Kvinde'
'AB103' 'Kode ikke anvendt i data'
'ABU' 'Uoplyst'
'ABI' 'Irrelevant'

BRUGERKODE
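Because every section starts with one of the nine labels, a metadata file can be split into sections without relying on its line layout. The Python sketch below is only an illustration and does not implement the EBNF definition, which is not reproduced in this post. The helper names `split_sections` and `parse_variables` are hypothetical, and the sketch assumes that a label never occurs as a standalone word inside a description.

```python
import re

LABELS = [
    "SYSTEMNAVN", "DATAFILNAVN", "DATAFILBESKRIVELSE", "NØGLEVARIABEL",
    "REFERENCE", "VARIABELBESKRIVELSE", "VARIABEL", "KODELISTE", "BRUGERKODE",
]

# Match a label only when it stands alone as a whitespace-delimited token,
# so VARIABEL is not found inside VARIABELBESKRIVELSE or NØGLEVARIABEL.
_LABEL_RE = re.compile(r"(?<!\S)(" + "|".join(LABELS) + r")(?!\S)")


def split_sections(text: str) -> dict:
    """Map each label to the raw text that follows it."""
    parts = _LABEL_RE.split(text)
    # parts is [preamble, label1, body1, label2, body2, ...]
    return {label: body.strip() for label, body in zip(parts[1::2], parts[2::2])}


def parse_variables(body: str) -> dict:
    """Turn 'name format name format ...' into {name: format}."""
    tokens = body.split()
    return dict(zip(tokens[0::2], tokens[1::2]))


sample = """SYSTEMNAVN SAS DATAFILNAVN data_file
VARIABEL Vid 4. VkodeNum vkodenuml. Vtekst $20.
VARIABELBESKRIVELSE Vid 'Heltal (løbenummer)'
KODELISTE vkodenuml '0' 'Mand' '1' 'Kvinde'
BRUGERKODE"""

sections = split_sections(sample)
print(sections["SYSTEMNAVN"])
print(parse_variables(sections["VARIABEL"]))
```

Splitting on labels first keeps the per-section parsing simple: VARIABEL is a flat list of name and format pairs, while the quoted descriptions and code lists would need a quote-aware tokenizer.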
The metadata file must have a corresponding delimited data file:
Vid;VcodeNum;VcodeStr;Vdecimal;Vdate;Vtime;Vdatetime;Vtext
1001;0;AB101;23.434458;2016-05-03;13:35:23;2015-05-03T13:35:23;"Oliver;Schabenberger"
1002;1;AB102;23.437752;2017-06-04;14:36:24;2017-06-04T14:36:24;"John T.; ""Smith"
1003;U;ABU;23.444453;2018-07-05;15:37:25;2018-07-05T15:37:25;"Henrik ""Dude"
1004;I;ABI;23.534454;2019-08-06;16:38:26;2019-08-06T16:38:26;"God; Hygge Stund"
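The data file uses a semicolon as the delimiter, wraps text that contains a semicolon or a quote in double quotes, and doubles any quote inside such a field. As a rough check that an exported file follows these rules, it can be read back with Python’s standard `csv` module. This is a sketch only; the function name `read_rows` is hypothetical, and the sample repeats two rows from the example above.

```python
import csv
import io

sample = (
    'Vid;VcodeNum;VcodeStr;Vdecimal;Vdate;Vtime;Vdatetime;Vtext\n'
    '1002;1;AB102;23.437752;2017-06-04;14:36:24;2017-06-04T14:36:24;"John T.; ""Smith"\n'
    '1004;I;ABI;23.534454;2019-08-06;16:38:26;2019-08-06T16:38:26;"God; Hygge Stund"\n'
)


def read_rows(stream):
    """Read semicolon-delimited rows; doubled quotes inside a field are unescaped."""
    reader = csv.DictReader(stream, delimiter=";", quotechar='"')
    return list(reader)


rows = read_rows(io.StringIO(sample))
for row in rows:
    print(row["Vid"], row["VcodeNum"], row["Vtext"])
```

All values come back as strings, so codes such as `U` and `I` in an otherwise numeric column are preserved and can be matched against the code list.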
The Syntax Files
The statistical software applications cannot create the input for all of the labels in the metadata file. However, the content of VARIABEL, VARIABELBESKRIVELSE, KODELISTE, and BRUGERKODE can be extracted from the statistical software applications. In addition, the data can be exported as delimited text.
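To make the target of the extraction concrete, the Python sketch below renders the four extractable sections from plain dictionaries. It is not one of the syntax files, which are written in the statistical programming languages themselves. The function name `render_sections` and the one-entry-per-line layout are assumptions made for illustration.

```python
def render_sections(variables, descriptions, code_lists):
    """Render VARIABEL, VARIABELBESKRIVELSE, KODELISTE and BRUGERKODE as text."""
    variabel = ["VARIABEL"] + [f"{name} {fmt}" for name, fmt in variables.items()]
    beskrivelse = ["VARIABELBESKRIVELSE"] + [
        f"{name} '{text}'" for name, text in descriptions.items()
    ]
    kodeliste = ["KODELISTE"]
    for list_name, codes in code_lists.items():
        kodeliste.append(list_name)
        kodeliste += [f"'{code}' '{label}'" for code, label in codes.items()]
    # Left empty here: user-defined missing values exist only in SPSS.
    brugerkode = ["BRUGERKODE"]
    sections = [variabel, beskrivelse, kodeliste, brugerkode]
    return "\n\n".join("\n".join(section) for section in sections)


text = render_sections(
    variables={"Vid": "4.", "VkodeNum": "vkodenuml."},
    descriptions={
        "Vid": "Heltal (løbenummer)",
        "VkodeNum": "Kategorisk variabel (nummerisk)",
    },
    code_lists={"vkodenuml": {"0": "Mand", "1": "Kvinde", "U": "Uoplyst"}},
)
print(text)
```

The remaining labels (SYSTEMNAVN, DATAFILNAVN, DATAFILBESKRIVELSE, NØGLEVARIABEL, and REFERENCE) describe the data file as a whole and would have to be supplied by hand.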
The syntax files for these programs are available here: