Fix Fixed-Format Data with Vimscript
The main purpose of this post is to share an example of how I use Vimscript to fix fixed-format data. First, however, I would like to give a short “need to know” introduction to Vim. Vim runs on multiple platforms (GNU/Linux, OS X, and Windows), which makes it very easy to share functionality between them.
Background
A couple of years ago I started learning the basics of GNU/Linux, simply because I felt that I needed more control over the operating system from a user’s perspective. I needed a serious text editor, and then I found Vim. I LOVE IT! Not only for executing commands, but also for writing a function in Vimscript once in a while when needed.
When it comes to text editing with Vim, I am “all sunshine and rainbows”. I usually search for patterns in simple text (data records) and identify content in certain markup languages (mainly XML). There is one command in particular that I use almost all the time because its formulation is so simple yet extremely powerful. Here it is:
%s/find/replace/gn
The content of “find” is the search pattern, whereas the content of “replace” is the replacement string. The leading “%” is the range and means all lines in the buffer. The flag “g” applies the command to every match on a line rather than only the first one, and the flag “n” makes the command run “dry”: it reports the matches without actually replacing anything. After running the command, the total number of matches and the number of lines containing them are shown on the command line. This is very useful for counting matches. For example, to count how many times “try me” occurs, simply type (\c makes the pattern case-insensitive):
%s/\ctry me//gn
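The same count can also be obtained from a script. The following is only a minimal sketch: the function name CountMatches is my own invention, and it assumes the pattern never matches an empty string.

```vim
" Hypothetical helper: count non-overlapping matches of a pattern in the
" current buffer without changing the text.
function! CountMatches(pattern) abort
  let l:total = 0
  for l:line in getline(1, '$')
    " split() with keepempty=1 returns one more piece than there are matches
    let l:total += len(split(l:line, a:pattern, 1)) - 1
  endfor
  return l:total
endfunction

" Usage: case-insensitive count of "try me" in the whole buffer
echo CountMatches('\ctry me')
```

Unlike the dry-run substitute, this returns a number that can be used in further script logic.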
Regex Engine
Any serious text editor needs a good regex engine. I used to work with Notepad++, but its regex engine failed in some use cases (e.g. case sensitivity did not behave as expected in some expressions). The default regex engine in Vim has NEVER failed me.
One word of warning though: remember what to escape! The following table shows the items that must be written with a backslash under the default escape rules:
\< | Matches beginning of a word (left word break/boundary) |
\> | Matches end of a word (right word break/boundary) |
\(…\) | Grouping into an atom |
\| | Separating alternatives |
\_. | Matches any single character or end-of-line |
\+ | 1 or more of the previous atom (greedy) |
\= | 0 or 1 of the previous atom (greedy) |
\? | 0 or 1 of the previous atom (greedy) |
Multi-item count match specification (greedy) | |
\{n,m} | n to m occurrences of the preceding atom (as many as possible) |
\{n} | Exactly n occurrences of the preceding atom |
\{n,} | At least n occurrences of the preceding atom (as many as possible) |
\{,m} | 0 to m occurrences of the preceding atom (as many as possible) |
\{} | 0 or more occurrences of the preceding atom (as many as possible) |
Multi-item count match specification (non-greedy) | |
\{-n,m} | n to m occurrences of the preceding atom (as few as possible) |
\{-n} | Exactly n occurrences of the preceding atom |
\{-n,} | At least n occurrences of the preceding atom (as few as possible) |
\{-,m} | 0 to m occurrences of the preceding atom (as few as possible) |
\{-} | 0 or more occurrences of the preceding atom (as few as possible) |
Vim has options like very magic (\v) that determine which characters have a special meaning. As a result, I sometimes find myself fumbling with escape rules, even in the old backtracking engine that supports everything. To give an example, if I want to remove the class attributes in the above HTML table, this regex with standard escape rules works fine (drop the trailing “n” to actually perform the replacement):
%s/\(\sclass="\w*"\)\(>\)/\2/gn
The very magic option works as well, but requires a different set of escape rules:
%s/\v(\sclass\="\w*")(\>)/\2/gn
I always use the default escape option to avoid confusion. The default option is magic (\m), in which characters like “.” and “*” have their special meaning without a backslash, while grouping, alternation, and most multi-item counts need one.
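The difference between the two escape styles can be checked on a single string with substitute(). This is only a sketch: the sample tag below is made up for illustration and is not taken from the table on this page.

```vim
" Hypothetical sample tag, used only for this illustration
let tag = '<td class="odd">'

" magic (default): groups need backslashes, = and > are literal
echo substitute(tag, '\(\sclass="\w*"\)\(>\)', '\2', '')

" very magic: groups are bare, literal = and > need backslashes
echo substitute(tag, '\v(\sclass\="\w*")(\>)', '\2', '')
```

Both calls print <td>, so the two patterns are equivalent and only the escaping differs.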
Configuration
Vim can easily be configured. Simply change the content of the file .vimrc. I suggest adding the following options as standard:
" Load files so /vimscripts.vim " Set session options set ssop=blank,buffers,curdir,folds,help,options,tabpages,winsize,resize,winpos let sessionLoad = 1 " Start in normal mode, disable line wrap and swap files, set encoding set noim noswf nowrap nu enc=utf-8 nobomb fenc=utf-8 " Set end of line set ff=unix " Default theme syntax enable set background=dark colorscheme jellybeans " Set default font size set guifont=Courier\ New\ 16 " Set working directory to the same as the file being edited set autochdir
Vim supports sessions which is very nice for saving different views. The above .vimrc specifies what to save in the session file. I prefer to start in normal mode with line numbers shown but without line wrap. I strongly suggest disabling swap files unless you need the security backup.
Be AWARE that the character encoding used in the Vim buffer is specified here in the .vimrc. However, Vim allows the user to specify another input encoding when opening a file, for example “:edit ++enc=iso-8859-1 my_file”. In addition, Vim also allows the user to save in another encoding by changing the fenc option before saving, for example “:set fenc=iso-8859-1”.
The above .vimrc also includes a color scheme called Jelly Beans. Download the color scheme and place it in .vim/colors. Last, the working directory is set to the same as the file being edited, which is useful for opening other files in the same directory.
Below I have shown certain configurations that I prefer to use:
" Activate folding set foldmethod=indent set foldnestmax=10 set nofoldenable set foldlevel=2 " Search highlighting set incsearch hlsearch " Activate autoindent and smartindent set ai si " Start with a maximized window set lines=1080 columns=1920 " Enable horizontal scroll bar set guioptions+=b " Show unprintable characters as a hex number set dy=uhex " Start with conventional windows shortcuts so $VIMRUNTIME/mswin.vim " Alt 1 to shift tabs nmap <M-1> gt imap <M-1> <C-O>gt " CTRL-F to find & replace nmap <C-F> :promptrepl<CR> vmap <C-F> <C-C>:promptrepl<CR> imap <C-F> <C-O>:promptrepl<CR> " Use <F3> to clean fixed-format data nmap <F3> :1,$call FixFixedFormatData()<CR> imap <F3> <C-O>:1,$call FixFixedFormatData()<CR>
The above configuration sets folding, search highlighting, auto- and smartindent, windows size, horizontal scroll bar (yes, it is not there per default), unprintable characters as a hex number, and most importantly loads a script with keymapping to conventional windows shortcuts (I was raised with windows computers). The last part shows custom keymapping. Please note that FixFixedFormatData() can be called by pressing the F3 key.
Fix Fixed-Format Data
This leads to the final topic of my post, which is explaining what FixFixedFormatData() is doing. The function reads fixed-format data, for example:
123456
654321
The column width is fixed. In the example above each column has a width of 1, and it can be assumed that each line contains a new record. There are six columns in total, the first row is 123456 and the second row is 654321.
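To make the idea of fixed columns concrete, a record can be cut into its columns with strpart(). This is a minimal sketch and not part of FixFixedFormatData(); the helper name SplitFixedRecord and its widths argument are assumptions for illustration.

```vim
" Hypothetical helper: cut one fixed-width record into its columns.
" widths is a list of column widths in bytes, e.g. [1, 1, 1, 1, 1, 1].
function! SplitFixedRecord(record, widths) abort
  let l:fields = []
  let l:start = 0
  for l:width in a:widths
    " strpart() takes a byte offset and a byte length
    call add(l:fields, strpart(a:record, l:start, l:width))
    let l:start += l:width
  endfor
  return l:fields
endfunction

" The first example row, six columns of width 1
echo SplitFixedRecord('123456', repeat([1], 6))
```

For the first row this prints ['1', '2', '3', '4', '5', '6']. A single missing or extra byte shifts every following column, which is why the record length has to be repaired before anything else.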
The first task for FixFixedFormatData() is to remove control characters because they carry no useful information. However, since the format is fixed, I have created a sub function that replaces control characters with a space:
" Replace control characters with space (utf-8) function! ReplaceControlCharactersWithSpace() range let c1 = 0 let c2 = 0 for linenum in range(a:firstline, a:lastline) let c1 += 1 let curr_line = getline(linenum) " Line feeds 'x0a' are NOT deleted let replacement = substitute(curr_line, '\%x00\|\%x01\|\%x02\|\%x03\|\%x04\|\%x05\|\%x06\|\%x07\|\%x08\|\%x09\|\%x0b\|\%x0c\|\%x0d\|\%x0e\|\%x0f\|\%x10\|\%x11\|\%x12\|\%x13\|\%x14\|\%x15\|\%x16\|\%x17\|\%x18\|\%x19\|\%x1a\|\%x1b\|\%x1c\|\%x1d\|\%x1e\|\%x1f', ' ', 'g') if replacement != curr_line let c2 += 1 "echo 'LINJE '.c1.' INDHOLDER EN ELLER FLERE KONTROLTEGN' endif call setline(linenum, replacement) endfor echo 'DER ER SLETTET KONTROLTEGN I '.c2.'/'.(a:lastline - a:firstline + 1).' LINJER' endfunction
Removing the control characters is important. However, the most critical task for FixFixedFormatData() is to replace invalid line feeds in the text, meaning those line feeds that are NOT end of lines. The challenge is shown in this example data file:
My name is Henrik <lf>Sejersen<lf>
What is your name? <lf>
This is a fixed-format data file with a record length of 28 bytes. The invalid line feed in the first row (after Henrik) has to be replaced by a space in order to maintain the fixed record length of 28 bytes per line. A second sub function accomplishes this task. It creates a new, cleaned data file and prints a short output/validation report:
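The core idea can be sketched on a plain list of strings, independent of the buffer. This is a simplified illustration and not the function from this post: the name JoinBrokenRecords is made up, and the record length is passed in explicitly instead of being derived from the longest line.

```vim
" Hypothetical helper: rebuild fixed-length records from a list of lines.
" A line that is too short is joined with the following lines, and each
" invalid line feed is replaced by one space.
function! JoinBrokenRecords(lines, width) abort
  let l:result = []
  let l:i = 0
  while l:i < len(a:lines)
    let l:record = a:lines[l:i]
    " Keep appending the next line while the record is still too short
    while strlen(l:record) < a:width && l:i + 1 < len(a:lines)
      let l:i += 1
      let l:record .= ' ' . a:lines[l:i]
    endwhile
    call add(l:result, l:record)
    let l:i += 1
  endwhile
  return l:result
endfunction

" 'AB' + line feed + 'DEF' is one broken 6-byte record
echo JoinBrokenRecords(['AB', 'DEF', 'GHIJKL'], 6)
```

The example prints ['AB DEF', 'GHIJKL']: the broken record is restored to 6 bytes and the intact record is passed through unchanged.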
" Delete invalid line feeds function! DeleteInvalidLineFeeds() range let c1 = 0 " Add space to the end of all lines for linenum in range(a:firstline, a:lastline) let curr_line = getline(linenum) let replacement = substitute(curr_line, '$', ' ', 'g') call setline(linenum, replacement) endfor " Make list to hold output lines let result = [] " Find longest line in input let s:ll = max(map(range(1, line('$')), "col([v:val, '$'])")) - 1 let linenum = 0 let lastline = line('$') while linenum <= lastline let curr_line = getline(linenum) let slcl = strlen(getline(linenum)) "echo 'FIRST '.linenum if slcl == s:ll "echo 'LINE ADDED '.curr_line let curr_line = strpart(curr_line, 0, s:ll - 1) call add(result, curr_line) elseif slcl > s:ll echo 'SCRIPT ERROR: SCRIPTET HAR IKKE IDENTIFICERET DEN LÆNGSTE LINJE' else "echo 'DIRTY LINE' while slcl < s:ll let linenum += 1 let curr_line .= getline(linenum) let slcl = strlen(curr_line) let curr_line = strpart(curr_line, 0, s:ll - 1) endwhile let c1 += 1 "echo 'LINE ADDED'.curr_line call add(result, curr_line) endif let curr_line = '' let linenum += 1 endwhile call setline(1, result) call writefile(result, 'r_'.bufname("%")) let outfile = getcwd().'/'.'r_'.bufname("%") " Show output report echo 'DEN RENSEDE FIL ER GEMT SOM '.outfile echo 'DER ER '.c1.' LINJER MED UGYLDIGE LINE FEEDS I DEN URENSEDE FIL' let open = 'e! '.outfile execute open let s:ln = max(map(range(1, line('$')), "col([v:val, '$'])")) - 1 let s:lo = s:ll - 1 let valid = s:lo/s:ln echo 'DEN RENSEDE FIL INDEHOLDER I ALT '.line('$').' LINJER' echo 'DEN LÆNGSTE LINJE I DEN URENSEDE/RENSEDE FIL ER PÅ '.s:lo.'/'.s:ln.' BYTES' if valid == 1 echo 'RENSNINGEN ER GODKENDT' elseif valid != 1 echo 'DEN RENSEDE FIL INDEHOLDER FEJL OG MÅ IKKE ANVENDES TIL ARKIVERING' endif endfunction
In the end, FixFixedFormatData() simply calls two functions:
function! FixFixedFormatData() range
  1,$call ReplaceControlCharactersWithSpace()
  1,$call DeleteInvalidLineFeeds()
endfunction
Thank you for reading and please feel free to comment below.