Fix Fixed-Format Data with Vimscript

The main purpose of this post is to share an example of how I use Vimscript to fix fixed-format data. However, I would like to start with a short “need to know” introduction to Vim. Vim exists on multiple platforms (GNU/Linux, OS X, and Windows), which makes it very easy to share functionality.

Background

A couple of years ago I started learning the basics of GNU/Linux, simply because I felt that I needed more control over the operating system from a user’s perspective. I needed a serious text editor and then I found Vim. I LOVE IT! Not only for executing commands, but also for writing a function in Vimscript once in a while when needed.

When it comes to text editing with Vim, I am “all sunshine and rainbows”. I usually search for patterns in simple text (data records) and identify content in certain markup languages (mainly XML). There is one command in particular that I use almost all the time because its formulation is so simple yet extremely powerful. Here it is:

%s/find/replace/gn

The content of “find” is the search pattern, whereas the content of “replace” is the substitution pattern. The first sign “%” is the range and means that the command is applied to all lines. The flag “g” makes the command act on every match in a line instead of only the first one. The last flag “n” means that the command will run “dry” without actually replacing anything. After running the command, the total number of matches and the total number of matching lines will show above the command line. This is very useful for counting matches. For example, to count how many times “try me” occurs, simply type (\c for case-insensitive):

%s/\ctry me//gn

Regex Engine

Any serious text editor needs a good regex engine. I used to work with Notepad++ but the regex engine failed in some use cases (e.g. case sensitivity was out of order in multiple expressions). The default regex engine in Vim has NEVER failed me.

One word of warning though, remember what to escape! The following table shows the special characters that require escaping:

\<	Matches beginning of a word (left word break/boundary)
\>	Matches end of a word (right word break/boundary)
\(…\)	Grouping into an atom
\|	Separating alternatives
\_.	Matches any single character or end-of-line
\+	1 or more of the previous atom (greedy)
\=	0 or one of the previous atom (greedy)
\?	0 or one of the previous atom (greedy)
	Multi-item count match specification (greedy)
\{n,m}	n to m occurrences of the preceding atom (as many as possible)
\{n}	Exactly n occurrences of the preceding atom
\{n,}	At least n occurrences of the preceding atom (as many as possible)
\{,m}	0 to m occurrences of the preceding atom (as many as possible)
\{}	0 or more occurrences of the preceding atom (as many as possible)
	Multi-item count match specification (non-greedy)
\{-n,m}	n to m occurrences of the preceding atom (as few as possible)
\{-n}	Exactly n occurrences of the preceding atom
\{-n,}	At least n occurrences of the preceding atom (as few as possible)
\{-,m}	0 to m occurrences of the preceding atom (as few as possible)
\{-}	0 or more occurrences of the preceding atom (as few as possible)

Vim has options like very magic (\v) to determine what characters have a special meaning. Thus, I sometimes find myself fumbling with escape rules in the old, backtracking engine that supports everything. To give an example, if I want to remove the class attributes in the above html table, this regex with standard escape rules works fine:

%s/\(\sclass="\w*"\)\(>\)/\2/gn

The very magic option works as well, but requires a different set of escape rules:

%s/\v(\sclass\="\w*")(\>)/\2/gn

I always use the default escape option to avoid confusion. The default option is magic (\m), which means that special characters like “.” and “*” need not be escaped.
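
To check that the two patterns above really behave the same, here is a small sketch that runs both through substitute(). The sample markup and the variable names are my own assumptions for illustration, not taken from a real file:

```vim
" Sample markup (an assumption for illustration only)
let sample = '<td class="num">42</td>'

" Default magic mode: grouping parentheses are escaped, > is literal
let magic = substitute(sample, '\(\sclass="\w*"\)\(>\)', '\2', 'g')

" Very magic mode: parentheses group as-is, = and > must be escaped
let verymagic = substitute(sample, '\v(\sclass\="\w*")(\>)', '\2', 'g')

echo magic
" <td>42</td>
echo verymagic
" <td>42</td>
```

Unlike the dry-run commands above, substitute() returns the changed string, so the result can be inspected without touching the buffer.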

Configuration

Vim can easily be configured. Simply change the content of the file .vimrc. I suggest adding the following options as standard:

" Load files
so /vimscripts.vim

" Set session options
set ssop=blank,buffers,curdir,folds,help,options,tabpages,winsize,resize,winpos
let sessionLoad = 1

" Start in normal mode, disable line wrap and swap files, set encoding
set noim noswf nowrap nu enc=utf-8 nobomb fenc=utf-8

" Set end of line
set ff=unix

" Default theme
syntax enable
set background=dark
colorscheme jellybeans

" Set default font size
set guifont=Courier\ New\ 16

" Set working directory to the same as the file being edited
set autochdir

Vim supports sessions, which is very nice for saving different views. The above .vimrc specifies what to save in the session file. I prefer to start in normal mode with line numbers shown but without line wrap. I strongly suggest disabling swap files unless you need the security backup.

Be AWARE that the character encoding in the Vim buffer is specified here in the .vimrc. However, Vim allows the user to specify another input encoding when opening a file, for example “:edit ++enc=iso-8859-1 my_file”. In addition, Vim allows the user to save in another encoding by changing the fenc option before saving, for example “:set fenc=iso-8859-1”.

The above .vimrc also includes a color scheme called Jelly Beans. Download the color scheme and place it in .vim/colors. Last, the working directory is set to the same as the file being edited, which is useful for opening other files in the same directory.

Below I have shown certain configurations that I prefer to use:

" Activate folding
set foldmethod=indent
set foldnestmax=10
set nofoldenable
set foldlevel=2

" Search highlighting
set incsearch hlsearch

" Activate autoindent and smartindent
set ai si

" Start with a maximized window
set lines=1080 columns=1920

" Enable horizontal scroll bar
set guioptions+=b

" Show unprintable characters as a hex number
set dy=uhex

" Start with conventional windows shortcuts
so $VIMRUNTIME/mswin.vim

" Alt 1 to shift tabs
nmap <M-1> gt
imap <M-1> <C-O>gt

" CTRL-F to find & replace
nmap <C-F> :promptrepl<CR>
vmap <C-F> <C-C>:promptrepl<CR>
imap <C-F> <C-O>:promptrepl<CR>

" Use <F3> to clean fixed-format data
nmap <F3> :1,$call FixFixedFormatData()<CR>
imap <F3> <C-O>:1,$call FixFixedFormatData()<CR>

The above configuration sets folding, search highlighting, auto- and smartindent, window size, a horizontal scroll bar (yes, it is not there by default), and the display of unprintable characters as hex numbers. Most importantly, it loads a script that maps the conventional Windows shortcuts (I was raised with Windows computers). The last part shows custom key mappings. Please note that FixFixedFormatData() can be called by pressing the F3 key.

Fix Fixed-Format Data

This leads to the final topic of my post, which is explaining what FixFixedFormatData() is doing. The function reads fixed-format data, for example:

123456
654321

The column width is fixed. In the example above each column has a width of 1, and it can be assumed that each line contains a new record. There are six columns in total, the first row is 123456 and the second row is 654321.
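
To make the idea of fixed column widths concrete, here is a minimal sketch that cuts one record into its columns. The function SplitFixedWidth() and its arguments are hypothetical helpers of mine and are not used by FixFixedFormatData():

```vim
" Hypothetical helper: split one fixed-format record into columns,
" given the width of each column in bytes.
function! SplitFixedWidth(record, widths) abort
	let l:columns = []
	let l:offset = 0
	for l:width in a:widths
		" strpart() takes a byte offset and a byte length
		call add(l:columns, strpart(a:record, l:offset, l:width))
		let l:offset += l:width
	endfor
	return l:columns
endfunction

" Six columns, each with a width of 1
echo SplitFixedWidth('123456', repeat([1], 6))
" ['1', '2', '3', '4', '5', '6']
```

Because the columns are found by position only, a single missing or extra byte shifts every column after it. That is why the cleaning steps below replace characters instead of deleting them.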

The first task for FixFixedFormatData() is to remove control characters because they carry no useful information. However, since the format is fixed, I have created a sub function that replaces control characters with a space:

" Replace control characters with space (utf-8)
function! ReplaceControlCharactersWithSpace() range
	let c1 = 0
	let c2 = 0
	for linenum in range(a:firstline, a:lastline)
		let c1 += 1
		let curr_line = getline(linenum)
		" Line feeds 'x0a' are NOT deleted
		let replacement = substitute(curr_line, '\%x00\|\%x01\|\%x02\|\%x03\|\%x04\|\%x05\|\%x06\|\%x07\|\%x08\|\%x09\|\%x0b\|\%x0c\|\%x0d\|\%x0e\|\%x0f\|\%x10\|\%x11\|\%x12\|\%x13\|\%x14\|\%x15\|\%x16\|\%x17\|\%x18\|\%x19\|\%x1a\|\%x1b\|\%x1c\|\%x1d\|\%x1e\|\%x1f', ' ', 'g')

		if replacement != curr_line
			let c2 += 1
			"echo 'LINE '.c1.' CONTAINS ONE OR MORE CONTROL CHARACTERS'
		endif
		call setline(linenum, replacement)
	endfor
	echo 'CONTROL CHARACTERS WERE REPLACED IN '.c2.'/'.(a:lastline - a:firstline + 1).' LINES'
endfunction
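
The long list of alternatives can probably also be written as a single collection. This is only a sketch under the assumption that the same byte ranges are wanted (0x00 to 0x09 and 0x0b to 0x1f); the function name ControlCharsToSpaces() is hypothetical and works on a string instead of the buffer:

```vim
" Hypothetical helper: replace each control character in a string with a space.
" The collection covers 0x00-0x09 and 0x0b-0x1f, so 0x0a is left alone.
function! ControlCharsToSpaces(text) abort
	return substitute(a:text, '[\x00-\x09\x0b-\x1f]', ' ', 'g')
endfunction

" A tab and an escape character are each replaced by one space
let cleaned = ControlCharsToSpaces("AB\tCD\eEF")
echo cleaned
" AB CD EF
echo strlen(cleaned)
" 8
```

The length of the string is unchanged, which is the whole point when the record length is fixed.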

Removing the control characters is important. However, the most critical task for FixFixedFormatData() is to replace invalid line feeds in the text, meaning those line feeds that are NOT end of lines. The challenge is shown in this example data file:

My name is Henrik <lf>Sejersen<lf>
What is your name? <lf>

This is a fixed-format data file with a record length of 28 bytes. The invalid line feed in the first row (after Henrik) has to be replaced by a space in order to maintain the fixed record length of 28 bytes per line. A new sub function has been created to accomplish this task. The sub function creates a new cleaned data file and prints a short output/validation report:

" Delete invalid line feeds
function! DeleteInvalidLineFeeds() range
	let c1 = 0
	" Add space to the end of all lines
	for linenum in range(a:firstline, a:lastline)
		let curr_line = getline(linenum)
		let replacement = substitute(curr_line, '$', ' ', 'g')
		call setline(linenum, replacement)
	endfor

	" Make list to hold output lines
	let result = []

	" Find longest line in input
	let s:ll = max(map(range(1, line('$')), "col([v:val, '$'])")) - 1

	" Buffer lines are numbered from 1; starting at 0 would count the first record as dirty
	let linenum = 1
	let lastline = line('$')
	while linenum <= lastline
		let curr_line = getline(linenum)
		let slcl = strlen(getline(linenum))

		"echo 'FIRST '.linenum
		if slcl == s:ll
			"echo 'LINE ADDED '.curr_line
			let curr_line = strpart(curr_line, 0, s:ll - 1)
			call add(result, curr_line)
		elseif slcl > s:ll
			echo 'SCRIPT ERROR: THE SCRIPT DID NOT IDENTIFY THE LONGEST LINE'
		else
			"echo 'DIRTY LINE'
			" Stop at the last line so an incomplete final record cannot loop forever
			while slcl < s:ll && linenum < lastline
				let linenum += 1
				let curr_line .= getline(linenum)
				let slcl = strlen(curr_line)
				let curr_line = strpart(curr_line, 0, s:ll - 1)
			endwhile
			let c1 += 1
			"echo 'LINE ADDED'.curr_line
			call add(result, curr_line)
		endif
		let curr_line = ''
		let linenum += 1
	endwhile
	call setline(1, result)
	call writefile(result, 'r_'.bufname("%"))
	let outfile = getcwd().'/'.'r_'.bufname("%")

	" Show output report
	echo 'THE CLEANED FILE IS SAVED AS '.outfile
	echo 'THERE ARE '.c1.' LINES WITH INVALID LINE FEEDS IN THE UNCLEANED FILE'
	let open = 'e! '.outfile
	execute open
	let s:ln = max(map(range(1, line('$')), "col([v:val, '$'])")) - 1
	let s:lo = s:ll - 1
	echo 'THE CLEANED FILE CONTAINS '.line('$').' LINES IN TOTAL'
	echo 'THE LONGEST LINE IN THE UNCLEANED/CLEANED FILE IS '.s:lo.'/'.s:ln.' BYTES'
	" Compare the lengths directly; integer division would also accept unequal lengths
	if s:lo == s:ln
		echo 'THE CLEANING IS APPROVED'
	else
		echo 'THE CLEANED FILE CONTAINS ERRORS AND MUST NOT BE USED FOR ARCHIVING'
	endif
endfunction
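
The core idea of the function above can be sketched on a plain list, without touching the buffer or writing a file. This is a simplified illustration and not a replacement: JoinBrokenRecords() is a hypothetical name, the record length is passed in instead of being detected, and I assume that no record is longer than that length:

```vim
" Hypothetical helper: rebuild fixed-length records from a list of lines.
" a:width is the record length in bytes, not counting the final line feed.
function! JoinBrokenRecords(lines, width) abort
	let l:records = []
	let l:parts = []
	for l:text in a:lines
		call add(l:parts, l:text)
		" Each line feed inside a record becomes one space
		let l:joined = join(l:parts, ' ')
		if strlen(l:joined) >= a:width
			call add(l:records, l:joined)
			let l:parts = []
		endif
	endfor
	" Keep an incomplete last record instead of dropping it silently
	if !empty(l:parts)
		call add(l:records, join(l:parts, ' '))
	endif
	return l:records
endfunction

" The first record is broken after 'Henrik ', the second one is intact
let broken = ['My name is Henrik ', 'Sejersen', 'What is your name?' . repeat(' ', 9)]
let records = JoinBrokenRecords(broken, 27)
echo len(records)
" 2
echo map(copy(records), 'strlen(v:val)')
" [27, 27]
```

Three input lines become two records of equal length, which is the same check the validation report performs on the cleaned file.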

In the end, FixFixedFormatData() simply calls two functions:

function! FixFixedFormatData() range
	1,$call ReplaceControlCharactersWithSpace()
	1,$call DeleteInvalidLineFeeds()
endfunction

Thank you for reading and please feel free to comment below.
