Fix Fixed-Format Data with Vimscript

The main purpose of this post is to share an example of how I use Vimscript to fix fixed-format data. However, I would like to start with a short “need to know” introduction to Vim. Vim runs on multiple platforms (GNU/Linux, OS X, and Windows), which makes it very easy to share functionality.


Background

A couple of years ago I started learning the basics of GNU/Linux, simply because I felt that I needed more control over the operating system from a user’s perspective. I needed a serious text editor and then I found Vim. I LOVE IT! Not only for executing commands, but also for writing a function in Vimscript once in a while when needed.

When it comes to text editing with Vim, I am “all sunshine and rainbows”. I usually search for patterns in simple text (data records) and identify content in certain markup languages (mainly XML). There is one command in particular that I use almost all the time because its formulation is so simple yet extremely powerful. Here it is:
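Going by the description that follows, this is the substitute command in roughly the following shape. “find” and “replace” are placeholders, and the exact flags are an assumption:

```vim
" Dry run: report the matches of "find" without changing the buffer
:%s/find/replace/gn
```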

The content of “find” is the search pattern, whereas the content of “replace” is the substitution pattern. The leading “%” is the range and means all lines in the buffer. The last letter “n” means that the command will run “dry” without actually replacing anything. After running the command, the total number of matches and the total number of matching lines are shown above the command line. This is very useful for counting matches, for example, to count how many times “try me” occurs simply type (\c for case-insensitive):
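A minimal sketch of such a count, assuming the search text is exactly “try me”. The replacement part can stay empty because the “n” flag ignores it:

```vim
" Count every occurrence of "try me", ignoring case; nothing is replaced
:%s/\ctry me//gn
```

Vim answers with a message of the form “N matches on M lines”.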

Regex Engine

Any serious text editor needs a good regex engine. I used to work with Notepad++, but its regex engine failed in some use cases (for example, case sensitivity did not work properly in several expressions). The default regex engine in Vim has NEVER failed me.

One word of warning though, remember what to escape! The following table shows the special characters that require escaping:

Pattern    Meaning
\<         Matches the beginning of a word (left word boundary)
\>         Matches the end of a word (right word boundary)
\(…\)      Groups a pattern into an atom
\|         Separates alternatives
\_.        Matches any single character or end-of-line
\+         1 or more of the preceding atom (greedy)
\=         0 or 1 of the preceding atom (greedy)
\?         0 or 1 of the preceding atom (greedy)

Multi-item count match specification (greedy)
\{n,m}     n to m occurrences of the preceding atom (as many as possible)
\{n}       Exactly n occurrences of the preceding atom
\{n,}      At least n occurrences of the preceding atom (as many as possible)
\{,m}      0 to m occurrences of the preceding atom (as many as possible)
\{}        0 or more occurrences of the preceding atom (as many as possible)

Multi-item count match specification (non-greedy)
\{-n,m}    n to m occurrences of the preceding atom (as few as possible)
\{-n}      Exactly n occurrences of the preceding atom
\{-n,}     At least n occurrences of the preceding atom (as few as possible)
\{-,m}     0 to m occurrences of the preceding atom (as few as possible)
\{-}       0 or more occurrences of the preceding atom (as few as possible)

Vim has switches like very magic (\v) that determine which characters have a special meaning. As a result, I sometimes find myself fumbling with escape rules in the old, backtracking engine that supports everything. To give an example, if I want to remove the class attributes in the above HTML table, this regex with standard escape rules works fine:
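As a sketch with the default (magic) escape rules, assuming double-quoted attributes of the form class="...", such a substitution could look like this:

```vim
" magic (default): + needs a backslash to mean "1 or more", = is literal
:%s/\s\+class="[^"]*"//g
```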

The very magic option works as well, but requires a different set of escape rules:
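A very magic sketch of such a substitution, assuming double-quoted attributes of the form class="...". Here “+” is special without a backslash, while “=” must be escaped, because a bare “=” means “0 or 1 of the preceding atom” in very magic mode:

```vim
" very magic: + is special on its own, and a literal = is written as \=
:%s/\v\s+class\="[^"]*"//g
```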

I always use the default escape option to avoid confusion. The default option is magic (\m), which means that characters like “.” and “*” are special without being escaped.

Configuration

Vim can easily be configured. Simply change the content of the file .vimrc. I suggest adding the following options as standard:
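A sketch of what such a basic .vimrc could contain, based on the options discussed in this section. Every concrete value (the encoding, the list of session options) is an assumption to adjust to taste:

```vim
" ~/.vimrc (sketch; all values are assumptions)
set encoding=utf-8          " character encoding used inside Vim
set sessionoptions=buffers,curdir,folds,tabpages,winsize  " what :mksession saves
set noinsertmode            " start in normal mode
set number                  " show line numbers
set nowrap                  " do not wrap long lines
set noswapfile              " no swap files, which also means no crash recovery
set autochdir               " working directory follows the file being edited

" Requires jellybeans.vim in ~/.vim/colors; silent! skips the error if missing
silent! colorscheme jellybeans
```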

Vim supports sessions, which is very nice for saving different views. The above .vimrc specifies what to save in the session file. I prefer to start in normal mode with line numbers shown but without line wrap. I strongly suggest disabling swap files unless you need them for crash recovery.

Be AWARE that the character encoding in the Vim buffer is specified here in the .vimrc. However, Vim allows the user to specify another input encoding when opening a file, for example “:edit ++enc=iso-8859-1 my_file”. In addition, Vim also allows the user to save in another encoding by changing the fenc option before saving, for example “:set fenc=iso-8859-1”.

The above .vimrc also includes a color scheme called Jelly Beans. Download the color scheme and place it in .vim/colors. Last, the working directory is set to the same as the file being edited, which is useful for opening other files in the same directory.

Below I have shown certain configurations that I prefer to use:
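A sketch of such a configuration, covering the settings described in this section. The concrete values (fold method, window size) are assumptions, and the F3 mapping expects a function named FixFixedFormatData() to be defined elsewhere:

```vim
" Sketch; concrete values are assumptions
set foldmethod=indent       " folding based on indentation
set hlsearch                " highlight search matches
set autoindent
set smartindent
set display+=uhex           " show unprintable characters as <xx> hex numbers

if has('gui_running')
  set lines=40 columns=120  " initial window size
  set guioptions+=b         " bottom (horizontal) scroll bar
endif

" Conventional Windows shortcuts such as Ctrl-C and Ctrl-V, shipped with Vim
source $VIMRUNTIME/mswin.vim

" Custom key mapping: F3 runs the clean-up function
nnoremap <F3> :call FixFixedFormatData()<CR>
```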

The above configuration sets folding, search highlighting, auto- and smartindent, window size, a horizontal scroll bar (yes, it is not there by default), the display of unprintable characters as hex numbers, and most importantly loads a script that maps keys to conventional Windows shortcuts (I was raised with Windows computers). The last part shows custom key mappings. Please note that FixFixedFormatData() can be called by pressing the F3 key.

Fix Fixed-Format Data

This leads to the final topic of my post, which is explaining what FixFixedFormatData() is doing. The function reads fixed-format data, for example:

123456
654321

The column width is fixed. In the example above each column has a width of 1, and it can be assumed that each line contains a new record. There are six columns in total, the first row is 123456 and the second row is 654321.

The first task for FixFixedFormatData() is to get rid of control characters because they carry no useful information. However, since the format is fixed, simply deleting them would shift the columns that follow, so I have created a sub function that replaces each control character with a space:
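A minimal sketch of such a sub function. The names ReplaceControlChars() and ReplaceControlCharsInBuffer() are hypothetical, and the sketch assumes that the POSIX class [:cntrl:] covers the characters to be cleaned:

```vim
" Hypothetical helper: return a:text with every control character replaced
" by one space, so the length of the text does not change.
function! ReplaceControlChars(text) abort
  return substitute(a:text, '[[:cntrl:]]', ' ', 'g')
endfunction

" Apply the helper to every line of the current buffer.
function! ReplaceControlCharsInBuffer() abort
  call setline(1, map(getline(1, '$'), 'ReplaceControlChars(v:val)'))
endfunction
```

Because each control character becomes exactly one space, the column positions stay where they were.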

Removing the control characters is important. However, the most critical task for FixFixedFormatData() is to replace invalid line feeds in the text, meaning those line feeds that do NOT mark the end of a record. The challenge is shown in this example data file:

My name is Henrik <lf>Sejersen<lf>
What is your name? <lf>

This is a fixed-format data file with a record length of 28 bytes. The invalid line feed in the first row (after Henrik) has to be replaced by a space in order to maintain the fixed record length of 28 bytes per line. A new sub function has been created to accomplish this task. The sub function creates a new cleaned data file and prints a short output/validation report:
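A sketch of how such a sub function could work. The names JoinBrokenRecords() and FixLineFeedsInBuffer(), the “.fixed” output suffix, and the wording of the report are all hypothetical. The width argument is assumed to be the record length without the terminating line feed:

```vim
" Hypothetical helper: merge physical lines into records of a:width bytes.
" A line that ends before the record is full was cut by an invalid line
" feed, so it is glued to the next line with one space in between.
function! JoinBrokenRecords(lines, width) abort
  let l:records = []
  let l:current = ''
  let l:open = 0
  for l:line in a:lines
    if l:open
      let l:current = l:current . ' ' . l:line
    else
      let l:current = l:line
    endif
    let l:open = strlen(l:current) < a:width
    if !l:open
      call add(l:records, l:current)
    endif
  endfor
  " Keep an unfinished record at the end of the data so nothing is lost
  if l:open
    call add(l:records, l:current)
  endif
  return l:records
endfunction

" Write the cleaned records to a new file and print a short report.
function! FixLineFeedsInBuffer(width) abort
  let l:lines = getline(1, '$')
  let l:records = JoinBrokenRecords(l:lines, a:width)
  let l:odd = len(filter(copy(l:records), 'strlen(v:val) != a:width'))
  call writefile(l:records, expand('%:p') . '.fixed')
  let l:msg = '%d lines read, %d records written, %d with unexpected length'
  echo printf(l:msg, len(l:lines), len(l:records), l:odd)
endfunction
```

The report counts records whose length still differs from the expected width, which gives a quick check on whether the chosen width fits the data.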

So in the end, FixFixedFormatData() simply calls two other functions:
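A sketch of that top-level function. The two helper names are hypothetical and must be defined before the function is called, and the width assumes 28-byte records that include the terminating line feed:

```vim
" Sketch of the top-level function; both helper names are hypothetical.
function! FixFixedFormatData() abort
  " Assumption: 28 bytes per record, including the terminating line feed
  let l:record_bytes = 28
  " Step 1: replace control characters with spaces
  call ReplaceControlCharsInBuffer()
  " Step 2: repair invalid line feeds and write the cleaned file
  call FixLineFeedsInBuffer(l:record_bytes - 1)
endfunction
```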

Thank you for reading and please feel free to comment below.

 
