awk

Nov 06 2012
 

It’s always nice to have a bunch of tricks for processing files easily and quickly. It’s fairly straightforward to remove duplicate lines by sorting a file and filtering for unique lines (sort -u, or sort piped into uniq), but this has the drawback of leaving you with a file that is now completely out of order. That would be fine if the file were a list, but if it’s a piece of code, it’s now totally useless. There’s a surprisingly quick and easy way to remove repeated lines of text from a file, keeping the first occurrence of each, without sorting.

Here it is:

awk '!x[$0]++' filename.txt

For example, take a file with this text:

# cat test.txt 
abc
def
ghi
abc
xyz
abc
ghi
plq
def

Run the awk command, and this happens:

# awk '!x[$0]++' test.txt 
abc
def
ghi
xyz
plq

Make a note of this one, because it’s bound to come in handy sooner or later.
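
For anyone wondering why the one-liner works, here is my reading of it, as a sketch rather than anything definitive. With no action given, awk prints a line whenever the pattern is true. x[$0] is an array entry keyed by the whole line, which starts out empty (zero); ++ increments it after its old value has been read, and ! negates that old value. So the pattern is true only the first time a given line is seen. The longhand version below should behave the same way; the function name dedupe and the array name seen are my own choices, not part of the original command.

```shell
# Longhand sketch of: awk '!x[$0]++'
# "dedupe" and "seen" are illustrative names, not from the original one-liner.
dedupe() {
    awk '{
        if (seen[$0] == 0) {   # first time this exact line has appeared
            print $0
        }
        seen[$0]++             # count it, so later copies are skipped
    }' "$@"
}

# Reads stdin when no file is given; prints abc, def, ghi in that order.
printf 'abc\ndef\nabc\nghi\ndef\n' | dedupe
```

Calling it as "dedupe test.txt" should give the same output as the awk command above.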


Matt Parsons is a freelance Linux specialist who has designed, built and supported Unix and Linux systems in the finance, telecommunications and media industries.

He lives and works in London.

Jun 08 2012
 

When chaining together shell commands with pipes ( | ), it’s easy to just turn it into a stream of consciousness: “this” produces “that”, which is the input for the next command. This is fine, but it often breeds bad habits, which lead to sloppiness when these idioms are repeated in scripts.

A good example is when you want to find a string in a file, and then print out a single field of that line. Say, you want to find the userid for a given username in the password file.

The stream of consciousness method leads one to think, “grep will give the line that contains the pattern, and then awk will format the result by stripping out the field I need”. So if we want to find the ID of user “ptolemy”, where the line in our password file looks like this:

ptolemy:x:497:1001:icinga:/home/ptolemy:/bin/bash

we might type this:

  # grep ptolemy /etc/passwd | awk -F: '{print $3}'   
(where -F specifies the field separator, a colon in the /etc/passwd)

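But this isn’t really correct. grep matches “ptolemy” anywhere on the line, so if the string also appears in another account’s entry (in a comment field or a home directory, say), you get more than one line back and the result is ambiguous. Moreover, this is just inelegant.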

The reason is that AWK can already do pattern matching itself, so grep isn’t needed here. You could do this:

  # awk -F: '/ptolemy/ {print $3}' /etc/passwd

Or better yet, AWK can restrict its string comparison to an individual field, rather than the whole line. So a much better command would be this:

  # awk -F: '$1=="ptolemy" {print $3}' /etc/passwd

This returns the third field of the line if and only if the first field is exactly “ptolemy”. And it uses only one command, rather than two.
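
To see the difference between the three commands, here is a small sketch that runs them against sample data instead of the real /etc/passwd. The sample_passwd function and the second account (“backup”) are invented for illustration; only the ptolemy line comes from the example above.

```shell
# Hypothetical sample data: the "backup" entry is made up to show the ambiguity.
sample_passwd() {
    cat <<'EOF'
ptolemy:x:497:1001:icinga:/home/ptolemy:/bin/bash
backup:x:498:1001:backups for ptolemy:/home/backup:/bin/bash
EOF
}

# grep matches the string anywhere on the line: prints 497 and 498.
sample_passwd | grep ptolemy | awk -F: '{print $3}'

# An awk regex on the whole line has the same problem: prints 497 and 498.
sample_passwd | awk -F: '/ptolemy/ {print $3}'

# Exact comparison against the first field only: prints just 497.
sample_passwd | awk -F: '$1 == "ptolemy" {print $3}'
```

Only the last form ignores the “ptolemy” that appears in the other account’s comment field.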



May 10 2012
 

The Linux shell is a very quick, easy and powerful way of processing text. The AWK interpreter in particular is a lightweight text manipulator and can easily be called directly from the command line or a shell script. However, one thing that isn’t immediately obvious is how, within an awk block, to use variables that exist outside it, whether they are environment variables from the command line or shell variables set within the calling script.

An example will probably show more clearly what I mean. The following script is intended to take a number as an argument, and then print every Username which has an ID greater than this. Remember that Bash uses numeric variables like “$1” to represent positional invocation arguments, whereas AWK uses them to denote matched fields.

#!/bin/bash
# greater_users.sh
# Usage: ./greater_users.sh UID

uid=$1
awk -v VAR="$uid" -F: '$3>VAR {print $1}' /etc/passwd

Note that the use of VAR inside the AWK block has no “$” sign – AWK doesn’t define variables like this.

The thing to note is that when AWK is invoked, the -v switch is used to assign the value of an external variable, $uid (which was derived from the greater_users.sh script argument), to an AWK internal variable, which I’ve called VAR. VAR can then be used inside the AWK block, in this case to compare it to the third field of the /etc/passwd file, delimited by the colon (:).
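
If you want to try this without touching the real password file, the same idea can be sketched as a shell function fed with made-up data. The function name users_above and the three sample accounts are my own inventions for the demonstration; the awk invocation itself is the one from the script.

```shell
# users_above is a hypothetical helper; it reads passwd-style lines on stdin.
users_above() {
    # "$1" out here is the shell function's first argument. Inside the
    # single-quoted awk program, $1 and $3 are fields of the current line.
    awk -v VAR="$1" -F: '$3 > VAR {print $1}'
}

# Invented sample accounts with IDs 0, 497 and 65534.
printf '%s\n' \
    'root:x:0:0::/root:/bin/bash' \
    'ptolemy:x:497:1001::/home/ptolemy:/bin/bash' \
    'nobody:x:65534:65534::/:/sbin/nologin' |
    users_above 100
```

With a threshold of 100 this should print ptolemy and nobody, since root’s ID of 0 is not greater than 100.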

The AWK -v switch is a nice bit of glue to help stick your little command line hacks together.

