Nov 06, 2012
 

It’s always nice to have a few tricks for processing files quickly. Removing duplicate lines is straightforward if you sort the file and filter it for unique lines, perhaps with a couple of pipes, but the drawback is that the file ends up completely out of order. That’s fine if the file is a simple list, but if it’s a piece of code, the result is useless. There’s a surprisingly quick and easy way to remove every repeat of a line, keeping only its first occurrence, without sorting.

Here it is:

awk '!x[$0]++' filename.txt

For example, take a file with this text:

# cat test.txt 
abc
def
ghi
abc
xyz
abc
ghi
plq
def

Run the awk command, and this happens:

# awk '!x[$0]++' test.txt 
abc
def
ghi
xyz
plq
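
For contrast, here is a small sketch that runs both approaches on the same sample data. It feeds the lines through a shell variable rather than a file, and the variable names (sample, sorted, deduped) are only for illustration:

```shell
# Same nine lines as test.txt above, held in a variable for a self-contained demo.
sample=$(printf '%s\n' abc def ghi abc xyz abc ghi plq def)

# Sorting with a unique filter removes duplicates but reorders the lines.
sorted=$(printf '%s\n' "$sample" | sort -u)

# The awk filter keeps the first occurrence of each line in its original position.
deduped=$(printf '%s\n' "$sample" | awk '!x[$0]++')

printf 'sort -u:\n%s\n\nawk:\n%s\n' "$sorted" "$deduped"
```

With this input, the sorted version moves plq ahead of xyz, while the awk version keeps xyz before plq, exactly as they first appeared.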

Make a note of this one, because it’s bound to come in handy sooner or later.


Matt Parsons is a freelance Linux specialist who has designed, built and supported Unix and Linux systems in the finance, telecommunications and media industries.

He lives and works in London.

  3 Responses to “Remove duplicate lines in a file without sorting”

  1. Hey, I know this is an old post, but how exactly does this work? I’m trying to learn awk, and even though I can modify this to suit my needs, I’d like to understand what it’s doing. I usually see awk commands followed by curly brackets that say what to do with the text. Does being in quotes make it a sort of filter?

    • Awk can be a bit of a black art, and some tricks like this are not immediately apparent. In fact, it wasn’t until you asked that I actually sat down and thought about how it works.

      It’s simpler than it seems. Basically, you can pass Awk nothing but a condition, and if that condition evaluates as true, Awk will by default print the current input line.

      So if you did something like this:

      $ awk '1' file.txt
      

      The output would just be the contents of the file, since the condition “1” is true for every line.
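
To make the “condition without an action” idea concrete, here is a minimal sketch. The four-line input and the variable names are made up for illustration:

```shell
# Made-up sample input for the demonstration.
input=$(printf '%s\n' one two three four)

# A condition that is always true: every line is printed.
all=$(printf '%s\n' "$input" | awk '1')

# A condition that is always false: nothing is printed.
none=$(printf '%s\n' "$input" | awk '0')

# A condition that depends on the line number: only odd-numbered lines are printed.
odd=$(printf '%s\n' "$input" | awk 'NR % 2 == 1')

# The same filter with the default action written out in curly brackets.
explicit=$(printf '%s\n' "$input" | awk 'NR % 2 == 1 { print $0 }')

printf 'all:\n%s\n\nodd:\n%s\n' "$all" "$odd"
```

The last two commands behave the same way, which is why the curly brackets can be left out when all you want is to print the matching lines.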

      So the de-duplicating script evaluates the expression “!x[$0]++” for each line. Breaking this down:

      • $0 is the entire current line.
      • x[$0] is the element of the hash (associative) array x whose key is the current line. The first time a line is seen, the element doesn’t exist yet, so its value counts as 0.
      • x[$0]++ post-increments that element, so its value goes up by one every time the same line appears again.
      • !x[$0]++ negates the value the element had before the increment, so it is true if x[$0] was 0 and false if it was anything else. The increment only takes effect after that old value has been read.

      So the simple answer is that the expression is true, and the current line is therefore printed, only if the line hasn’t been seen already.
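
The breakdown above can also be written out the long way. This is only a sketch of an equivalent script, not a replacement for the one-liner. The function name dedupe and the array name seen are my own choices:

```shell
# Long-form equivalent of awk '!x[$0]++', wrapped in a shell function.
dedupe() {
    awk '{
        if (seen[$0] == 0) {   # a line not seen before has an unset count, which compares equal to 0
            print $0
        }
        seen[$0]++             # record this occurrence of the line
    }' "$@"
}

# Same sample lines as in the post.
printf '%s\n' abc def ghi abc xyz abc ghi plq def | dedupe
```

Either form keeps one array element per distinct line, so memory use grows with the number of unique lines in the input rather than with the total number of lines.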
