Friday, October 29, 2010

Finding Files On Your Hard Drive
With the Linux Find Command

 
I love the Linux find command. The
find command is used to find files.

Here are some of my favorite things about
the find command:

  1. You can use it to find a file by name
  2. You can use wildcards with it
  3. It automatically descends into folders
    underneath the current folder
  4. It prints out the path to every file
    it finds

The ability of the find command to descend
into directories (folders) is known as recursion.
Each layer of directories found under
the current layer of directories is another layer
of recursion. Recursion is a well-known
technique used by many programmers.

Basically, the find command consists of
4 parts:

  1. The name find
  2. Where to start looking
  3. What files to look for
  4. What to do when you find files

Here's an example:

find . -name abc -print

Here are the 4 parts of the above
find command:

  1. The name of the command is find
  2. Dot is the name of the current directory
  3. We are looking for a file called abc
  4. Once a file called abc is found, the
    find command will print the path to it

Here's what happens in plain English:

We start looking in the current directory
(dot or period) for a file called abc.
We will search every sub-directory
of the current directory. The path to any
file found that is called abc will be printed.
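
Here's a small sketch you can run to watch this
happen. The directory and file names (one, two,
xyz) are made up for the demonstration, and I'm
assuming a system that has the mktemp command:

```shell
# Build a throwaway directory tree to search (names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/one/two"
touch "$demo/abc" "$demo/one/abc" "$demo/one/two/abc" "$demo/one/xyz"

# Start in the top of the tree and look for files called abc.
cd "$demo"
found=$(find . -name abc -print | sort)

# One path per line, for every abc at every level. xyz is not listed.
echo "$found"
```

The sort is only there so the paths come out
in a predictable order.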

Here's the only thing that is tricky about
the find command: It shares wildcards with
the shell. This can be trickier than it
sounds. Let's say I wish to find a file
that starts with the letters abc.
I might type the following command:

find . -name abc* -print

This will probably work. As long as
there are no files in the current
directory whose names start with abc,
all will be well.

However, let's say we have a file called
abcdef in the current directory.
We are now in trouble. We are in trouble
because the shell is going to do file-name
expansion prior to executing
the find command.

Here's what we type:
find . -name abc* -print

Here's how the shell interprets what we
typed:

find . -name abcdef -print

Do you see the problem? The find command
never sees the asterisk. What happens is that
the find command sees abc* only
after it has been expanded to abcdef.
Big difference!
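
If you'd like to see the expansion for yourself,
here's a rough sketch. It assumes a Bourne-style
shell such as bash, and the names sub and abcxyz
are made up for the demonstration. Putting echo
in front of a command shows the arguments as
they look after the shell is done with them:

```shell
# Throwaway directory with one match on top and one in a sub-directory.
demo=$(mktemp -d)
cd "$demo"
mkdir sub
touch abcdef sub/abcxyz

# echo prints the command line after the shell has expanded abc*.
seen=$(echo find . -name abc* -print)
echo "$seen"

# find is handed abcdef, so sub/abcxyz is never reported.
unquoted=$(find . -name abc* -print)
echo "$unquoted"
```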

Of course, there is a way around this and that
is to remove the special meaning of the asterisk
with a backslash. Here's what this would look like:

find . -name abc\* -print

In actual practice, though, using
backslashes on a command line is very clumsy. Most
people use double quotes instead. Here's what double
quotes look like:

find . -name "abc*" -print

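The double quotes remove the special meaning
of the asterisk and the other filename expansion
characters. (A few characters, such as the dollar
sign, keep their special meaning inside double
quotes, but these rarely appear in file names.)
Now we can rest easy and know that our filename
expansion characters will reach the
find command untouched.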
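
Here's a minimal sketch that compares the two
ways of protecting the asterisk. The file names
are made up for the demonstration:

```shell
# Throwaway directory with one match on top and one in a sub-directory.
demo=$(mktemp -d)
cd "$demo"
mkdir sub
touch abcdef sub/abcxyz

# Double quotes: the asterisk reaches find untouched.
quoted=$(find . -name "abc*" -print | sort)
echo "$quoted"

# Backslash: same result, just clumsier to type.
escaped=$(find . -name abc\* -print | sort)
echo "$escaped"
```

Both commands report abcdef and sub/abcxyz,
because find does the wildcard matching itself
in every directory it visits.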

The double quotes are a wonderful habit to get
into. Basically, you can use the double quotes
regardless of whether you are using filename
expansion characters or not. Let's say, for
example, we are looking for a file called
abc.

Here's how I might apply the double quotes:

find . -name "abc" -print

In this case, the double quotes do not
matter. Since there are no filename
expansion characters, the double
quotes serve no purpose.

Here's why I use double quotes anyway:

If you always use double quotes, you
never need to rethink the find command.
It just works no matter what. Rather
than think whether double quotes are needed,
just use them. They don't cost anything
other than 2 keystrokes.

This is more valuable than it might
appear. When you are in the heat of
battle and you are trying to solve
a problem, considering whether or not
to use double quotes is a mental
distraction.

Rather than suffer the distraction, just
use the double quotes. It's not hard to
figure out whether or not you need double
quotes, but why think about it at all?

Ed Abbott

Tuesday, October 12, 2010

The diff Command Under Linux

 
The diff command under Linux
is one of my favorite Unix commands.
It's an ancient command that's still very
useful in a modern world. I call any command
that dates back to the 1970s ancient.

I was recently asked by someone over the
phone how to find the changes made to a
website by a web developer. How do you
find their changes if you have a complete
copy of the website before the changes and
a complete copy of the website after the
changes?

I told him that you need 3 things to do
this:

  1. A complete copy of the website
    before and after
  2. The Unix ls -lt command
  3. The diff command

Start by looking at the complete copy
of the new website. Start in the topmost
directory (folder) of the website and do
this command:

ls -lt

This will give you a list of both files
and directories sorted in timestamp
order. Directories recently modified
need to be investigated further. Files
recently modified need to be noted.

In any case, both files and directories
of recent vintage will rise to the top
of the ls -lt listing.

Using this list, you can easily find things
that have been modified after the web developer
(who made changes) took over.

If a file, make a note. If a directory, look
further.

Keep looking into directories that have been
modified since the new web developer started
working on the site. Once you've found all
the files that have been modified after a certain
date, you are done with ls -lt.
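
Here's a small sketch of what the listing looks
like. The file names and the 2010 date are made
up, and I use touch -t to fake older timestamps
so that only one file looks recently modified:

```shell
# Throwaway stand-in for the top directory of a website.
demo=$(mktemp -d)
cd "$demo"
mkdir images
touch index.html about.html contactus.html

# Back-date everything except contactus.html to January 1, 2010.
touch -t 201001010000 images index.html about.html

# Newest entries are listed first.
ls -lt

# The same ordering without the long format: the first name is the newest.
newest=$(ls -t | head -n 1)
echo "$newest"
```

In this sketch contactus.html rises to the top
because it is the only entry that was not
back-dated.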

This will take less time than it might seem as
web developers typically only modify a few files
on each occasion that they work on a site. For
example, if the web developer only worked on the
Contact Us page, this may be the only file
that was modified. This being the case, you will
find the file relatively quickly.

Next, use the diff command to figure out
what changed on the Contact Us page.

Here's how you might use the diff command
hypothetically:

diff ../old/contactus.html ../new/contactus.html >temp

I've fictionalized the directories where the
old and new Contact Us pages would be
found. Undoubtedly, you will have to do a bit
more typing than I did in my hypothetical example
above to get a diff on the two files.

Notice that I've placed the difference between
the two files in a file called temp. This
is a temporary file that has all the changes.

If the changes are not too extensive, the file
called temp will be quite short. It could
be something as simple as a new phone number or
a new business address.
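
Here's a rough sketch of that situation. The
directory layout and the phone numbers are made
up for the demonstration:

```shell
# Throwaway old and new copies of a one-page website.
demo=$(mktemp -d)
mkdir "$demo/old" "$demo/new" "$demo/work"
printf 'Contact Us\nPhone: 555-0100\n' > "$demo/old/contactus.html"
printf 'Contact Us\nPhone: 555-0199\n' > "$demo/new/contactus.html"

cd "$demo/work"

# diff exits with status 1 when the files differ; that is not an error here.
diff ../old/contactus.html ../new/contactus.html >temp || true

# Lines starting with < come from the old file, lines with > from the new one.
cat temp
```

The file called temp ends up only a few lines
long: a line saying which line changed, the old
phone number, and the new phone number.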

A Contact Us page consists of contact
information so the changes to it would not
necessarily be anything more than a slight
update.

How long would it take me to find all of this
out? Discounting the time it takes me to obtain
the two copies of the website, I'd say maybe
5 minutes.

Here are the steps I would take in that 5 minutes:

  1. Find the most recent timestamp in the old
    copy of the website. In other words, do a
    ls -lt on the old topmost directory. Be
    sure to discount things like server logs and
    other things that are automatically updated
  2. Use the timestamp discovered at the old site
    to determine what is new at the new site
  3. Do the steps given above to discover what
    files are newer than the timestamp discovered
    on the old copy of the website

That, in a nutshell, would be how I would discover
work done recently by a web developer. Here
are some basic principles that are at work here:

  1. In life you generally need a reference point
    if you are to get anywhere. In this case the
    reference point is the file last worked on in
    the old copy of the website
  2. In life, it is helpful to know how far you've
    come since you last referenced where you were. The
    technique of using ls -lt to progressively
    descend directories looking for recent file changes
    to the new copy of the website does this. It tells
    you how things have progressed since the last
    checkpoint
  3. It helps to have a basis of comparison. The
    diff command gives you a wonderful way to
    compare two files looking for changes

Because of their primitive nature, I don't know
of anything that supersedes the old Unix command-line
commands. I've never ever discovered anything that
is quite like them in flexibility, scope, and power.

Of course, it takes a little bit of creativity to
combine and use these commands effectively. If there
is a downside, that would be it. You cannot be half
asleep and use Unix commands effectively. You have
to be a person who does not mind exercising a little
creativity. If you enjoy being creative, Unix command-line
commands may be for you.

One more thing: I've oversimplified things somewhat
to help you find files on a small website. On a
large website you may have problems with the
techniques outlined above.

A directory's timestamp changes only when a file
is added to it, removed from it, or renamed in it.
Editing an existing file in place may leave the
parent directory's timestamp untouched, and the
grandparent directory will not reflect the change
either way. In other words, the topmost directory
of a website does not necessarily reflect the most recently
modified file on the website. That's one problem.

Another problem is if you have to go digging through many
many files and directories. If this is the case, the
technique outlined above could prove difficult or impossible.

In the case of added complexity, you may wish to change
technique somewhat. The following web page describes how
to use the find command to find a file of recent
vintage:

Using the -newer option of the find command

Even though the find command is more efficient
if you absolutely need to know the most recently modified
files on a website, the ls -lt command is still
useful. The ls -lt command is a much quicker and simpler
way to survey the general situation and get a general
take on how recently the website has been modified.
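
For comparison, here's a minimal sketch of the
find -newer approach. The directory layout, file
names, and dates are all made up, and I'm treating
index.html in the old copy as the reference point:

```shell
# Throwaway old and new copies of a website.
demo=$(mktemp -d)
mkdir -p "$demo/old" "$demo/new/pages"

# Reference point: the file last worked on in the old copy.
touch -t 201010010900 "$demo/old/index.html"

# One file in the new copy is older than the reference, one is newer.
touch -t 201009150900 "$demo/new/index.html"
touch -t 201010120900 "$demo/new/pages/contactus.html"

# List regular files in the new copy modified after the reference file.
cd "$demo/new"
changed=$(find . -type f -newer ../old/index.html -print)
echo "$changed"
```

Only pages/contactus.html is reported. The
-type f test keeps directories out of the list,
so there is no need to walk down through them
by hand.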

There's another principle at work here and that's scaling
your solutions to your problems. In life you generally
don't want to use a large-scale solution to solve a small
problem. You typically would not use a backhoe to dig
a hole for a fence post.

So, depending on the scale of what you are looking at,
either find -newer or ls -lt may be the
ideal tool to pick up and use in your situation.

Ed Abbott