Sed vs Tail Showdown

Today I was presented with a problem. A colleague of mine did a MySQL dump on a database. The resulting MySQL dump file was 40GB. Problem is, he forgot to tell mysqldump to leave out the drop/create table statements. Here’s the rub:

What is the most efficient way to remove the first 43 lines of text from a 40GB text file?

“Surely there must be a simple *nix utility to do this,” we thought. My first guess was simple:

me: “Just use sed and tell it to delete the first 43 lines. Simple.”

colleague 1: “But won’t the regex engine be very inefficient for that?”

colleague 2: “Meh, I think sed is the right tool for the job.”

Now at this point we have a square-off. Colleague 1 feels that the inclusion of the regex engine is simply too much overhead and is like bringing a bazooka to a knife fight. So they start to look around and I start doing some tests with a 1.5GB MySQL dump from Killoggs. Immediately, I see that sed is trying to create a temp file. If it’s creating a temp file, then it’s going to either create the temp file and rename it, or create the temp file and copy it. Neither of those is an especially good solution: either way the whole file gets rewritten, which means (at least) a lot of CPU utilization and disk I/O, and even after the basic operation is done there is still more to do in the way of a copy, move, etc.

So the next thought is doing the inverse. Rather than trying to delete the data, maybe we simply don’t want to output it. This logic seems to bear more fruit and the result is the following two potential solutions:

  • a) sed -n '44,$p' filename | mysql
  • b) tail -n +44 filename | mysql
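Both commands take the number of the first line to print, not the number of lines to drop, so skipping the first 43 lines means starting at line 44. A minimal sketch to check that arithmetic on a small file, assuming `seq`, `mktemp`, and a POSIX-style `tail` and `sed` are available; the 50-line sample file and the variable names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample file: 50 numbered lines, so line N contains the number N.
sample=$(mktemp)
seq 1 50 > "$sample"

# Ask each tool to start printing at line 44, then keep only the first
# line it emits. If the first 43 lines were skipped, that line is "44".
tail_first=$(tail -n +44 "$sample" | head -n 1)
sed_first=$(sed -n '44,$p' "$sample" | head -n 1)

echo "tail starts at: $tail_first"
echo "sed starts at: $sed_first"

# Clean up the temporary sample file.
rm -f "$sample"
```

Running the same check with 43 in place of 44 shows line 43 still coming through, which is an easy mistake to make on a dump file where one stray statement matters.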

Now, the real question: which is faster? To test this I created a 200MB dummy text file made up of log data and ran the commands shown below repeatedly. The result? sed and tail were about the same in both real and system time; they fluctuated and came out roughly neck and neck. The real key was user time, where sed was consistently more than an order of magnitude slower (roughly 1.4s versus 0.05 to 0.08s). In the end, this led me to concede my initial suggestion and go with tail.

Got thoughts on this? Email me at brian …at… logjamming.com!

[bharrington@berstuk tmp]$ time `tail -n +41 test.log > /tmp/file1.test `

real 0m6.479s
user 0m0.077s
sys 0m1.145s
[bharrington@berstuk tmp]$ time `tail -n +41 test.log > /tmp/file1.test `

real 0m4.454s
user 0m0.061s
sys 0m1.091s
[bharrington@berstuk tmp]$ time `tail -n +41 test.log > /tmp/file1.test `

real 0m4.498s
user 0m0.049s
sys 0m1.058s
[bharrington@berstuk tmp]$ time `sed -n '41,$p' test.log > /tmp/file2.test `

real 0m5.025s
user 0m1.434s
sys 0m1.063s
[bharrington@berstuk tmp]$ time `sed -n '41,$p' test.log > /tmp/file2.test `

real 0m4.862s
user 0m1.351s
sys 0m1.091s
[bharrington@berstuk tmp]$ time `sed -n '41,$p' test.log > /tmp/file2.test `

real 0m4.606s
user 0m1.440s
sys 0m1.124s
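
A timing comparison like this only means something if the two commands produce the same output. The sketch below is one way to confirm that, assuming `awk`, `cmp`, and `mktemp` are available. It does not use the original test.log: the generated file is a hypothetical stand-in, much smaller than the 200MB file used for the timings, and the directory and variable names are made up.

```shell
#!/bin/sh
# Hypothetical stand-in for test.log: 100000 lines of fake log data.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print "line", i, "some log text" }' \
    > "$workdir/test.log"

# The same two commands as in the timings, each writing to its own file.
tail -n +41 "$workdir/test.log" > "$workdir/file1.test"
sed -n '41,$p' "$workdir/test.log" > "$workdir/file2.test"

# cmp -s prints nothing and exits 0 only when the files are byte-identical.
if cmp -s "$workdir/file1.test" "$workdir/file2.test"; then
    same=yes
else
    same=no
fi

# Count the lines that survived; tr strips the padding some wc builds add.
lines_out=$(wc -l < "$workdir/file1.test" | tr -d ' ')
echo "identical: $same, lines kept: $lines_out"

# Clean up the temporary directory.
rm -rf "$workdir"
```

If the two files ever differ, the faster command is not necessarily the right one, so this check is worth running before trusting the numbers.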
