Arithmetic in Emacs Regular Expression Search and Replace

Problem

You have a large number of files on which you want to do a regex search and replace, and you want the replacement string to be an arithmetic expression of part of the regex match. As a concrete example, say a few files contain times in a particular format and you need to convert them: for instance, “PT30M” becomes “1800” (30 * 60 seconds).

Solution

You can use Emacs regex search and replace with simple arithmetic expressions. The key is to prefix the arithmetic expression with \, which tells Emacs that what follows is a Lisp expression rather than a literal string.

For instance, to do the aforementioned conversion, type M-x replace-regexp. Then type PT\([0-9]+\)M RET \,(* 60 \#1) RET

In the above expression, you need to indicate that the regex group is numeric by writing \#1 instead of just \1. With \1 the group is passed as a string, and multiplying a string gives a type error.
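To check the arithmetic outside Emacs, here is a minimal Python sketch of the same substitution. It is not part of the Emacs workflow; the function name is made up for illustration, and int() plays the role that \#1 plays in the Emacs replacement.

```python
import re


def minutes_to_seconds(text):
    # Replace every PT<n>M token with n * 60.
    # int() converts the captured digits to a number first, which is
    # the same job the numeric group reference does in Emacs.
    return re.sub(r"PT([0-9]+)M", lambda m: str(60 * int(m.group(1))), text)


print(minutes_to_seconds("start=PT30M end=PT45M"))
```

Running it on a few sample strings is a cheap way to confirm the pattern matches what you expect before replacing across real files.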

If you have a large number of files on which you want to do the above search and replace, you can use dired. Type M-x dired, mark the desired files by typing m, then type Q. This runs a query-replace-regexp on each marked file.
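If you would rather script the multi-file step, the same idea can be sketched in Python. This is only an illustration, not what dired does: the helper names are hypothetical, and unlike Q in dired it rewrites files without asking for confirmation, so try it on copies first.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"PT([0-9]+)M")


def convert_file(path):
    # Rewrite one file in place and return how many replacements were made.
    original = path.read_text()
    updated, count = PATTERN.subn(lambda m: str(60 * int(m.group(1))), original)
    if count:
        path.write_text(updated)
    return count


def convert_files(paths):
    # Apply the conversion to each path, roughly like marking files in dired.
    return {str(p): convert_file(Path(p)) for p in paths}


# Small demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("timeout=PT30M\n")
    print(convert_files([sample]))
    print(sample.read_text())
```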

Amazon CloudSearch Analytics

Amazon recently announced the availability of Analytics features for Amazon CloudSearch. I spent the last few months on this project (along with my awesome colleagues at A9), helping build this feature from the ground up, so I feel very happy to see it out there.

The Analytics features provide CloudSearch customers with insight into the search activity in their search domains. Some of the metrics provided are:

Search Trends: The time-series metrics of the number of searches and number of searches which yielded no results.

Search_Metrics

Top Searches: The most frequent queries and the most frequent queries that produced no search results.

Top_Searches

Top Documents: The documents most frequently surfaced in search results.

Top_Documents

So what are the practical use cases for these metrics? Here are some:

  • Search Trends give you high-level information about the footprint of the search domain over time. One of the cool features of CloudSearch is autoscaling: the search fleet is automatically scaled up or down to keep up with the demands of search traffic or data volume.
  • Top Searches give you a flavor of what your customers are searching for in your application.
  • Top No-Result Searches give an indication of the recall of the search system. Maybe you have documents which match barbecue grill, but your customers are searching for bbq grill. In this case, you could configure bbq as a synonym for barbecue. In some cases, no-result searches could point to a lack of inventory. For instance, if you’re a content site, the no-result searches represent what your customers are looking for on your site but you don’t have content for.
  • The Top Documents report gives you the opportunity to see if irrelevant documents are ranked high in the search results. It could be an indication that the rank functions need to be tuned to provide more relevant search results.
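To make the metrics above concrete, here is a small Python sketch that computes top searches and top no-result searches from a query log. The log format (query, result count) and the function name are assumptions made up for illustration; this is not how CloudSearch computes or exposes its reports.

```python
from collections import Counter


def summarize_searches(log, top_n=3):
    # log is an iterable of (query, result_count) pairs (a hypothetical format).
    all_queries = Counter()
    no_result_queries = Counter()
    for query, result_count in log:
        normalized = query.strip().lower()
        all_queries[normalized] += 1
        if result_count == 0:
            no_result_queries[normalized] += 1
    return {
        "searches": sum(all_queries.values()),
        "no_result_searches": sum(no_result_queries.values()),
        "top_searches": all_queries.most_common(top_n),
        "top_no_result_searches": no_result_queries.most_common(top_n),
    }


log = [
    ("barbecue grill", 12),
    ("bbq grill", 0),
    ("BBQ grill", 0),
    ("patio heater", 4),
    ("barbecue grill", 12),
]
summary = summarize_searches(log)
print(summary["top_no_result_searches"])
```

In this toy log, "bbq grill" is the top no-result query, which is the kind of signal that suggests adding a synonym.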

Installing GNU Octave on Mac OS X

  • Install Homebrew, the best package manager for OS X. Follow the instructions at the Homebrew site.
  • Install Xcode (a 1.6 GB download) and the Xcode command line tools from the Apple developer site or the Mac App Store. You need to install the command line tools even though Xcode is supposed to be a superset, because I ran into a bug in Brew that kept it from working properly.
  • $ brew install octave # Brew will download all dependencies and install Octave. Should take an hour.

Now Octave should be working, but the plotting functions will not work yet. For that, you need to install gnuplot separately.

$ brew install gnuplot

Even after setting up gnuplot, you may get the below error when you run a plot function:

octave:4> plot(k, x)

gnuplot> set terminal aqua enhanced title "Figure 1" size[..]"
^
line 0: unknown or ambiguous terminal type; type just 'set terminal' for a list

For this, the workaround is to add the below line to ~/.octaverc

$ cat >> ~/.octaverc
setenv GNUTERM x11

Now Octave should be able to plot correctly.

Aloha from Hawaii

We just got back from a vacation to Maui, Hawaii. It was a 5-hour flight from San Francisco. We had a great time there, especially since Hawaii felt a lot like Kerala, our home state in India. There were a lot of similarities: warm weather, rain, the landscape and the vegetation. They grow pineapples, coconuts, bananas, etc. Taro (known as Chembu in Kerala) is also a staple food here. They have a saying that anyone who says they like taro has to be either a Hawaiian or a liar.

Black-sand beach in Hana.

Hump-back Whale

Food at the Luau

Clouds over Haleakala volcano

OpenCV Performance and Threads

If you use the OpenCV library, be aware that it spawns threads for image processing. I found this while investigating a performance issue in a web application I was working on. It turns out the default number of threads is equal to the number of CPU cores. So on my dual quad-core box, it was spawning 8 threads per web server process, resulting in poor throughput while serving concurrent requests. This default behavior of OpenCV is probably targeted at desktop applications, where it makes sense to use all the available CPU cores. The performance problem arose because even under 5 rps there were 40 threads, all competing for the CPU, so the cost of context switching was significant. In any case, creating threads on the fly per request is not a good idea for a server-side application, and it’s not going to scale for high-traffic systems.

Explicitly setting the number of threads to 1 improved the throughput and latency of my application several times over. Not bad for a one-line code change. Have a look at cv::setNumThreads() if you are using the C++ library, cvSetNumThreads() if you are using the C API, and cv2.setNumThreads() if you are using the cv2 Python bindings.
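As a rough sketch of the one-line change, assuming the cv2 Python bindings are installed: cv2.setNumThreads() and cv2.getNumThreads() are real OpenCV calls, while the two helper functions are hypothetical names used only for illustration. The thread arithmetic assumes, as the figures above imply, about 5 requests in flight at once.

```python
def limit_opencv_threads(num_threads=1):
    # Cap OpenCV's internal parallelism for the current process.
    # cv2 is imported lazily so this file loads even without OpenCV installed.
    import cv2

    cv2.setNumThreads(num_threads)
    # The value reported back depends on the parallel backend OpenCV was built with.
    return cv2.getNumThreads()


def threads_in_flight(concurrent_requests, threads_per_request):
    # Back-of-the-envelope count of threads competing for the CPU.
    return concurrent_requests * threads_per_request


print(threads_in_flight(5, 8))  # default on an 8-core box
print(threads_in_flight(5, 1))  # after limiting OpenCV to one thread
```

In a web application, limit_opencv_threads(1) would typically be called once when each worker process starts, before any image processing happens.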

Yahoo! Buzz Topic Pages Are Live!

Topic pages are live on the Yahoo! Buzz U.S. site. This is the project I’ve been working on for the last few months. The idea is to algorithmically generate topic pages for the buzzing topics of the day. The popular topics can be accessed from the “top topics” nav bar on the Buzz site. For a sample, click here to see the topic page about Chelsea Clinton. A screenshot:

Yahoo! Buzz Topics

Fun fact: We used these programming languages to build the back-end systems which power the site: Perl, PHP, Python, Java. Not to mention different storage and indexing systems, databases, servers, in-house and external frameworks, libraries etc.

This is the hard work of a lot of smart and dedicated people whom I have the privilege to work with. Please check it out and let me know what you think. Do you find it useful? Did you run into any bugs? Send me an email. Keep watching this space for updates. There is a lot more to come, I promise.

2 + 2 = 4

I’m nearing the end of a two-week vacation to India. The long flight and the free time gave me the opportunity to read a few books. A couple of them are worth mentioning:

High Fidelity is a movie I enjoyed thoroughly. I watched it several years ago and finally read the book last week. The movie closely follows the book, with some changes; for instance, the story happens in Chicago as opposed to London. Like the movie, the book is humorous and authentic. It’s not often you read a book and laugh out loud on each page.



Then I read Orwell’s Nineteen Eighty-Four. The story is futuristic and takes place in 1984, when the world is divided mostly into 3 superpowers which are in a permanent state of war. The protagonist lives in a country called Oceania, which consists of the Americas, the British Isles and Australia. The government is totalitarian and controls every single aspect of its citizens’ lives. Even thinking unorthodox thoughts is punishable by torture and death. The government is working on a subset of the English language called Newspeak to make it impossible for people to think unorthodox thoughts. While reading it, I was reminded of the East German Stasi. My favorite quote from the book:

Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.

It takes amazing foresight to write such a book in 1949. It’s bitingly sarcastic and haunting. Go read it if you haven’t already!

Biking to Work

I recently bought a bike and started riding it to work. Usually it takes me around 15 minutes to drive to work and it takes just 20 minutes to ride the bike to work. I guess biking to work is a convenient way to get in shape without spending too much time.

Cycling Commute Map

gbookmark2delicious 2.1 is Out

A new version of gbookmark2delicious is out. All the credit goes to Yang Zhang who implemented the new features. Some of them include:

  • Incremental synchronization capability for continuous mirroring of Google Bookmarks onto delicious.
  • Updates to work with current Google Bookmarks and delicious interfaces/formats.
  • Handle throttling and persistent retries for delicious’ REST API.
  • More flexibility via beefed-up CLI frontend (more options, etc.)
  • Local cache of the remotely pulled data.

YUI Compressor

If your web application is JavaScript intensive (these days most apps are), it is a good idea to minify the JavaScript and CSS code. A minification tool reduces the byte footprint of the code without changing its semantics. YUI Compressor is the best JavaScript minifier out there.

The product I work on at Yahoo! is pretty heavy on JavaScript and YUI Compressor gives me almost a 50% compression ratio, so the savings are significant.

$ du -hs foo*js
140K foo.js
216K foo_src.js

As you may have noticed, I keep the original source in CVS with a file name like foo_src.js. During the build, YUI Compressor is run on foo_src.js to generate foo.js. The application includes the compressed version, foo.js.
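The build step described above could be sketched in Python like this. It is only an illustration under a few assumptions: the jar path yuicompressor.jar is a placeholder for wherever your YUI Compressor jar lives, the helper names are made up, and the sizes come from the du listing, which rounds to disk blocks, so the ratio is approximate.

```python
from pathlib import Path


def minified_name(src):
    # Map foo_src.js to foo.js, following the naming scheme above.
    src = Path(src)
    if not src.stem.endswith("_src"):
        raise ValueError(f"expected a *_src.js file, got {src.name}")
    return src.with_name(src.stem[: -len("_src")] + src.suffix)


def compressor_command(src, jar="yuicompressor.jar"):
    # Build the command line for one file; the jar path is a placeholder.
    return ["java", "-jar", jar, str(src), "-o", str(minified_name(src))]


def savings(original_size, minified_size):
    # Fraction of the original size removed by minification.
    return 1 - minified_size / original_size


print(compressor_command("foo_src.js"))
print(round(savings(216, 140), 2))  # sizes in KB from the du listing
```

A build script could pass each command list to subprocess.run() for every *_src.js file in the source tree.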