A Simple Google Search Client in C#
Have you ever felt the need to query Google and process search results from within C#? You can via a RESTful API (among other things) using Google’s Custom Search API. However, you have to sign up for this service and (cha-ching) it is not free, at least not entirely. You get “100 search queries per day for free”, but you are billed “$5 per 1000 queries” beyond that (see Google Custom Search API Pricing).
Well, why buy the cow when you can get the milk for free (very metaphorically speaking)? You can always use simple WebRequests to query Google — with some restrictions, see below — and process the results using specialized helper methods. The advantages of this approach: no need for an API key, no costs, and no dealing with yet another REST API. Today, I’d like to show a simple (naïve!) solution. Of course, I’m not saying that this is the only or always correct approach to the problem, just a cheap one.
Consume-First Approach: Use Case
Coding should always be driven by needs, not by implementations. So let’s see an example of how we would want to use a Google search client:
The example code is simple: We instantiate a SearchClient, providing it with a search query. We then issue the query and iterate over the results, printing the first 100 hits to the console.
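The original snippet is not reproduced here, but based on the description above, the intended usage might look roughly like this (the exact SearchClient API and the Title and Url properties of a hit are reconstructions, not the author’s actual code):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Instantiate the client with a search query...
        var client = new SearchClient("c# google search client");

        // ...then iterate over the results, printing the first 100 hits.
        foreach (var hit in client.Query().Take(100))
        {
            Console.WriteLine("{0}: {1}", hit.Title, hit.Url);
        }
    }
}
```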
Implementing the Search Client
Let’s see what an implementation of the SearchClient could look like:
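A sketch of the constructor part, reconstructed from the description that follows (the details are assumptions, not the original snippet):

```csharp
using System;

public partial class SearchClient
{
    // The search query provided by the user, stored for later usage.
    private readonly string query;

    public SearchClient(string query)
    {
        // Make sure a query is actually provided...
        if (string.IsNullOrWhiteSpace(query))
        {
            throw new ArgumentException("A search query must be provided.", "query");
        }

        // ...and store it in a private backing field.
        this.query = query;
    }
}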
The client depends on a search query provided by the user, which is injected through its constructor (line 12). We simply make sure such a query is provided and store it in a private backing field for later usage.
Next, we will implement the client’s public interface, i.e. three specialized query methods as shown in the following snippet:
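The original snippet is not included here; a sketch consistent with the description that follows might look like this (the exact signatures and the default page size of ten are assumptions, while the helper names AssembleQueryUri, InstantiateWebRequest, and SearchResultParser are taken from the article’s prose):

```csharp
using System.Collections.Generic;
using System.IO;

public partial class SearchClient
{
    // Most general method: iterates over all search hits.
    public IEnumerable<SearchResultHit> Query()
    {
        return Query(hitsPerPage: 10);
    }

    // Forwards to the paged method, yielding hits one by one.
    public IEnumerable<SearchResultHit> Query(int hitsPerPage)
    {
        for (int startIndex = 0; ; startIndex += hitsPerPage)
        {
            var page = QueryPaged(startIndex, hitsPerPage);
            if (page.Count == 0)
            {
                yield break; // no more results
            }

            foreach (var hit in page)
            {
                yield return hit;
            }
        }
    }

    // Most specialized method: retrieves a single page of results
    // and does the heavy lifting of talking to the Google server.
    public List<SearchResultHit> QueryPaged(int startIndex, int hitsPerPage)
    {
        var uri = AssembleQueryUri(startIndex, hitsPerPage);
        var request = InstantiateWebRequest(uri);

        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var html = reader.ReadToEnd();
            return new SearchResultParser().ParseSearchHits(html);
        }
    }
}
```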
The query methods are listed from least specialized (most general) to most specialized. Internally, the first two forward (line 9) or delegate requests (line 22) to the next, more specialized method, until the last one (the most specialized method, line 39) does all the heavy lifting.
Please note the IEnumerable<SearchResultHit> return type of the first two Query(...) methods. These methods are implemented as iterators, which allows client code to handle their results (search hits, represented by the SearchResultHit type) one piece at a time with full LINQ support. The third method, QueryPaged(...), on the other hand, returns a plain list of SearchResultHits (line 39). Each such list contains at most as many search hits retrieved from the Google query as defined by the hitsPerPage parameter. Essentially, that third method “pages” through the search results in sets of ten (or however many the hitsPerPage parameter specifies), beginning at a certain index (as defined by the startIndex parameter), while the first two methods iterate over the elements contained in each page, yielding the search hits found one by one.
Iterators in C# are implemented using the yield keyword. Enumeration of an IEnumerable<> is deferred until the last moment, i.e. until it is actually iterated over. This means that the bodies of our Query(...) methods get executed only when their results are actually consumed (i.e. iterated over), not when the methods are merely called. This allows the client to process only as many items (in our case, search hits) as required, saving resources (time and computational effort) by not processing hits in advance.
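This deferred behavior is easy to demonstrate with a small, self-contained iterator:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static IEnumerable<int> Numbers()
    {
        // This line runs only once enumeration actually starts.
        Console.WriteLine("executing");
        yield return 1;
        yield return 2;
        yield return 3;
    }

    static void Main()
    {
        var numbers = Numbers();     // nothing printed yet: the call merely creates the iterator
        var first = numbers.First(); // now "executing" is printed; only one item is produced
        Console.WriteLine(first);    // prints 1
    }
}
```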
Finally, the innermost QueryPaged(...) method is the point where our code communicates with the Google server and retrieves the actual search results. To mitigate the performance overhead involved (web requests are comparatively expensive), results are retrieved in small packages, so that not every single search hit requested triggers an expensive roundtrip to the Google server.
Now that the paging and enumerating of search results is clear, let’s focus on the implementation details of (web) requesting these from the Google servers.
Handling and Processing Web Requests to Google
Processing web requests to the Google servers is delegated to more specialized methods, as seen in lines 42 to 48 in the previous snippet. The implementation details are as follows:
The AssembleQueryUri(...) method is responsible for assembling the actual query URL that is requested, properly URL-encoding its parameters (the search query saved initially as well as some paging parameters). InstantiateWebRequest(...) then creates the web request, which is sent off and whose response data is subsequently parsed.
Actually, the parsing of the HTML data returned by Google is done by yet another specialized class named SearchResultParser (see line 57).
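A sketch of these two helpers, assuming Google’s public q, start, and num query parameters and a placeholder user agent value:

```csharp
using System;
using System.Net;

public partial class SearchClient
{
    // Assembles the Google query URL, URL-encoding the search query.
    private Uri AssembleQueryUri(int startIndex, int hitsPerPage)
    {
        var url = string.Format(
            "https://www.google.com/search?q={0}&start={1}&num={2}",
            Uri.EscapeDataString(this.query),
            startIndex,
            hitsPerPage);
        return new Uri(url);
    }

    // Creates the web request that will be sent to the Google server.
    private HttpWebRequest InstantiateWebRequest(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        // Identify the client honestly; the exact value is a placeholder.
        request.UserAgent = "SimpleGoogleSearchClient";
        return request;
    }
}
```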
Processing Google’s Results
The SearchResultParser class is responsible for processing the results retrieved from Google, which we get in the form of HTML data. Processing is done in the following snippet:
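The original snippet is not reproduced here; an illustrative sketch might look like the following (the regular expression is a stand-in designed against a simplified markup, not the author’s actual pattern, and it would need adjusting against Google’s real result pages):

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public class SearchResultHit
{
    public string Title { get; set; }
    public string Url { get; set; }
}

public class SearchResultParser
{
    // Illustrative pattern only: matches result links of the shape
    // <a href="..."><h3>...</h3>; Google's actual markup differs and changes.
    private static readonly Regex HitRegex = new Regex(
        "<a href=\"(?<url>https?://[^\"]+)\"[^>]*><h3[^>]*>(?<title>.*?)</h3>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public List<SearchResultHit> ParseSearchHits(string html)
    {
        var hits = new List<SearchResultHit>();
        foreach (Match match in HitRegex.Matches(html))
        {
            hits.Add(new SearchResultHit
            {
                Url = match.Groups["url"].Value,
                Title = WebUtility.HtmlDecode(match.Groups["title"].Value)
            });
        }
        return hits;
    }
}
```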
The name Parser in SearchResultParser is a bit misleading. While the class acts as a parser (sifting through all the redundant and boilerplate HTML data returned to extract the search hits relevant to us), it does not actually implement one. What it does is take a search query result (a single Google results page, if you will) and match its embedded search hits with the help of a regular expression (line 4). That expression was designed by manually inspecting the HTML code returned by Google, which means it is susceptible to breaking should Google decide to alter their search result content.
Caveats and Areas for Improvement
To wrap up this post, let me point out some caveats and areas for improvement:
While the code discussed here is a cheap way of querying Google from within C#, it is also a very naïve solution to our problem. Google’s Custom Search API acts as an explicit contract between your code (as the client) and Google’s service; our approach, by contrast, relies on implicit assumptions when “parsing” search results through a regular expression. In other words, while Google’s official API is a reliable public interface, our solution depends on implementation details. Should Google decide to reformat or change the HTML structure of its query result pages, our regular expression is very likely to break.
Instead of using regular expressions (which are not the best option when it comes to disentangling loosely defined data structures such as HTML), we might implement a real parser to extract the search results from the data retrieved. But that is not a trivial task and would be overkill in most cases. Alternatively, we could make use of an HTML parser such as the Html Agility Pack. Coupled with some heuristics, the stability of our solution could increase substantially.
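For illustration, a minimal Html Agility Pack based extraction could look like this (the XPath is deliberately generic; a real implementation would need heuristics targeting Google’s actual markup):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class AgilityPackParser
{
    // Extracts all hyperlink targets from an HTML document.
    public static List<string> ExtractLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var links = new List<string>();
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) // SelectNodes returns null when nothing matches
        {
            foreach (var anchor in anchors)
            {
                links.Add(anchor.GetAttributeValue("href", string.Empty));
            }
        }
        return links;
    }
}
```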
Another problem that can be spotted in our code is its dependency on Google’s goodwill. Just like any web server, Google’s server is free to reject any client request it receives. Currently, we are identifying our web client through a custom User Agent string, as seen in the following snippet:
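A sketch of that idea (the user agent value shown is a placeholder, not the author’s original):

```csharp
using System;
using System.Net;

class UserAgentExample
{
    static HttpWebRequest CreateRequest(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        // Identify the client via a custom User Agent string.
        request.UserAgent = "SimpleGoogleSearchClient";
        return request;
    }
}
```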
So far, Google plays along just fine, but it might cut us off at any time. We could then fake a real user agent (such as Internet Explorer’s), but that would run counter to being a good netizen and is generally (and rightly so) frowned upon.
There’s really nothing we can do about this situation other than playing nice with Google (by identifying ourselves sincerely) and hoping that it will keep serving our requests in the future.
Source Code on GitHub
Please grab the source bits from GitHub, including the snippets above and some additional helper classes.