Patent searching is still a profession with few formal educational opportunities, and learning generally comes with experience. However, no matter how skilled the searcher, a search is only as good as the coverage of the resources queried. Searchable full text patent collections are easy to come by for certain authorities, such as US, EP, and WO/PCT publications (check FreePatentsOnline, esp@cenet, and Patent Lens, just to name a few), but are much scarcer for Asian collections. In this post, I’ll compare four commercial search systems with regard to their searchable Japanese coverage: Minesoft PatBase, Questel’s QPAT and orbit.com platforms, Thomson Reuters Thomson Innovation, and LexisNexis TotalPatent.
Before I get to the comparison, just a quick note: although I have created this post by summarizing publicly available information from vendor system help files and, in some cases, statements made by vendor representatives, readers should verify coverage with providers before basing any purchasing decisions on this information.
To start, it’s worth noting that all of these systems contain the EPO’s DOCDB bibliographic and family file (historically known as the INPADOC bibliographic file), which means that they all contain English language abstracts from the Patent Abstracts of Japan (PAJ) collection. This collection extends back to 1976 for some technology areas (for more information about PAJ coverage, an excellent source is the 2006 work “Information Sources in Patents,” 2nd edition, by Stephen R. Adams). The delay from when a Japanese patent application is published to when its hand-translated English abstract appears in the PAJ collection is approximately four months. Some search systems, including Questel’s QPAT/orbit.com platform, Minesoft’s PatBase, and the JP-NETe system, load machine translated English abstracts into their database soon after publication, to be replaced with hand translated abstracts from the PAJ collection as they become available.
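The replacement behavior described above (machine translated abstract first, hand translated PAJ abstract once it arrives) can be sketched in a few lines of Python. This is only an illustration: the `AbstractRecord` fields and the `choose_abstract` function are names I made up, not any vendor's actual data model or load process.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AbstractRecord:
    """Hypothetical record for one Japanese publication."""
    publication_number: str
    machine_abstract: Optional[str] = None  # loaded soon after publication
    paj_abstract: Optional[str] = None      # hand translated, arrives months later


def choose_abstract(record: AbstractRecord) -> Tuple[Optional[str], str]:
    """Return (abstract text, source label), preferring the PAJ hand translation."""
    if record.paj_abstract:
        return record.paj_abstract, "PAJ hand translation"
    if record.machine_abstract:
        return record.machine_abstract, "machine translated"
    return None, "no English abstract yet"


# A newly published document has only the machine translation.
rec = AbstractRecord("JP-EXAMPLE-1", machine_abstract="draft machine text")
print(choose_abstract(rec))

# Once the PAJ abstract is delivered, it takes over.
rec.paj_abstract = "hand translated text"
print(choose_abstract(rec))
```

The practical point for searchers is that a keyword search run in the first few months after publication may be hitting machine translated text, so the same query can behave differently once the hand translated abstract is loaded.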
One of the most interesting elements of this comparison is that each of these systems has taken a different approach to providing Japanese data so that it can be searched by native English speakers. Both Thomson Reuters and Questel have decided to approach this problem by creating full text collections of searchable English-language machine translations of these patent documents. The benefit of this approach is that users can query the patent full text with English keywords; the obvious downside is that machine translation technology often produces wildly imperfect translations. Both of these collections are referred to as “machine-aided” or “machine-assisted” because of adjustments to the text that are made during pre- and post-processing:
- Thomson Reuters has attempted to mitigate machine translation quality problems by employing a team of editors to review each translated document and hand-correct untranslated terms. In addition, they use in-house machine translation software that inserts multiple synonyms for the major document keywords, which has the effect of making the documents “broader targets”: with keyword variants inserted, a document is more likely to come up in search results for a wide population of searchers. As a real life example, a standard machine translation service might produce the sentence:
“The wing section which has the first transition section attached in the body.”
While the Thomson Reuters software would produce the following sentence, with multiple keyword options:
“The blade|wing|shuttlecock part which has the front-edge part attached to the main body.”
- Questel has attempted to mitigate machine translation quality problems by offering “machine-assisted” translations. Where Thomson Reuters uses human editors, Questel uses an entirely machine-based approach designed by longtime partner Lingway, a company that specializes in linguistic technologies. According to Questel representatives, a number of approaches are used to ensure high-quality translations:
- The translation software relies on “proprietary, manually built and domain-specific dictionaries that have been enhanced by Lingway’s linguistic technologies.”
- A mix of different machine translation software is used, with the choice of specific software dependent on the type of text to be translated.
- The software uses a hybrid machine translation (HMT) approach that “leverages the strengths of statistical and rule-based translation methodologies.”
Questel also states that it replaces machine translations with hand translated information when it becomes available from other sources; we can assume that this refers to the hand translated abstracts from the Patent Abstracts of Japan file. In addition, Questel produces hand translated assignee names for its Japanese (and Korean) collections. Finally, Questel representatives have stated that they plan to add a keyword-searchable original-language collection of Japanese full text records in the future.
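To see why inserting keyword variants makes a document a “broader target,” here is a rough Python sketch using the two example sentences quoted earlier for Thomson Reuters. It assumes that every pipe-separated variant is indexed as its own search term; that is my assumption about the general idea, not a description of how Thomson Innovation actually indexes text, and the function names are hypothetical.

```python
import re

PLAIN = "The wing section which has the first transition section attached in the body."
EXPANDED = "The blade|wing|shuttlecock part which has the front-edge part attached to the main body."


def index_terms(text):
    """Collect lowercase terms, treating a|b|c as three separately indexed variants."""
    terms = set()
    for token in text.split():
        for variant in token.split("|"):
            # Strip punctuation but keep hyphens inside compound words.
            word = re.sub(r"[^\w-]", "", variant).lower()
            if word:
                terms.add(word)
    return terms


def matches(text, query):
    """True if every keyword in the query (implicit AND) appears in the text."""
    terms = index_terms(text)
    return all(keyword.lower() in terms for keyword in query.split())


for query in ("wing body", "blade body"):
    print(query, "| plain:", matches(PLAIN, query), "| expanded:", matches(EXPANDED, query))
```

A searcher who thinks of the part as a “blade” would miss the plain translation entirely, while the variant-enriched sentence is retrieved by either keyword. The trade-off is that unrelated variants (such as “shuttlecock”) are indexed too, which can add noise to results.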
Minesoft’s PatBase offers the full text of the patent documents in the original Japanese, alongside their English language bibliographic data and abstracts from the INPADOC file. As the PatBase product also offers a Japanese language interface, it’s reasonable to assume that this decision was made in part to make their product more useful to Japanese consumers. Although English translations can be produced from this full text “on-the-fly,” the Japanese full text is not keyword-searchable. Another consideration is that due to the structure of the underlying PatBase data, it is quite easy to perform a search over all the full text collections available in the system, and limit the results to only those with Japanese family members. Thus, a search through the full text of English-language family members can be used as a substitute for a full text search of Japanese documents, with the obvious drawback that the search will miss any Japanese documents of interest that do not yet have published full text English-language family equivalents. In addition, because the INPADOC family structure is used, it’s also possible that the JP family member of an English document of interest may be only distantly related with regard to claimed content.
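The family-based workaround, and its main blind spot, can be illustrated with a small Python sketch. Everything here is hypothetical: the family identifiers, the publication numbers, and the `jp_members_of_hit_families` function are invented for illustration and do not reflect PatBase's query syntax or data structures.

```python
def jp_members_of_hit_families(hits, families):
    """Map family id -> JP members, for families containing at least one hit.

    hits: publication numbers matched by an English full text search.
    families: dict of family id -> list of member publication numbers.
    """
    hit_set = set(hits)
    result = {}
    for family_id, members in families.items():
        if hit_set.intersection(members):
            jp_members = [m for m in members if m.startswith("JP")]
            if jp_members:
                result[family_id] = jp_members
    return result


# Made-up families: F1 has US and JP members, F2 has no JP member,
# and F3 is a JP-only family with no English-language equivalent.
families = {
    "F1": ["US-EXAMPLE-A", "JP-EXAMPLE-A"],
    "F2": ["EP-EXAMPLE-B"],
    "F3": ["JP-EXAMPLE-C"],
}
english_full_text_hits = ["US-EXAMPLE-A", "EP-EXAMPLE-B"]

print(jp_members_of_hit_families(english_full_text_hits, families))
```

Only the JP member of F1 is surfaced. The JP-only family F3 can never be reached this way, no matter how relevant it is, because there is no English full text for the query to match.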
Finally, the LexisNexis TotalPatent product offered only bibliographic and abstract coverage of the Japanese collection until the end of 2009/beginning of 2010, when a full text original language collection was loaded. Although the coverage page for the TotalPatent product states that some machine translated data is present for these documents, experimental queries show that keyword-searchable full text English machine translations are not yet available for this collection; a call to the LexisNexis Help Desk also confirmed that full text machine translations are not yet available. In addition, TotalPatent does not yet appear to be able to handle Japanese-language keyword queries; only the English abstracts from the Patent Abstracts of Japan collection are searchable in TotalPatent at this time.
In the next post in this series, I’ll compare the Japanese collections offered by these systems by kind code, available text, language and earliest year of coverage. If you’d like more information about any of these search products, you might be interested in the Search System Reports and Quick Table Comparisons available on Intellogist.com.