Data-Mining the Foundational Patents of Photovoltaic Materials: An Application of Patent Citation Spectroscopy

Comins and Leydesdorff: Data-Mining the Foundational Patents of Photovoltaic Materials: An Application of Patent Citation Spectroscopy



In his presidential address to the American Economic Association, entitled “Productivity, R&D and the Data Constraint”, Griliches[3] observed:

Our measurement frameworks are not set up to record detailed origin and destination data for commodity flows, much less so for information flows. We do have now a new tool for studying some of this: citations to patents and the scientific literature,[4] but anyone currently active in the e-mail revolution and participating in the conferences and workshops circuit knows how small this tip is relative to the informal-communications iceberg itself.

In this study, we report on a routine, Patent Citation Spectroscopy (PCS), that uses citations among patents to trace the foundational patent of a technology, so that one can more easily reconstruct the trajectory of that technology from its origin to the present time. Hitherto, subject matter experts have had to review patents and patent applications and maintain an awareness of the most technologically important patents. This has remained a time-consuming practice that presents several obstacles, including difficulty with reliability and replication and dependence on the availability of experts.[5]

The problem is not only one of sufficient (wo)manpower and skills. The huge patent database is not easily accessible for retrieval and reconstruction. Jensen and Murray,[6] for example, argue that the impact of gene patents on downstream research and innovation is unknown, in part because of a lack of empirical data on the extent and nature of gene patenting. The intellectual property rights for some genes can become highly fragmented among many owners, which suggests that downstream innovators may face considerable costs to gain access to gene-oriented technologies. Konski and Spielthenner[7] developed a landscape analysis of stem-cell patents using a clustering algorithm based on network analysis, which enables the user to find “bridging” patents between technological developments.

With the support of the Office of the Chief Economist in the US Patent and Trademark Office (USPTO), PatentsView was launched in 2015 as a new patent data visualization and analysis platform intended to increase the value, utility and transparency of US patent data. The PatentsView platform is built on USPTO’s regularly updated database that longitudinally links inventors, their organizations, locations and overall patenting activity. PatentsView delivers US patent data in ways that make these data fully discoverable and exploitable by various end users. Our algorithmic method for Patent Citation Spectroscopy (PCS) exploits PatentsView data and enables the user to identify landmark patents interactively via a web-application.

Retrieval and Disclosure: Patents as Indicators

Beyond their critical role in industry, patents are indicators of inventions and thus can be expected to carry information about technological progress[4,8] (O’Donoghue et al., 1998; Harhoff et al., 1999; Artz et al., 2010; Graevenitz et al., 2013). Patents provide a unique window on knowledge-based economies[9] and can serve both as an indicator of industrial activity and as an output of academia.[10] The United States Patent and Trademark Office (USPTO), under the Department of Commerce, plays a vital role in linking universities and industry in the American innovation system by registering inventions and extending legal protection over them. In exchange for detailed public disclosure of a technical invention, the patent assignee (the legal entity to which intellectual property rights are assigned) is entitled to a monopoly over the patent’s claims.

As an example, we demonstrate the effectiveness of PCS for retrieval by analyzing the seminal patents for the material technologies underlying photovoltaic cells. As R&D in photovoltaic materials matures, increases in energy efficiency and decreases in production costs could have a significant impact on the global energy sector.[1-2] Intellectually, this study follows up on Leydesdorff, Alkemade, Heimeriks and Hoekstra’s[11] study of the innovation dynamics of photovoltaic cells, which provided an online animation of geographical diffusion. In that study, however, we focused only on “dye sensitized solar cells” (CuInSe2-based cells), their geographical diffusion and their technological branching, from the perspective of technology studies and regional economics. In this study, we do not follow the time axis forward, but look back in order to retrieve a starting point for the evolving technology. Furthermore, we extend the analysis to the nine classes for photovoltaic cells that were added to the patent classification (Table 1). To do so, we leverage the taxonomy of the recently renewed patent classification system, known as the Cooperative Patent Classification (CPC).

Table 1

Nine classes of photovoltaic cells in CPC.

| CPC class | Description |
| --- | --- |
| Y02E 10/541 | CuInSe2 material PV cells |
| Y02E 10/542 | Dye sensitized solar cells |
| Y02E 10/543 | Solar cells from Group II-VI materials |
| Y02E 10/544 | Solar cells from Group III-V materials |
| Y02E 10/545 | Microcrystalline silicon PV cells |
| Y02E 10/546 | Polycrystalline silicon PV cells |
| Y02E 10/547 | Monocrystalline silicon PV cells |
| Y02E 10/548 | Amorphous silicon PV cells |
| Y02E 10/549 | Organic PV cells |

While there are numerous patent classification systems, the most widely used in patent studies have hitherto been the United States Patent Classification (USPC) system, which comprises more than 160,000 classes and subclasses of patent functions (USPTO, 2008); its European counterpart (ECLA, of the European Patent Office, EPO); and the International Patent Classification (IPC) system, a hierarchical system managed by the World Intellectual Property Organization (WIPO) that consists of more than 70,000 classifications of technical fields (WIPO, 2014). In 2013, the USPTO and the EPO adopted a new classification system for patents that will ultimately replace both the USPC and ECLA. The CPC system of these two large agencies provides a tree-like hierarchy with five levels of depth and more than 250,000 classifications at the leaf-node level, and is currently in use for patents filed through the USPTO as well as the EPO.

Furthermore, CPC extends the previous systems by introducing the Y-class of patents, which represents newly emerging technologies across sectors. The new classes are backtracked into the previous system. Currently, there are nine CPC classifications that describe material photovoltaic technologies (Table 1).

We extend our understanding of the performance of PCS by applying the methodology to each of these classifications, using the advanced search capability of the online tools of PatentsView and PCS. Below, we first briefly review the PCS methodology and tool, and then describe our findings on the landmark patents underlying photovoltaic material technologies.

Patent Citation Spectroscopy

PCS is a data mining method that operates over the cited references within sets of patents. The goal is to generate a historical assessment of the most impactful patents within technological areas. The underlying PCS computation is based on a similar data mining methodology developed for use on academic literature, known as Reference Publication Year Spectroscopy (RPYS).[12] This method involves aggregating the cited references across a set of retrieved documents and organizing these cited references by their publication year. For each cited reference year, the total number of references is calculated. Next, the data are de-trended by taking the absolute deviation of the number of cited references for a given year from the 5-year median (the two preceding, the current, and the two following years). As specifically applied to patents, this is represented by the equation:

$$f(t) = \left| C_t - \operatorname{med}\left(C_{t-2}, C_{t-1}, C_t, C_{t+1}, C_{t+2}\right) \right| \tag{1}$$

where C_t represents the total number of citations to patents granted in year t and med represents the median. These steps do not deviate from RPYS in calculation (though RPYS had not previously been applied to patents). However, this de-trending function considers only the aggregated cited-reference activity over time. This creates a challenge in identifying seminal works, because an outlier in the de-trended series could result either from a surge in the influence of a single document (i.e., what we might consider a seminal work) or from several slightly influential documents published in the same year. PCS therefore includes an additional normalization to distinguish outliers driven by the outstanding performance of a single document from those driven by a group of documents:

$$\mathrm{PCS}(t) = f(t) \times \frac{\text{count of references to the most referenced patent in year } t}{C_t} \tag{2}$$

This step multiplies the result of equation (1) by the share of all references from that year that is attributable to the most referenced patent.
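The two-step computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the PCS web-application: the function name `pcs_scores`, the input format (one `(patent_id, grant_year)` pair per citation), the treatment of years without data as zero counts at the edges of the window, and the toy data are all hypothetical.

```python
from collections import Counter
from statistics import median


def pcs_scores(cited_refs):
    """Return {year: (PCS score, most referenced patent)}.

    cited_refs: iterable of (patent_id, grant_year) pairs, one pair per
    citation found in the retrieved set (hypothetical input format).
    """
    refs = list(cited_refs)
    per_year = Counter(year for _, year in refs)   # C_t
    per_patent = Counter(refs)                     # citations per patent

    # Most referenced patent for each grant year (first one wins on ties).
    top = {}
    for (pid, year), n in per_patent.items():
        if year not in top or n > top[year][1]:
            top[year] = (pid, n)

    scores = {}
    for year, c_t in per_year.items():
        # 5-year window t-2..t+2; missing years count as 0 (assumption).
        window = [per_year.get(y, 0) for y in range(year - 2, year + 3)]
        f_t = abs(c_t - median(window))            # equation (1)
        pid, n_top = top[year]
        scores[year] = (f_t * n_top / c_t, pid)    # equation (2)
    return scores


# Toy data: patent "C" (granted 1982) dominates the references of its year.
toy_refs = (
    [("A", 1980)] * 2
    + [("B", 1981)] * 3
    + [("C", 1982)] * 12
    + [("D", 1982)] * 2
    + [("E", 1983)] * 3
    + [("F", 1984), ("G", 1984)]
)
scores = pcs_scores(toy_refs)
best_year = max(scores, key=lambda y: scores[y][0])
print(best_year, scores[best_year])
```

In the toy data, 1982 is the only year that deviates from its 5-year median, and 12 of its 14 references point to a single patent, so the second normalization leaves most of the peak intact. Had the 14 references been spread evenly over many patents, the same peak would have been scaled down sharply.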

Applying Patent Citation Spectroscopy to Material Photovoltaic Technologies

At present, PCS can be applied to granted US patents using a web-application produced by Comins et al. (under review). The web-application leverages the application programming interface (API) of the public data platform PatentsView, which is supported by the USPTO Chief Economist. Users can search for patents using either keyword phrases (e.g., “photovoltaic cells”) or more advanced searches. These advanced searches follow the conventions described in the documentation of the data provider (PatentsView). Among other things, advanced search queries enable users to apply PCS to patents based on their Cooperative Patent Classification.

Using the PCS web-application, we conducted a search for the seminal patents of the nine CPC subclasses pertaining to photovoltaic solar cells. Here, we walk through the analytic routine for a single case (CPC subclass Y02E 10/541: CuInSe2 material PV cells). In this case, an advanced search was conducted in the PCS-application using the following query: ADVANCED={"cpc_subgroup_id":"Y02E10\/541"}. This search retrieved metadata on 962 granted US patents and analyzed a total of 3,502 unique patent references. The application yields a visualization of the PCS algorithm output as well as the method’s most likely seminal patent (see Figure 1). In the case of CPC subclass Y02E 10/541, the resulting seminal patent is US4335266: “Methods for forming thin-film heterojunction solar cells from I-III-VI2” by Reid Mickelsen and Wen Chen.
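The query string above is a small JSON object in which `\/` is simply the JSON escape for a forward slash. The following sketch shows how such query bodies could be generated for all nine subgroups of Table 1. The helper name `cpc_query` is hypothetical, and the sketch only builds and parses strings; it makes no call to PatentsView or to the PCS web-application.

```python
import json

# The query body as quoted in the text; "\/" is a JSON-escaped "/".
raw_query = r'{"cpc_subgroup_id":"Y02E10\/541"}'
parsed = json.loads(raw_query)
print(parsed["cpc_subgroup_id"])


def cpc_query(subgroup_id):
    """Build the JSON body of an advanced query for one CPC subgroup.

    Hypothetical helper: only the key "cpc_subgroup_id" and the id format
    (no space, e.g. "Y02E10/541") are taken from the query quoted above.
    """
    return json.dumps({"cpc_subgroup_id": subgroup_id}, separators=(",", ":"))


# One query per photovoltaic subgroup, Y02E 10/541 through Y02E 10/549.
queries = [cpc_query(f"Y02E10/{n}") for n in range(541, 550)]
for q in queries:
    print("ADVANCED=" + q)
```

Python's `json.dumps` writes the slash unescaped, which is equivalent JSON: both forms decode to the same subgroup identifier.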

Figure 1

The PCS-derived foundational patent for CPC subclass Y02E 10/541 (CuInSe2 material PV cells) is US4335266: “Methods for forming thin-film heterojunction solar cells from I-III-VI2” by Reid Mickelsen and Wen Chen.

To validate the results of the algorithm, we searched for scholarly articles citing patent US4335266 as the underlying invention of CuInSe2 material PV cells. In this instance, an article appearing in Materials Science Forum[13] states “…in 1980, Boeing Aerospace demonstrated, for the first time, the milestone of 10 % small-area cell efficiency in the form of thin-film solar cells with a CuInSe2 alloy system, in which they successfully invented how to prepare the p-type absorbers known as so-called ‘bilayer’ process [Mickelsen and Chen, US4335266].” Such articles provide corroborating evidence as to the performance of PCS.[11]

This foundational patent was granted in 1982 and has since been cited 151 times in other USPTO patents. Figure 2 shows the time series of the patents building on US4335266, broken down by country. The number of co-inventors is 351, of whom 56 are from Japan, 10 from Taiwan and 273 from the USA. However, 82% of the applicants are American. The Japanese and Taiwanese efforts during the period 1995-2005 were perhaps too early. The main applications followed some three decades after the grant, in the US during the years 2010-2015.
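As a quick check on these figures, the inventor counts can be converted into shares. The snippet below uses only the numbers reported above; the dictionary layout and the label "other" for the remaining inventors are our own illustrative choices.

```python
# Co-inventor counts reported in the text for patents citing US4335266.
total_inventors = 351
inventors = {"USA": 273, "Japan": 56, "Taiwan": 10}

# Inventors from countries not itemized in the text.
inventors["other"] = total_inventors - sum(inventors.values())

shares = {
    country: round(100 * count / total_inventors, 1)
    for country, count in inventors.items()
}
for country, pct in shares.items():
    print(f"{country}: {inventors[country]} inventors ({pct}%)")
```

On these counts, roughly 78% of the co-inventors are US-based, somewhat below the 82% share of American applicants.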

Figure 2

Geographical spread over time of the 151 US patents citing the foundational patent of CuInSe2 material PV cells.

In sum, the technique enables us to find the foundational patent and to pursue the analysis from there. A summary of the results for all nine photovoltaic subclasses examined in our study is given in Table 2. Once a seminal patent for a given topic had been identified, a traditional search of the scholarly literature was performed to identify corroborating evidence of the patent’s influence. For 5 of the 9 subgroups analyzed, we found corroborating scholarly evidence for the seminal patents identified by the PCS algorithm.

Table 2

Summary of results from our application of the PCS algorithm to the nine CPC subclasses related to photovoltaic technology. The table lists corroborating evidence, where available, for the patents identified by the PCS tool.

| Patent Topic | CPC Subgroup | PCS Identified Seminal Patent | Corroborating Evidence |
| --- | --- | --- | --- |
| CuInSe2 material PV cells | Y02E10/541 | US4335266 | Kushiya et al.[13] |
| Dye sensitized solar cells | Y02E10/542 | US4927721 | Longo and Paoli[14] |
| Solar cells from Group II-VI materials | Y02E10/543 | US5536333 | Cheese et al.[15] |
| Solar cells from Group III-V materials | Y02E10/544 | US6252287 | Takamoto et al.[16] |
| Microcrystalline silicon PV cells | Y02E10/545 | US5677236 | none |
| Polycrystalline silicon PV cells | Y02E10/546 | US5227329 | none |
| Monocrystalline silicon PV cells | Y02E10/547 | US5053083 | none |
| Amorphous silicon PV cells | Y02E10/548 | US4109271 | none |
| Organic PV cells | Y02E10/549 | US4539507 | Kiy[17] |


We applied Patent Citation Spectroscopy (PCS), an adaptation of Reference Publication Year Spectroscopy (RPYS), which was originally developed for studying landmarks and milestones in scientific literature,[18-19] to patent literature classified into the nine Y-subclasses of CPC that describe material photovoltaic technologies. In five of the nine cases, we found corroborating evidence for the foundational character of the patent indicated by the routine.

The possible applications of PCS are numerous. In a scholarly context, one can be interested in the reconstruction of the main path of patent citations.[20] Patents branch out in tree-like structures along trajectories. The root patent can be followed historically using sequences of patent citations. The algorithmic method for Patent Citation Spectroscopy (PCS) presented in this study provides a solution to the problem of where to begin the analysis of a technological development. PCS enables the user to retrieve the fundamental patent in any technological domain using a topical search. This application thus orients the user strategically.[21]
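To illustrate how a root patent can be followed through sequences of citations, the sketch below walks forward from a root through the patents that cite it, generation by generation. It is a plain breadth-first traversal and not the main path analysis of Liu and Lu;[20] the function name, the `cited_by` mapping, and the patent identifiers P1 to P4 are hypothetical.

```python
from collections import deque


def citation_generations(root, cited_by):
    """Return {patent: generation}, where generation is the length of the
    shortest citation chain leading back to the root patent.

    cited_by maps a patent id to the ids of patents that cite it
    (hypothetical toy structure, not a PatentsView data format).
    """
    generation = {root: 0}
    queue = deque([root])
    while queue:
        patent = queue.popleft()
        for citing in cited_by.get(patent, []):
            if citing not in generation:      # visit each patent once
                generation[citing] = generation[patent] + 1
                queue.append(citing)
    return generation


# Toy citation network rooted in the foundational patent found by PCS.
cited_by = {
    "US4335266": ["P1", "P2"],
    "P1": ["P3"],
    "P2": ["P3", "P4"],
}
generations = citation_generations("US4335266", cited_by)
for patent, depth in sorted(generations.items(), key=lambda kv: kv[1]):
    print(depth, patent)
```

Starting from the PCS result rather than from an arbitrary patent ensures that the traversal covers the whole tree that branches out from the technology's origin.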

For this study, we extended the routine with the option to use the advanced search queries of PatentsView. On the basis of two normalizations of the longitudinal distribution of the publication years of the patents cited by the retrieved patents, the routine provides a best guess of the foundational patent for the subject specified in the search string. It seems to us that the successful application in five of the nine cases, together with the previous results for biomedical patents reported by Comins and Leydesdorff,[18] provides some confidence that this indicator of fundamental patents has potential. However, the normalizations may have to be refined based on further analysis of successful and unsuccessful applications.

Follow-up studies could combine the results of PCS with the longitudinal animations demonstrated by Leydesdorff et al.,[11] which have since been developed further into the stand-alone tool PatViz. PatViz can be installed on one’s own machine, and one can upload one’s own data to this routine. For example, one can first retrieve the source of a technology using PCS and then follow the citations.



1. Chu S, Cui Y, Liu N. The path towards sustainable energy. Nature Materials. 2017;16(1):16.

2. Polman A, Knight M, Garnett EC, Ehrler B, Sinke WC. Photovoltaic materials: Present efficiencies and future challenges. Science. 2016;352(6283).

3. Griliches Z. Productivity, R&D and the Data Constraint. American Economic Review. 1994;84(1):1–23.

4. Jaffe AB, Trajtenberg M, Henderson R. Geographic localization of knowledge spillovers as evidenced by patent citations. The Quarterly Journal of Economics. 1993;108(3):577–98.

5. Cockburn IM, Kortum SS, Stern S. Are All Patent Examiners Equal? The Impact of Examiner Characteristics. Working Paper 8980. Cambridge, MA: NBER; 2002.

6. Jensen K, Murray F. Intellectual property landscape of the human genome. Science. 2005;310(5746):239–40.

7. Konski AF, Spielthenner DJ. Stem cell patents: a landscape analysis. Nature Biotechnology. 2009;27(8):722.

8. Comins JA, Carmack SA, Leydesdorff L. Patent Citation Spectroscopy (PCS): Algorithmic retrieval of landmark patents. arXiv preprint arXiv:1710.03349.

9. Jaffe AB, Trajtenberg M. Patents, Citations and Innovations: A Window on the Knowledge Economy. Cambridge, MA/London: MIT Press; 2002.

10. Shelton RD, Leydesdorff L. Publish or Patent: Bibliometric evidence for empirical trade-offs in national funding strategies. Journal of the American Society for Information Science and Technology. 2012;63(3):498–511.

11. Leydesdorff L, Alkemade F, Heimeriks G, Hoekstra R. Patents as instruments for exploring innovation dynamics: geographic and technological perspectives on “photovoltaic cells”. Scientometrics. 2015;102(1):629–51. doi: 10.1007/s11192-014-1447-8.

12. Marx W, Bornmann L, Barth A, Leydesdorff L. Detecting the historical roots of research fields by reference publication year spectroscopy (RPYS). Journal of the Association for Information Science and Technology. 2014;65(4):751–64.

13. Kushiya K, Sugimoto H, Chiba Y, Tanaka Y, Hakuma H. Change of the Characterization Techniques as Progress of CuInSe2-based Thin-Film PV Technology. In: Materials Science Forum. Trans Tech Publications; 2012;725:165–70.

14. Longo C, Paoli DMA. Dye-sensitized solar cells: a successful combination of materials. Journal of the Brazilian Chemical Society. 2003;14(6):898–901.

15. Cheese E, Mapes MK, Turo KM, Jones AR. US Department of Energy photovoltaics research evaluation and assessment. In: Photovoltaic Specialists Conference (PVSC), 2016 IEEE 43rd. IEEE; 2016. p. 3475–80.

16. Takamoto T, Agui T, Washio H, Takahashi N, Nakamura K, Anzawa O, et al. Future development of InGaP/(In)GaAs based multijunction solar cells. In: Photovoltaic Specialists Conference, 2005. Conference Record of the Thirty-first IEEE. IEEE; 2005. p. 519–24.

17. Kiy M. Charge injection and transport in organic semiconductors. Doctoral dissertation. 2002.

18. Comins JA, Leydesdorff L. Citation algorithms for identifying research milestones driving biomedical innovation. Scientometrics. 2017;110(3):1495–504.

19. Thor A, Marx W, Leydesdorff L, Bornmann L. Introducing CitedReferencesExplorer: A program for Reference Publication Year Spectroscopy with Cited References Disambiguation. Journal of Informetrics. 2016;10(2):503–15. doi: 10.1016/j.joi.2016.02.005.

20. Liu JS, Lu LY. An integrated approach for main path analysis: Development of the Hirsch index as an example. Journal of the American Society for Information Science and Technology. 2012;63(3):528–42.

21. Rotolo D, Rafols I, Hopkins MM, Leydesdorff L. Strategic intelligence on emerging technologies: Scientometric overlay mapping. Journal of the Association for Information Science and Technology. 2017;68(1):214–33. doi: 10.1002/asi.23631.