Thursday, May 31, 2012

See which papers gain new citations

I spent an hour writing a 93-line Python script that parses INSPIRE to see which papers have gained new citations since the last check. It works like this:
1 ( 2 )  Quasi-Single Field Inflation with Large Mass
2 ( 2 )  Odd-dimensional de Sitter Space is Transparent
3 ( 2 )  Chain Inflation Reconsidered
4 ( 8 )  Angular 21 cm Power Spectrum of a Scaling Distribution of Co
5 ( 111 )  Dark Energy

... ...
39 ( 12 )  Inflation with High Derivative Couplings

Papers with citations:  39 ;  Citations:  1083

110  =>  111  in  Dark Energy
Save citations (default yes, type n otherwise)?
Citations saved.

Note added: The code below runs in Python 3. As a kind tester told me, it does not behave well in Python 2. To make it work there, change saveQ = input(...) to saveQ = raw_input(...). Optionally, also remove the "()" following "print" to get the correct output format in Python 2.
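
If you would rather not edit the script by hand for each interpreter, here is a minimal sketch of a small shim that picks the prompt function at run time. The names read_line and wants_save are hypothetical and do not appear in the original script; wants_save only restates the script's rule that anything other than "n" or "N" means yes.

```python
import sys

# Pick the prompt function that exists in the running interpreter.
# read_line is a hypothetical name, not part of the original script.
if sys.version_info[0] >= 3:
    read_line = input
else:
    read_line = raw_input  # noqa: F821 (defined only in Python 2)


def wants_save(answer):
    """Mirror the script's rule: anything except 'n' or 'N' means yes."""
    return answer not in ('n', 'N')
```

With this in place, the prompt line would read saveQ = read_line("Save citations (default yes, type n otherwise)? ") under either interpreter.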

The mechanism is a shell wrapper that downloads the INSPIRE web page and saves it to a file, which Python then parses. Here is the code in case you are interested. Unfortunately there are a few hard-coded path names -- I am just too lazy to change them ^_^
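
To make the parsing step concrete before the full listing, here is a small sketch that applies the same record pattern used in get_citations to a made-up HTML fragment. The fragment, its href values and the e-print number are placeholders I invented so that the pattern matches; the real INSPIRE page is larger and may differ in detail.

```python
import re

# Same record pattern as in get_citations in the listing below.
RECORD_RE = re.compile(
    r'<a class = "titlelink" href=".+?">(.+?)\.<.+?'
    r'<br/>e-Print: <b>(.+?)</b>.+?Cited by ([0-9]+?) record',
    re.DOTALL)

# Hypothetical fragment: placeholder links and e-print id, not real data.
sample = (
    '<a class = "titlelink" href="/record/0">Dark Energy.</a>\n'
    '<br/>e-Print: <b>arXiv:0000.00000</b>\n'
    '<a href="/search?p=refersto">Cited by 111 records</a>'
)

# Each match yields (title, e-print id, citation count as a string).
title, eprint, count = RECORD_RE.search(sample).groups()
print(title, eprint, int(count))
```

The count comes back as a string, so it needs an int() conversion before summing.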

Shell script:

#!/bin/bash
LOCALPWD=/home/wangyi/Dropbox/local/check_citation
wget "https://inspirehep.net/search?ln=en&ln=en&p=author%3AY.Wang.39&of=hb&action_search=Search&sf=&so=d&rm=&rg=100&sc=0" -O "$LOCALPWD/inspires.html"
python "$LOCALPWD/cc.py" "$LOCALPWD/inspires.html"

Python code:

import re
import sys
import pickle

def cut_page(page):
    paras = re.findall(r'<!C-START REC 11.Brief--!>.+?<abbr class="unapi-id"', page, re.DOTALL)
    last = re.findall(r'.+(<!C-START REC 11.Brief--!>.+)', page, re.DOTALL)[0]
    paras.append(last)
    return paras
    

def get_citations(page):
    paragraphs = cut_page(page)
    citations = []
    for para in paragraphs:
        re_match = re.findall(r'<a class = "titlelink" href=".+?">(.+?)\.<.+?<br/>e-Print: <b>(.+?)</b>.+?Cited by ([0-9]+?) record', para, re.DOTALL)
        if re_match != []:
            citations.append(re_match[0])
    return citations

def sum_citations(citations):
    # The regex captures counts as strings; int() is safer than eval().
    total = 0
    for item in citations:
        total += int(item[2])
    return total

def get_htm_string(fn):
    try:
        htmfile = open(fn)
    except IOError:
        # Print the message to stderr and exit with a non-zero status.
        sys.exit("Error: input file " + fn + " not found. Abort.")
    htm_string = htmfile.read()
    htmfile.close()
    return htm_string

def save_citations(citations, last_citations):
    saveQ = input("Save citations (default yes, type n otherwise)? ")
    if (saveQ != 'n' and saveQ != 'N'):
        # "with" closes each file, so the pickled data is flushed to disk.
        with open('/home/wangyi/Dropbox/local/check_citation/citations.dat','wb') as db_file:
            pickle.dump(citations,db_file)
        with open('/home/wangyi/Dropbox/local/check_citation/last_citations.dat','wb') as db_file:
            pickle.dump(last_citations,db_file)
        print ('Citations saved.')
        return
    print ('Citations not saved.')


def load_last_citations():
    try:
        db_file = open(
            '/home/wangyi/Dropbox/local/check_citation/citations.dat'
            ,'rb')
    except IOError:
        # First run: nothing has been saved yet.
        return []
    with db_file:
        return pickle.load(db_file)


def compare(citations, last_citations):
    newsQ = False
    for item in citations:
        match = False
        for last_item in last_citations:
            if item[1] == last_item[1]:
                match = True
                break
        if match == False:
            print("New paper: ", item[0])
            newsQ = True
            continue
        if item[2] != last_item[2]:
            print (last_item[2]," => ", item[2]," in ", item[0][:60])
            newsQ = True
    return newsQ

def print_stat(citations):
    for n in range(len(citations)):
        print (n+1, "(", citations[n][2],") ",citations[n][0][:60])
    print()
    print("Papers with citations: ", len(citations), 
          ";  Citations: ", sum_citations(citations))
    print()


##### main #####

htm_string = get_htm_string(sys.argv[1])
citations = get_citations(htm_string)
print_stat(citations)

last_citations = load_last_citations()
newsQ = compare(citations, last_citations)
if newsQ == True:
    save_citations(citations, last_citations)
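
The compare function above scans the whole saved list for every paper. Here is a sketch of an alternative that looks records up in a dictionary keyed by e-print number instead. It assumes the same (title, e-print, count) tuples that get_citations returns; diff_citations is a hypothetical name, and the e-print numbers in the example are placeholders.

```python
def diff_citations(current, previous):
    """Return (new_papers, changed) for lists of (title, eprint, count)."""
    # Map each previously seen e-print number to its saved count.
    old_counts = {eprint: count for _, eprint, count in previous}
    new_papers, changed = [], []
    for title, eprint, count in current:
        if eprint not in old_counts:
            new_papers.append(title)
        elif old_counts[eprint] != count:
            changed.append((title, old_counts[eprint], count))
    return new_papers, changed


# Placeholder e-print ids; counts stay strings, as the regex captures them.
previous = [("Dark Energy", "arXiv:0000.00001", "110")]
current = [("Dark Energy", "arXiv:0000.00001", "111"),
           ("Chain Inflation Reconsidered", "arXiv:0000.00002", "2")]

new_papers, changed = diff_citations(current, previous)
for title in new_papers:
    print("New paper: ", title)
for title, old, new in changed:
    print(old, " => ", new, " in ", title[:60])
```

With about 40 papers the difference in speed is not noticeable; the dictionary mainly avoids relying on the inner loop variable after the loop has ended.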
