lang2.zip (about 266 MB)
File | Size (Billions of Bytes) | Description |
---|---|---|
map.old_to_new.i | 9.28 | A sequence of np.int32 (maps corpus ids to "new" ids) |
map.new_to_old.i | 0.48 | A sequence of np.int32 (maps "new" ids to corpus ids) |
embedding.f | 133.46 | A sequence of np.float32 |
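
The maps can be loaded into Python with this:

    import numpy as np
    import os

    def map_int32(fn):
        # Memory-map the file as a flat, read-only array of int32 values (4 bytes each)
        fn_len = os.path.getsize(fn)
        return np.memmap(fn, dtype=np.int32, shape=(fn_len // 4,), mode='r')

    old_to_new = map_int32('map.old_to_new.i')
    new_to_old = map_int32('map.new_to_old.i')
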
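As a quick sanity check on the two maps, the sketch below takes some new ids, maps them to corpus ids and back, and confirms that the same new ids come out. It assumes the two files are inverses of each other on the new ids, which the descriptions above suggest but do not state outright. `check_round_trip`, the toy arrays, and the `-1` marker are hypothetical and used only for illustration.

```python
import numpy as np

def check_round_trip(old_to_new, new_to_old, n_samples=1000, seed=0):
    """Return True if old_to_new[new_to_old[nid]] == nid for the checked nids.

    Assumption: every new id maps to a valid corpus id, and that corpus id
    maps back to the same new id.
    """
    n = len(new_to_old)
    if n_samples >= n:
        nids = np.arange(n)  # small map: check every new id
    else:
        rng = np.random.default_rng(seed)
        # np.unique drops duplicates and sorts, so memmap reads stay mostly sequential
        nids = np.unique(rng.integers(0, n, size=n_samples))
    oids = np.asarray(new_to_old[nids])
    return bool(np.array_equal(np.asarray(old_to_new[oids]), nids))

# Toy stand-ins for the real memmaps: 3 new ids spread over 6 corpus ids.
toy_new_to_old = np.array([4, 0, 2], dtype=np.int32)
# -1 is a made-up "unmapped" marker, not something the files are known to use.
toy_old_to_new = np.array([1, -1, 2, -1, 0, -1], dtype=np.int32)
round_trip_ok = check_round_trip(toy_old_to_new, toy_new_to_old)
```

With the real files, the call would be `check_round_trip(old_to_new, new_to_old)`, which samples ids rather than scanning the whole map.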
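The floats can be loaded into Python with this:

    def map_float32(fn):
        # Memory-map the file as a flat, read-only array of float32 values (4 bytes each)
        fn_len = os.path.getsize(fn)
        return np.memmap(fn, dtype=np.float32, shape=(fn_len // 4,), mode='r')

    # View the flat array as one 280-dimensional row per new id
    M = map_float32('embedding.f').reshape(-1, 280)
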
M has one row per new id, and each row is a vector with 280 hidden dimensions.
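The reshape to 280 columns only succeeds if the file size is an exact multiple of one row (280 float32 values, or 1,120 bytes). A minimal sketch of that arithmetic, assuming 4-byte floats and the 280-dimension layout described here; `expected_rows` is a hypothetical helper name:

```python
def expected_rows(fn_len, dims=280, bytes_per_value=4):
    """Number of rows implied by a file of fn_len bytes; raises if the size does not fit."""
    row_bytes = dims * bytes_per_value
    if fn_len % row_bytes != 0:
        raise ValueError(f"{fn_len} bytes is not a multiple of {row_bytes} bytes per row")
    return fn_len // row_bytes

# A file holding exactly three rows is 3 * 1120 bytes long.
example_rows = expected_rows(3 * 1120)
```

For the real file, `expected_rows(os.path.getsize('embedding.f'))` should match `M.shape[0]`.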
Let oid and nid be the old and new ids, respectively. To obtain the vector for a corpus id (oid), use the old_to_new map to convert the oid to a nid, and then select that row from M with M[nid,:].
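The lookup can be wrapped in a small helper. This is a sketch, not part of the distributed files: `vector_for_corpus_id` is a hypothetical name, and because the source does not say how corpus ids without a vector are encoded, the helper simply treats any out-of-range id as missing.

```python
import numpy as np

def vector_for_corpus_id(oid, old_to_new, M):
    """Return the embedding row for corpus id `oid`, or None if no row can be found."""
    if not 0 <= oid < len(old_to_new):
        return None  # corpus id falls outside the map
    nid = int(old_to_new[oid])
    if not 0 <= nid < M.shape[0]:
        return None  # assumption: an out-of-range nid means "no vector"
    return np.asarray(M[nid, :])

# Toy stand-ins for the real memmaps (2 dimensions instead of 280).
toy_old_to_new = np.array([1, -1, 0], dtype=np.int32)
toy_M = np.arange(4, dtype=np.float32).reshape(2, 2)
toy_vec = vector_for_corpus_id(0, toy_old_to_new, toy_M)  # row 1 of toy_M
```

With the real data, `vector_for_corpus_id(oid, old_to_new, M)` returns a 280-element float32 array for a mapped corpus id.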
To check your work, compare this with:

    M[old_to_new[9558665],:]