Problems with lda2vec
Nov 12, 2018
Pitfalls I ran into while getting lda2vec to run.
https://github.com/cemoody/lda2vec
Running the example code
Preprocess
- Make sure these packages are installed:
  - pyxDamerauLevenshtein
  - gensim
  - h5py
- A function was renamed in a pyxDamerauLevenshtein update (https://github.com/gfairchild/pyxDamerauLevenshtein/blob/f657916e4b0db18935c9e32f7dd3c98df95bc15a/CHANGES.md#L5): damerau_levenshtein_distance_withNPArray became damerau_levenshtein_distance_ndarray. Update the deprecated call in corpus.py (see the sketch below), then re-run “python setup.py install”.
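A minimal sketch of that corpus.py change, assuming the old name shows up in an import like the one below (your checkout may also call it elsewhere):

# Before (breaks on newer pyxDamerauLevenshtein releases):
# from pyxdameraulevenshtein import damerau_levenshtein_distance_withNPArray
# After the rename:
from pyxdameraulevenshtein import damerau_levenshtein_distance_ndarray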
- Under newer versions of gensim, the line
model = Word2Vec.load_word2vec_format(filename, binary=True)
must be changed to
model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)
- Update the pre-trained word vector model path:
fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'
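Putting the two fixes together, a minimal loading sketch (assuming the vectors file sits in the working directory; the final print is just an illustration):

import gensim

fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'
model = gensim.models.KeyedVectors.load_word2vec_format(fn_wordvc, binary=True)
# Each Google News vector is 300-dimensional
print(model['king'].shape)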
https://github.com/cemoody/lda2vec/issues/45
https://github.com/cemoody/lda2vec/pull/46
https://github.com/gfairchild/pyxDamerauLevenshtein/blob/f657916e4b0db18935c9e32f7dd3c98df95bc15a/CHANGES.md#L5

Train
- You may need to confirm that “dill” is installed:
pip install dill
https://stackoverflow.com/questions/51597073/python-no-module-named-dill-while-using-pickle-load
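To confirm the install worked (the linked Stack Overflow thread is about pickle.load failing with “No module named dill”):

import dill
print(dill.__version__)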
- You will need to downgrade chainer to version 1.24.0:
pip install chainer==1.24.0
https://github.com/chainer/chainer/issues/4708
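After downgrading, a one-off sanity check (just a sketch, independent of lda2vec):

import chainer
assert chainer.__version__ == '1.24.0', chainer.__version__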
- You must change the code in lda2vec’s preprocessing.py, then re-run “python setup.py install”.
Overwrite the original tokenize function with the version below:
import numpy as np
from spacy.en import English
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    """Convert texts into a (n_texts, max_length) int32 array of spaCy
    attribute ids; email/URL tokens and padding are set to `skip`."""
    if nlp is None:
        nlp = English()
    data = np.zeros((len(texts), max_length), dtype='int32')
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            dat = dat.astype('int32')
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
and make sure spaCy’s version is 1.9:
pip install spacy==1.9
python -m spacy download en
https://github.com/cemoody/lda2vec/issues/38
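A minimal usage sketch of the replaced function (assumes spaCy 1.9 with the “en” model from the commands above; the sample texts are made up):

texts = [u'Good ideas spread quickly.', u'Mail me at someone@example.com']
data, vocab = tokenize(texts, max_length=10, merge=True)
# data is a (2, 10) int32 array; the email token and padding hold the
# skip value (-2), which maps to '<SKIP>' in vocab
print(data.shape, vocab[-2])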
- You must have a working CuPy installation that matches your CUDA version.
Follow the instructions at the link below to install it:
http://docs-cupy.chainer.org/en/stable/install.html#install-cupy
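Once installed, a quick sanity check that CuPy can reach the GPU (a sketch; any small array operation will do):

import cupy as cp

x = cp.arange(5)
print(x.sum())  # should print 10 without raising a CUDA error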