Pitfalls encountered while running lda2vec.
https://github.com/cemoody/lda2vec

Running the example code

Preprocess

  • The gensim package must be installed. Under newer versions of gensim, pre-trained
    word vectors are loaded through KeyedVectors rather than the old
    Word2Vec.load_word2vec_format, so the loading line must be

    model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)

  • Update the pre-trained word vector model path (a combined sketch follows the links below):

    fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'

https://github.com/cemoody/lda2vec/issues/45
https://github.com/cemoody/lda2vec/pull/46
https://github.com/gfairchild/pyxDamerauLevenshtein/blob/f657916e4b0db18935c9e32f7dd3c98df95bc15a/CHANGES.md#L5
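
Putting the two fixes together, a minimal sketch of the patched loading step; the filename comes from the example, while the most_similar call at the end is just an assumed extra sanity check, not part of lda2vec:

    import gensim

    fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'

    # KeyedVectors replaces the old Word2Vec.load_word2vec_format call
    model = gensim.models.KeyedVectors.load_word2vec_format(fn_wordvc, binary=True)

    print(model['topic'].shape)                  # each GoogleNews vector is 300-dimensional
    print(model.most_similar('topic', topn=3))   # nearest neighbours should look sensible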

Train

  • You may need to confirm that “dill” is installed:

    pip install dill

https://stackoverflow.com/questions/51597073/python-no-module-named-dill-while-using-pickle-load
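
A minimal sketch of why this dependency appears, assuming the preprocessing step serialized some objects with dill: dill can pickle things (such as lambdas) that the stdlib pickle cannot, and the resulting stream references dill's own helpers, so a later plain pickle.load fails with “No module named dill” unless dill is installed:

    import dill
    import pickle

    with open('fn.pkl', 'wb') as f:
        dill.dump(lambda x: x + 1, f)   # dill serializes the lambda via its own helpers

    with open('fn.pkl', 'rb') as f:
        fn = pickle.load(f)             # needs dill importable to reconstruct the lambda
    print(fn(1))                        # 2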

  • You will need to downgrade chainer to version 1.24.0:

    pip install chainer==1.24.0

https://github.com/chainer/chainer/issues/4708
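
To confirm the downgrade took effect before training (lda2vec was written against the chainer 1.x API, and chainer 2.0 introduced breaking changes):

    import chainer
    assert chainer.__version__ == '1.24.0', chainer.__version__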

  • You must change some code in lda2vec’s preprocessing.py and then re-run “python setup.py install”
    Overwrite or replace the original function with:

    def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
                 **kwargs):
        """Tokenize documents with spaCy into a (len(texts), max_length) int32
        array of token IDs plus an {id: lowercased token} vocabulary dict."""
        if nlp is None:
            nlp = English()
        data = np.zeros((len(texts), max_length), dtype='int32')
        data[:] = skip
        bad_deps = ('amod', 'compound')
        for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
            if merge:
                # from the spaCy blog, an example on how to merge
                # noun phrases into single tokens
                for phrase in doc.noun_chunks:
                    # Only keep adjectives and nouns, e.g. "good ideas"
                    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                        phrase = phrase[1:]
                    if len(phrase) > 1:
                        # Merge the tokens, e.g. good_ideas
                        phrase.merge(phrase.root.tag_, phrase.text,
                                     phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
            dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
            if len(dat) > 0:
                dat = dat.astype('int32')
                msg = "Negative indices reserved for special tokens"
                assert dat.min() >= 0, msg
                # Replace email and URL tokens
                idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
                dat[idx] = skip
                length = min(len(dat), max_length)
                data[row, :length] = dat[:length, 0].ravel()
        uniques = np.unique(data)
        vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
        vocab[skip] = '<SKIP>'
        return data, vocab

    and make sure spaCy’s version is 1.9:

    pip install spacy==1.9
    python -m spacy download en

https://github.com/cemoody/lda2vec/issues/38
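
For reference, a minimal sketch of calling the patched function after reinstalling; the module path and argument names follow lda2vec's twenty_newsgroups example, and the sample texts are made up:

    from lda2vec import preprocess

    texts = [u'Good ideas deserve good tools.',
             u'Email me at someone@example.com']
    data, vocab = preprocess.tokenize(texts, max_length=10, merge=True)
    print(data.shape)         # (2, 10); short rows are padded with the skip index (-2)
    print(vocab[data[0, 0]])  # lowercased first token of the first document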