Pitfalls encountered while running lda2vec.
https://github.com/cemoody/lda2vec

Running the example code

Preprocess

  • The gensim package must be installed. Under newer versions of gensim, pre-trained
    word vectors are loaded through KeyedVectors rather than the old
    Word2Vec.load_word2vec_format, so the loading line must be

    model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)

  • Update the pre-trained word vector model path (a combined sketch follows the links below):

    fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'

https://github.com/cemoody/lda2vec/issues/45
https://github.com/cemoody/lda2vec/pull/46
https://github.com/gfairchild/pyxDamerauLevenshtein/blob/f657916e4b0db18935c9e32f7dd3c98df95bc15a/CHANGES.md#L5
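
Putting the two fixes together, a minimal sketch of the patched loading step; the filename comes from the example, while the most_similar call at the end is just an assumed extra sanity check, not part of lda2vec:

    import gensim

    fn_wordvc = 'GoogleNews-vectors-negative300.bin.gz'

    # KeyedVectors replaces the old Word2Vec.load_word2vec_format call
    model = gensim.models.KeyedVectors.load_word2vec_format(fn_wordvc, binary=True)

    print(model['topic'].shape)                  # each GoogleNews vector is 300-dimensional
    print(model.most_similar('topic', topn=3))   # nearest neighbours should look sensible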

Train

  • You may need to confirm that “dill” is installed:

    pip install dill

https://stackoverflow.com/questions/51597073/python-no-module-named-dill-while-using-pickle-load
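
A minimal sketch of why this dependency appears, assuming the preprocessing step serialized some objects with dill: dill can pickle things (such as lambdas) that the stdlib pickle cannot, and the resulting stream references dill's own helpers, so a later plain pickle.load fails with “No module named dill” unless dill is installed:

    import dill
    import pickle

    with open('fn.pkl', 'wb') as f:
        dill.dump(lambda x: x + 1, f)   # dill serializes the lambda via its own helpers

    with open('fn.pkl', 'rb') as f:
        fn = pickle.load(f)             # needs dill importable to reconstruct the lambda
    print(fn(1))                        # 2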

  • You will need to downgrade chainer to version 1.24.0:

    pip install chainer==1.24.0

https://github.com/chainer/chainer/issues/4708
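
To confirm the downgrade took effect before training (lda2vec was written against the chainer 1.x API, and chainer 2.0 introduced breaking changes):

    import chainer
    assert chainer.__version__ == '1.24.0', chainer.__version__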

  • You must change some code in lda2vec’s preprocessing.py and then re-run “python setup.py install”
    Overwrite or replace the original function with:

    def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
                 **kwargs):
        """Tokenize documents with spaCy into a (len(texts), max_length) int32
        array of token IDs plus an {id: lowercased token} vocabulary dict."""
        if nlp is None:
            nlp = English()
        data = np.zeros((len(texts), max_length), dtype='int32')
        data[:] = skip
        bad_deps = ('amod', 'compound')
        for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
            if merge:
                # from the spaCy blog, an example on how to merge
                # noun phrases into single tokens
                for phrase in doc.noun_chunks:
                    # Only keep adjectives and nouns, e.g. "good ideas"
                    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                        phrase = phrase[1:]
                    if len(phrase) > 1:
                        # Merge the tokens, e.g. good_ideas
                        phrase.merge(phrase.root.tag_, phrase.text,
                                     phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
            dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
            if len(dat) > 0:
                dat = dat.astype('int32')
                msg = "Negative indices reserved for special tokens"
                assert dat.min() >= 0, msg
                # Replace email and URL tokens
                idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
                dat[idx] = skip
                length = min(len(dat), max_length)
                data[row, :length] = dat[:length, 0].ravel()
        uniques = np.unique(data)
        vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
        vocab[skip] = '<SKIP>'
        return data, vocab

    and make sure spaCy’s version is 1.9:

    pip install spacy==1.9
    python -m spacy download en

https://github.com/cemoody/lda2vec/issues/38
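
For reference, a minimal sketch of calling the patched function after reinstalling; the module path and argument names follow lda2vec's twenty_newsgroups example, and the sample texts are made up:

    from lda2vec import preprocess

    texts = [u'Good ideas deserve good tools.',
             u'Email me at someone@example.com']
    data, vocab = preprocess.tokenize(texts, max_length=10, merge=True)
    print(data.shape)         # (2, 10); short rows are padded with the skip index (-2)
    print(vocab[data[0, 0]])  # lowercased first token of the first document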