Commit 154726d8 authored by pabvald

No-preprocessing results and rankings computed

parent 50e8bfc1
%% Cell type:markdown id: tags:
# Comparison of Word2Vec, GloVe and FastText for computing the semantic similarity between pairs of texts
## Pablo Valdunciel Sánchez
%% Cell type:markdown id: tags:
## 1. Models
%% Cell type:markdown id: tags:
We use the *KeyedVectors* model from the [*gensim*](https://radimrehurek.com/gensim/index.html) library to load the pre-trained vectors of the different models.
%% Cell type:code id: tags:
``` python
from gensim.models import KeyedVectors
```
%% Cell type:markdown id: tags:
We load the pre-trained vectors of the Word2Vec, GloVe and FastText models. The pre-trained vectors used for each model are:
- **Word2Vec**: [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
- **GloVe**: [Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors)](http://nlp.stanford.edu/data/glove.840B.300d.zip)
- **FastText**: [crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens)](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip)
For the GloVe vectors, the [*gensim.scripts.glove2word2vec.glove2word2vec*](https://radimrehurek.com/gensim/scripts/glove2word2vec.html) function was used to convert the file to the Word2Vec format.
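The GloVe text format and the Word2Vec text format differ essentially in one header line, `<vocab_size> <dimension>`, that the Word2Vec format expects at the top of the file. The following is a minimal standard-library sketch of that idea, not the gensim implementation; the function name `glove_to_word2vec` and the toy file names are hypothetical, and it assumes space-separated fields with one token per line.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header."""
    vocab_size, dim = 0, 0
    # First pass: count the lines and read the dimension from the first one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                # A line is "<token> <v1> ... <vd>", so dim = fields - 1.
                dim = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then stream the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dim


# Toy demonstration with two 3-dimensional vectors (made-up values).
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_glove.w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

converted = glove_to_word2vec(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
```

Reading the file twice avoids holding the whole vocabulary in memory, which matters for files with millions of vectors.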
%% Cell type:code id: tags:
``` python
PATH_WORD2VEC = './data/embedding/word2vec/GoogleNews-vectors-negative300.bin'
PATH_GLOVE = './data/embedding/glove/glove.840B.300d.w2v.txt'
PATH_FASTTEXT = './data/embedding/fasttext/crawl-300d-2M.vec'
```
%% Cell type:markdown id: tags:
Loading the vectors can take several minutes.
%% Cell type:code id: tags:
``` python
word2vec = KeyedVectors.load_word2vec_format(PATH_WORD2VEC, binary=True)
```
%% Cell type:code id: tags:
``` python
glove = KeyedVectors.load_word2vec_format(PATH_GLOVE, binary=False)
```
%% Cell type:code id: tags:
``` python
fasttext = KeyedVectors.load_word2vec_format(PATH_FASTTEXT, binary=False)
```
%% Cell type:markdown id: tags:
## 2. Data
%% Cell type:markdown id: tags:
Using the functions of the *load.py* module, we load the test sets of the STS12, STS13, STS14, STS15 and STS16 tasks. These functions preprocess the sentences according to the given parameters. The available preprocessing options are:
- **lowercase**: convert all words to lowercase.
- **stop_words**: remove words that carry almost no semantic meaning, such as determiners, prepositions, etc.
- **punctuation**: remove punctuation symbols.
- **only_ascii**: remove words that are not made up of ASCII characters.
- **lemmatization**: replace words with their lemma.
Text preprocessing is implemented in the *preprocess* function of the *utils.py* module, which uses the [spaCy](https://spacy.io/) library.
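To make the effect of these flags concrete, here is a rough pure-Python sketch. It is not the *preprocess* function from *utils.py* (which relies on spaCy and is not shown in this notebook): the name `toy_preprocess`, the whitespace tokenisation and the tiny stop-word list are all assumptions made for illustration, and lemmatization is left out because it needs a language model.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much larger.
TOY_STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "to"}


def toy_preprocess(sentence, lowercase=False, stop_words=False,
                   punctuation=False, only_ascii=False):
    """Apply the selected preprocessing steps to a whitespace-tokenised sentence."""
    tokens = sentence.split()
    if punctuation:
        # Strip leading/trailing punctuation and drop tokens left empty.
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    if only_ascii:
        # Drop tokens containing any non-ASCII character.
        tokens = [t for t in tokens if t.isascii()]
    if stop_words:
        tokens = [t for t in tokens if t.lower() not in TOY_STOP_WORDS]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens


sentence = "The café is in Paris, France!"
untouched = toy_preprocess(sentence)  # every flag off: tokens are kept as-is
example = toy_preprocess(sentence, lowercase=True, stop_words=True,
                         punctuation=True, only_ascii=True)
print(untouched)
print(example)
```

With every flag set to `False`, as in the configuration used in this notebook, the tokens pass through unchanged.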
%% Cell type:code id: tags:
``` python
from load import load_sts_12, load_sts_13, load_sts_14, load_sts_15, load_sts_16
from load import load_frequencies
```
%% Cell type:markdown id: tags:
In this case no preprocessing is applied.
%% Cell type:code id: tags:
``` python
PATH_DATASETS = './data/datasets/STS'
PATH_FREQUENCIES = './data/frequencies.tsv'
PREPROCESSING = {
    'lowercase': False,
    'stop_words': False,
    'punctuation': False,
    'only_ascii': False,
    'lemmatization': False,
}
```
%% Cell type:markdown id: tags:
We also load the frequencies of the words in the corpus so that SIF (Smooth Inverse Frequency) weighting can be applied.
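SIF weighting gives each word a weight `a / (a + p(w))`, where `p(w)` is the word's relative frequency, so frequent words count for less. The sketch below shows how such weights could be derived from raw counts; it assumes the frequencies are a word-to-count mapping, which may not match what `load_frequencies` actually returns, and the name `sif_weights` and the toy counts are hypothetical.

```python
def sif_weights(counts, a=0.001):
    """Map each word to its SIF weight a / (a + p(w))."""
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


# Made-up counts: "the" is very frequent, "purrs" is rare.
toy_counts = {"the": 900, "cat": 90, "purrs": 10}
weights = sif_weights(toy_counts)
for word, weight in weights.items():
    print(f"{word}: {weight:.4f}")
```

The value `a=0.001` mirrors the parameter passed to `sif_cosine` later in this notebook.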
%% Cell type:code id: tags:
``` python
freqs = load_frequencies(PATH_FREQUENCIES)
```
%% Cell type:code id: tags:
``` python
sts12 = load_sts_12(PATH_DATASETS, PREPROCESSING)
sts13 = load_sts_13(PATH_DATASETS, PREPROCESSING)
sts14 = load_sts_14(PATH_DATASETS, PREPROCESSING)
sts15 = load_sts_15(PATH_DATASETS, PREPROCESSING)
sts16 = load_sts_16(PATH_DATASETS, PREPROCESSING)
```
%% Output
***** TASK: STS12 *****
Preprocessing -MSRpar-
-MSRpar- preprocessed correctly
Preprocessing -MSRvid-
-MSRvid- preprocessed correctly
Preprocessing -SMTeuroparl-
-SMTeuroparl- preprocessed correctly
Preprocessing -surprise.OnWN-
-surprise.OnWN- preprocessed correctly
Preprocessing -surprise.SMTnews-
-surprise.SMTnews- preprocessed correctly
***** TASK: STS13 (-SMT) ***
Preprocessing -FNWN-
-FNWN- preprocessed correctly
Preprocessing -headlines-
-headlines- preprocessed correctly
Preprocessing -OnWN-
-OnWN- preprocessed correctly
***** TASK: STS14 *****
Preprocessing -deft-forum-
-deft-forum- preprocessed correctly
Preprocessing -deft-news-
-deft-news- preprocessed correctly
Preprocessing -headlines-
-headlines- preprocessed correctly
Preprocessing -images-
-images- preprocessed correctly
Preprocessing -OnWN-
-OnWN- preprocessed correctly
Preprocessing -tweet-news-
-tweet-news- preprocessed correctly
***** TASK: STS15 *****
Preprocessing -answers-forums-
-answers-forums- preprocessed correctly
Preprocessing -answers-students-
-answers-students- preprocessed correctly
Preprocessing -belief-
-belief- preprocessed correctly
Preprocessing -headlines-
-headlines- preprocessed correctly
Preprocessing -images-
-images- preprocessed correctly
***** TASK: STS16 *****
Preprocessing -answer-answer-
-answer-answer- preprocessed correctly
Preprocessing -headlines-
-headlines- preprocessed correctly
Preprocessing -plagiarism-
-plagiarism- preprocessed correctly
Preprocessing -postediting-
-postediting- preprocessed correctly
Preprocessing -question-question-
-question-question- preprocessed correctly
%% Cell type:markdown id: tags:
## 3. Methods
%% Cell type:markdown id: tags:
The methods used to compute the semantic similarity between two sentences are:
- **avg_cosine**: the vector of a sentence is obtained by averaging the vectors of its words. The similarity between two sentence vectors is computed with cosine similarity.
- **sif_cosine**: the vector of a sentence is obtained as a weighted average of the vectors of its words, using SIF (Smooth Inverse Frequency) weights derived from the word frequencies. The similarity between two sentence vectors is computed with cosine similarity.
- **wmd**: the similarity between two sentences is computed as the opposite of the *Word Mover's Distance* between them (the smaller the distance, the higher the similarity). The *KeyedVectors* model of *gensim* includes the computation of this distance.
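The averaging approach can be sketched in a few lines of plain Python. This is not the code of the *methods* module (which is not shown here): the names `average_vector`, `cosine` and `toy_avg_cosine`, the toy 2-dimensional vectors, and the choice to skip out-of-vocabulary words are assumptions for illustration only.

```python
import math


def average_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_avg_cosine(tokens_a, tokens_b, vectors):
    """Cosine similarity between the averaged vectors of two token lists."""
    vec_a = average_vector(tokens_a, vectors)
    vec_b = average_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


# Made-up 2-dimensional "embeddings".
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
same = toy_avg_cosine(["cat", "dog"], ["cat", "dog"], toy_vectors)
different = toy_avg_cosine(["cat"], ["car"], toy_vectors)
print(same, different)
```

A weighted variant would multiply each word vector by a per-word weight before averaging.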
%% Cell type:code id: tags:
``` python
from functools import partial
from methods import avg_cosine, wmd, sif_cosine
```
%% Cell type:code id: tags:
``` python
# Each entry pairs a display name with a similarity function bound to a model.
METHODS = [
    ("Word2Vec + AVG", partial(avg_cosine, model=word2vec)),
    ("Word2Vec + SIF", partial(sif_cosine, model=word2vec, frequencies=freqs, a=0.001)),
    ("Word2Vec + WMD", partial(wmd, model=word2vec)),
    ("GloVe + AVG", partial(avg_cosine, model=glove)),
    ("GloVe + SIF", partial(sif_cosine, model=glove, frequencies=freqs, a=0.001)),
    ("GloVe + WMD", partial(wmd, model=glove)),
    ("FastText + AVG", partial(avg_cosine, model=fasttext)),
    ("FastText + SIF", partial(sif_cosine, model=fasttext, frequencies=freqs, a=0.001)),
    ("FastText + WMD", partial(wmd, model=fasttext)),
]
```
%% Cell type:markdown id: tags:
## 4. Evaluation
%% Cell type:code id: tags:
``` python
from utils import evaluate
import pprint
```
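The `evaluate` function from *utils.py* is not shown in this notebook; the variable names used below suggest that it returns Pearson and Spearman correlations per method. As a rough illustration of the first of these, the sketch below computes the Pearson correlation between gold scores and predicted similarities. The function name `pearson` and all the scores are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up gold scores on the 0-5 STS scale and made-up model similarities.
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
r = pearson(gold, predicted)

# A perfectly linear relation yields a correlation of 1 (up to rounding).
perfect = pearson(gold, [2 * g + 1 for g in gold])
print(r, perfect)
```

Because Pearson correlation is invariant to linear rescaling, the predicted similarities do not need to be on the same 0-5 scale as the gold scores.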
%% Cell type:code id: tags:
``` python
# Evaluate every method on each task; returns Pearson and Spearman results.
sts12_pearson, sts12_spearman = evaluate(sts12, METHODS)
sts13_pearson, sts13_spearman = evaluate(sts13, METHODS)
sts14_pearson, sts14_spearman = evaluate(sts14, METHODS)
sts15_pearson, sts15_spearman = evaluate(sts15, METHODS)
sts16_pearson, sts16_spearman = evaluate(sts16, METHODS)
```
%% Cell type:code id: tags:
``` python
print("\n++++ Task STS12 ++++")
pprint.pprint(sts12_pearson, width=1)
print("\n++++ Task STS13 ++++\n")
pprint.pprint(sts13_pearson, width=1)
print("\n++++ Task STS14 ++++\n")
pprint.pprint(sts14_pearson, width=1)
print("\n++++ Task STS15 ++++\n")
pprint.pprint(sts15_pearson, width=1)
print("\n++++ Task STS16 ++++\n")
pprint.pprint(sts16_pearson, width=1)
```
%% Output
++++ Task STS12 ++++
{'FastText + AVG': 0.6005548609319203,
 'FastText + SIF': 0.6212579120092905,
 'FastText + WMD': 0.5535246774015393,
 'GloVe + AVG': 0.550325345521787,
 'GloVe + SIF': 0.5887005481387919,
 'GloVe + WMD': 0.5511226507959358,
 'Word2Vec + AVG': 0.5576731761229754,
 'Word2Vec + SIF': 0.5675778156670679,
 'Word2Vec + WMD': 0.4735133931943548}
++++ Task STS13 ++++
{'FastText + AVG': 0.6395854651348138,
 'FastText + SIF': 0.743114999290766,
 'FastText + WMD': 0.5043251957475268,
 'GloVe + AVG': 0.5430947980396821,
 'GloVe + SIF': 0.7003751891752676,
 'GloVe + WMD': 0.48648450234493545,
 'Word2Vec + AVG': 0.6402354732298344,
 'Word2Vec + SIF': 0.7226844714342786,
 'Word2Vec + WMD': 0.5212277588700386}
++++ Task STS14 ++++
{'FastText + AVG': 0.666441560463602,
 'FastText + SIF': 0.7356129023314942,
 'FastText + WMD': 0.5920864958135631,
 'GloVe + AVG': 0.5624817686550272,
 'GloVe + SIF': 0.7068915343232387,
 'GloVe + WMD': 0.5803043426949426,
 'Word2Vec + AVG': 0.6867847666634713,
 'Word2Vec + SIF': 0.7279177178075755,
 'Word2Vec + WMD': 0.6121970541331052}
++++ Task STS15 ++++
{'FastText + AVG': 0.6998534252993311,
 'FastText + SIF': 0.761278289996281,
 'FastText + WMD': 0.6887872169458182,
 'GloVe + AVG': 0.6006241865061205,
 'GloVe + SIF': 0.7319276750876433,
 'GloVe + WMD': 0.6796406524692465,
 'Word2Vec + AVG': 0.7048782478026994,
 'Word2Vec + SIF': 0.7489534387403396,
 'Word2Vec + WMD': 0.6838824012601197}
++++ Task STS16 ++++
{'FastText + AVG': 0.6326237499753964,
 'FastText + SIF': 0.7315395720676849,
 'FastText + WMD': 0.6312212885648231,
 'GloVe + AVG': 0.5025433957938354,
 'GloVe + SIF': 0.6846682589318798,
 'GloVe + WMD': 0.6132747807563322,
 'Word2Vec + AVG': 0.6390942371456013,
 'Word2Vec + SIF': 0.7190592212454177,
 'Word2Vec + WMD': 0.6408015822267789}
%% Cell type:code id: tags:
``` python
```