Wikipedia search using TFIDF

Term Frequency-Inverse Document Frequency (TF-IDF)

Unigrams: please, call, the, number, below, do, not, us
Bigrams: please call, call the, the number, number below, please do, do not, not call, call us

dimension = [2, 16]
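
The 16 terms and the [2, 16] dimension above can be reproduced with a short pure-Python sketch. The two example sentences are an assumption, reconstructed from the bigrams in the term list; the helper name ngrams is hypothetical.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumed example sentences, reconstructed from the term list above.
docs = ["please call the number below", "please do not call us"]

# Unigrams plus bigrams for every document.
doc_terms = []
for doc in docs:
    tokens = doc.lower().split()
    doc_terms.append(ngrams(tokens, 1) + ngrams(tokens, 2))

# Vocabulary in first-seen order (the order differs from the list above,
# but the set of terms is the same).
vocab = []
for terms in doc_terms:
    for term in terms:
        if term not in vocab:
            vocab.append(term)

# Count matrix: one row per document, one column per term.
matrix = [[terms.count(term) for term in vocab] for terms in doc_terms]
dimension = [len(matrix), len(vocab)]
print(dimension)
```

Each row of matrix holds the raw term counts for one sentence, which is the input that the TF-IDF weighting then rescales.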

Example of unigram TFIDF
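
A minimal sketch of unigram TF-IDF on two short sentences, in plain Python. The sentences are assumed (reconstructed from the term list above), and the smoothed IDF formula log((N + 1) / (df + 1)) is the one documented for Spark ML; other libraries use slightly different variants.

```python
import math
from collections import Counter

# Assumed example sentences, not taken from the wiki dataset.
docs = ["please call the number below", "please do not call us"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Map each term of one document to tf * log((N + 1) / (df + 1))."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf}

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, w)
```

Terms that occur in both sentences ("please", "call") get a weight of zero, while terms unique to one sentence get log(3 / 2), which is the behaviour that makes TF-IDF useful for ranking.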

Imports

import pandas as pd
import numpy as np
import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

SparkSession

spark = SparkSession.builder \
    .appName('tfidf')\
    .config('spark.jars', '../jars/snowflake-jdbc-3.13.6.jar, ../jars/spark-snowflake_2.12-2.9.0-spark_3.1.jar') \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
file_path = "../datasets/wiki.csv"

wiki = spark.read.format("csv").option("header", "true").load(file_path)
wiki.show()
+---+--------------------+-------------------+--------------------+
| ID|               Title|               Time|            Document|
+---+--------------------+-------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|"Anarchism (somet...|
| 25|              Autism|2008-12-24 20:41:05|"Autism is a brai...|
| 39|              Albedo|2008-12-29 18:19:09|"The albedo of an...|
|290|                   A|2008-12-27 04:33:16|"The letter A is ...|
|303|             Alabama|2008-12-29 08:15:47|"Alabama (formall...|
|305|            Achilles|2008-12-30 06:18:01|"thumb\n\nIn Gree...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|"Abraham Lincoln ...|
|308|           Aristotle|2008-12-29 23:54:48|"Aristotle (Greek...|
|309|An American in Paris|2008-09-27 19:29:28|"An American in P...|
|324|       Academy Award|2008-12-28 17:50:43|"The Academy Awar...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|
|332|     Animalia (book)|2008-12-18 11:12:34|thumbAnimalia (IS...|
|334|International Ato...|2008-11-21 22:40:20|International Ato...|
|336|            Altruism|2008-12-27 03:57:17|"Altruism is self...|
|339|            Ayn Rand|2008-12-30 08:03:06|"Ayn Rand (,  – M...|
|340|        Alain Connes|2008-09-03 13:41:39|Alain Connes (bor...|
|344|          Allan Dwan|2008-11-14 05:28:58|Allan Dwan (April...|
|358|             Algeria|2008-12-29 02:54:36|"Algeria (, al-Ja...|
|359|List of character...|2008-12-23 20:20:21|"This is a list o...|
|569|        Anthropology|2008-12-28 23:04:30|"Anthropology (, ...|
+---+--------------------+-------------------+--------------------+
only showing top 20 rows
wiki.filter(wiki.Document.isNull()).count()
1
wiki = wiki.filter(~wiki.Document.isNull())
wiki.show()
+---+--------------------+-------------------+--------------------+
| ID|               Title|               Time|            Document|
+---+--------------------+-------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|"Anarchism (somet...|
| 25|              Autism|2008-12-24 20:41:05|"Autism is a brai...|
| 39|              Albedo|2008-12-29 18:19:09|"The albedo of an...|
|290|                   A|2008-12-27 04:33:16|"The letter A is ...|
|303|             Alabama|2008-12-29 08:15:47|"Alabama (formall...|
|305|            Achilles|2008-12-30 06:18:01|"thumb\n\nIn Gree...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|"Abraham Lincoln ...|
|308|           Aristotle|2008-12-29 23:54:48|"Aristotle (Greek...|
|309|An American in Paris|2008-09-27 19:29:28|"An American in P...|
|324|       Academy Award|2008-12-28 17:50:43|"The Academy Awar...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|
|332|     Animalia (book)|2008-12-18 11:12:34|thumbAnimalia (IS...|
|334|International Ato...|2008-11-21 22:40:20|International Ato...|
|336|            Altruism|2008-12-27 03:57:17|"Altruism is self...|
|339|            Ayn Rand|2008-12-30 08:03:06|"Ayn Rand (,  – M...|
|340|        Alain Connes|2008-09-03 13:41:39|Alain Connes (bor...|
|344|          Allan Dwan|2008-11-14 05:28:58|Allan Dwan (April...|
|358|             Algeria|2008-12-29 02:54:36|"Algeria (, al-Ja...|
|359|List of character...|2008-12-23 20:20:21|"This is a list o...|
|569|        Anthropology|2008-12-28 23:04:30|"Anthropology (, ...|
+---+--------------------+-------------------+--------------------+
only showing top 20 rows
tokenizer = Tokenizer(inputCol="Document", outputCol="words")
wordsData = tokenizer.transform(wiki)
wordsData.show()
+---+--------------------+-------------------+--------------------+--------------------+
| ID|               Title|               Time|            Document|               words|
+---+--------------------+-------------------+--------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|"Anarchism (somet...|["anarchism, (som...|
| 25|              Autism|2008-12-24 20:41:05|"Autism is a brai...|["autism, is, a, ...|
| 39|              Albedo|2008-12-29 18:19:09|"The albedo of an...|["the, albedo, of...|
|290|                   A|2008-12-27 04:33:16|"The letter A is ...|["the, letter, a,...|
|303|             Alabama|2008-12-29 08:15:47|"Alabama (formall...|["alabama, (forma...|
|305|            Achilles|2008-12-30 06:18:01|"thumb\n\nIn Gree...|["thumb\n\nin, gr...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|"Abraham Lincoln ...|["abraham, lincol...|
|308|           Aristotle|2008-12-29 23:54:48|"Aristotle (Greek...|["aristotle, (gre...|
|309|An American in Paris|2008-09-27 19:29:28|"An American in P...|["an, american, i...|
|324|       Academy Award|2008-12-28 17:50:43|"The Academy Awar...|["the, academy, a...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|[actrius, (actres...|
|332|     Animalia (book)|2008-12-18 11:12:34|thumbAnimalia (IS...|[thumbanimalia, (...|
|334|International Ato...|2008-11-21 22:40:20|International Ato...|[international, a...|
|336|            Altruism|2008-12-27 03:57:17|"Altruism is self...|["altruism, is, s...|
|339|            Ayn Rand|2008-12-30 08:03:06|"Ayn Rand (,  – M...|["ayn, rand, (,, ...|
|340|        Alain Connes|2008-09-03 13:41:39|Alain Connes (bor...|[alain, connes, (...|
|344|          Allan Dwan|2008-11-14 05:28:58|Allan Dwan (April...|[allan, dwan, (ap...|
|358|             Algeria|2008-12-29 02:54:36|"Algeria (, al-Ja...|["algeria, (,, al...|
|359|List of character...|2008-12-23 20:20:21|"This is a list o...|["this, is, a, li...|
|569|        Anthropology|2008-12-28 23:04:30|"Anthropology (, ...|["anthropology, (...|
+---+--------------------+-------------------+--------------------+--------------------+
only showing top 20 rows
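
The words column above still carries punctuation (tokens such as "anarchism, and (som...), because Tokenizer only lowercases the text and splits it on whitespace. The sketch below illustrates the difference in plain Python on a made-up sample sentence. Inside the pipeline, RegexTokenizer from pyspark.ml.feature would be the usual replacement, but it is not used in this notebook, so treat this as an optional refinement.

```python
import re

def whitespace_tokens(text):
    # Approximates the output above: lowercase, then split on whitespace.
    return text.lower().split()

def word_tokens(text):
    # Alternative: split on runs of non-word characters and drop empties.
    return [t for t in re.split(r"\W+", text.lower()) if t]

# Hypothetical sample text, not taken from the dataset.
sample = '"Anarchism (sometimes) is a political philosophy.'

print(whitespace_tokens(sample))
print(word_tokens(sample))
```

With whitespace splitting, "anarchism and anarchism hash to different features, so a search for the plain keyword misses occurrences that touch punctuation.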
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featuredData = hashingTF.transform(wordsData)
featuredData.show()
+---+--------------------+-------------------+--------------------+--------------------+--------------------+
| ID|               Title|               Time|            Document|               words|         rawFeatures|
+---+--------------------+-------------------+--------------------+--------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|"Anarchism (somet...|["anarchism, (som...|(262144,[15157,27...|
| 25|              Autism|2008-12-24 20:41:05|"Autism is a brai...|["autism, is, a, ...|(262144,[15,1546,...|
| 39|              Albedo|2008-12-29 18:19:09|"The albedo of an...|["the, albedo, of...|(262144,[7853,240...|
|290|                   A|2008-12-27 04:33:16|"The letter A is ...|["the, letter, a,...|(262144,[6037,942...|
|303|             Alabama|2008-12-29 08:15:47|"Alabama (formall...|["alabama, (forma...|(262144,[1797,256...|
|305|            Achilles|2008-12-30 06:18:01|"thumb\n\nIn Gree...|["thumb\n\nin, gr...|(262144,[10758,16...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|"Abraham Lincoln ...|["abraham, lincol...|(262144,[2564,460...|
|308|           Aristotle|2008-12-29 23:54:48|"Aristotle (Greek...|["aristotle, (gre...|(262144,[2767,356...|
|309|An American in Paris|2008-09-27 19:29:28|"An American in P...|["an, american, i...|(262144,[2366,670...|
|324|       Academy Award|2008-12-28 17:50:43|"The Academy Awar...|["the, academy, a...|(262144,[2931,328...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|[actrius, (actres...|(262144,[6558,674...|
|332|     Animalia (book)|2008-12-18 11:12:34|thumbAnimalia (IS...|[thumbanimalia, (...|(262144,[2284,609...|
|334|International Ato...|2008-11-21 22:40:20|International Ato...|[international, a...|(262144,[847,925,...|
|336|            Altruism|2008-12-27 03:57:17|"Altruism is self...|["altruism, is, s...|(262144,[5675,680...|
|339|            Ayn Rand|2008-12-30 08:03:06|"Ayn Rand (,  – M...|["ayn, rand, (,, ...|(262144,[528,1091...|
|340|        Alain Connes|2008-09-03 13:41:39|Alain Connes (bor...|[alain, connes, (...|(262144,[154,1595...|
|344|          Allan Dwan|2008-11-14 05:28:58|Allan Dwan (April...|[allan, dwan, (ap...|(262144,[1578,181...|
|358|             Algeria|2008-12-29 02:54:36|"Algeria (, al-Ja...|["algeria, (,, al...|(262144,[3852,492...|
|359|List of character...|2008-12-23 20:20:21|"This is a list o...|["this, is, a, li...|(262144,[14376,19...|
|569|        Anthropology|2008-12-28 23:04:30|"Anthropology (, ...|["anthropology, (...|(262144,[57138,10...|
+---+--------------------+-------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
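
Each rawFeatures vector above has size 262144 (2 ** 18, the HashingTF default for numFeatures): every token is hashed to a bucket index and the counts per bucket become the term frequencies. The sketch below shows the idea in plain Python. It uses CRC32 as a stand-in hash, which is an assumption for illustration; Spark uses MurmurHash3, so the indices will not match the ones in the table.

```python
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 18  # 262144, the vector size shown in rawFeatures

def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash (CRC32), NOT Spark's MurmurHash3: indices differ from Spark.
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Return a sparse {bucket index: count} mapping, sorted by index."""
    counts = Counter(bucket(t, num_features) for t in tokens)
    return dict(sorted(counts.items()))

vec = hashing_tf(["the", "letter", "a", "is", "the"])
print(vec)
```

Because distinct tokens can land in the same bucket, a keyword search on hashed features may occasionally return an article that only contains a colliding term.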
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featuredData)
rescaledData = idfModel.transform(featuredData)
rescaledData.show()
+---+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
| ID|               Title|               Time|            Document|               words|         rawFeatures|            features|
+---+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|"Anarchism (somet...|["anarchism, (som...|(262144,[15157,27...|(262144,[15157,27...|
| 25|              Autism|2008-12-24 20:41:05|"Autism is a brai...|["autism, is, a, ...|(262144,[15,1546,...|(262144,[15,1546,...|
| 39|              Albedo|2008-12-29 18:19:09|"The albedo of an...|["the, albedo, of...|(262144,[7853,240...|(262144,[7853,240...|
|290|                   A|2008-12-27 04:33:16|"The letter A is ...|["the, letter, a,...|(262144,[6037,942...|(262144,[6037,942...|
|303|             Alabama|2008-12-29 08:15:47|"Alabama (formall...|["alabama, (forma...|(262144,[1797,256...|(262144,[1797,256...|
|305|            Achilles|2008-12-30 06:18:01|"thumb\n\nIn Gree...|["thumb\n\nin, gr...|(262144,[10758,16...|(262144,[10758,16...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|"Abraham Lincoln ...|["abraham, lincol...|(262144,[2564,460...|(262144,[2564,460...|
|308|           Aristotle|2008-12-29 23:54:48|"Aristotle (Greek...|["aristotle, (gre...|(262144,[2767,356...|(262144,[2767,356...|
|309|An American in Paris|2008-09-27 19:29:28|"An American in P...|["an, american, i...|(262144,[2366,670...|(262144,[2366,670...|
|324|       Academy Award|2008-12-28 17:50:43|"The Academy Awar...|["the, academy, a...|(262144,[2931,328...|(262144,[2931,328...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|[actrius, (actres...|(262144,[6558,674...|(262144,[6558,674...|
|332|     Animalia (book)|2008-12-18 11:12:34|thumbAnimalia (IS...|[thumbanimalia, (...|(262144,[2284,609...|(262144,[2284,609...|
|334|International Ato...|2008-11-21 22:40:20|International Ato...|[international, a...|(262144,[847,925,...|(262144,[847,925,...|
|336|            Altruism|2008-12-27 03:57:17|"Altruism is self...|["altruism, is, s...|(262144,[5675,680...|(262144,[5675,680...|
|339|            Ayn Rand|2008-12-30 08:03:06|"Ayn Rand (,  – M...|["ayn, rand, (,, ...|(262144,[528,1091...|(262144,[528,1091...|
|340|        Alain Connes|2008-09-03 13:41:39|Alain Connes (bor...|[alain, connes, (...|(262144,[154,1595...|(262144,[154,1595...|
|344|          Allan Dwan|2008-11-14 05:28:58|Allan Dwan (April...|[allan, dwan, (ap...|(262144,[1578,181...|(262144,[1578,181...|
|358|             Algeria|2008-12-29 02:54:36|"Algeria (, al-Ja...|["algeria, (,, al...|(262144,[3852,492...|(262144,[3852,492...|
|359|List of character...|2008-12-23 20:20:21|"This is a list o...|["this, is, a, li...|(262144,[14376,19...|(262144,[14376,19...|
|569|        Anthropology|2008-12-28 23:04:30|"Anthropology (, ...|["anthropology, (...|(262144,[57138,10...|(262144,[57138,10...|
+---+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
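def search_article(keyword):
    """Rank articles by the TF-IDF weight of a single keyword."""
    # Hash the keyword with the same HashingTF to find its feature index.
    # Tokenizer lowercases the documents, so lowercase the keyword as well.
    schema = StructType([StructField("words", ArrayType(StringType()))])
    temp = spark.createDataFrame([[[keyword.lower()]]], schema)
    hashed = hashingTF.transform(temp).select("rawFeatures").collect()
    val = int(hashed[0].rawFeatures.indices[0])
    # Pull the TF-IDF weight at that index out of each document's vector.
    termExtractor = udf(lambda x: float(x[val]), FloatType())
    final = rescaledData.withColumn('score', termExtractor(rescaledData.features))
    # Keep only documents with a non-zero weight, highest score first.
    final = final.filter("score > 0").orderBy("score", ascending=False)
    return final.select('ID', 'Title', 'score')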
search_article('mystery').show()
+----+--------------------+--------+
|  ID|               Title|   score|
+----+--------------------+--------+
| 984|     Agatha Christie|5.521461|
| 986|          The Plague|5.521461|
|1307|The Alan Parsons ...|5.521461|
+----+--------------------+--------+
search_article('comic').show()
+----+--------------------+----------+
|  ID|               Title|     score|
+----+--------------------+----------+
| 931|The Amazing Spide...|14.4849415|
|2101|             Asterix|  9.656628|
|1549|             Agathon|  9.656628|
|2023|           Aeschylus|  9.656628|
|1028|        Aristophanes|  9.656628|
|1614|              Alexis|  4.828314|
|1784|  Athenian democracy|  4.828314|
+----+--------------------+----------+
search_article('revolution').show()
+----+--------------------+---------+
|  ID|               Title|    score|
+----+--------------------+---------+
|1973| American Revolution|12.052151|
|2273|            AFC Ajax|4.0173836|
| 339|            Ayn Rand|4.0173836|
| 572|Agricultural science|4.0173836|
| 771|American Revoluti...|4.0173836|
| 915|       Andrey Markov|4.0173836|
| 930|       Alvin Toffler|4.0173836|
|1030|     Austrian School|4.0173836|
|1057|      Anatole France|4.0173836|
|1192| Artistic revolution|4.0173836|
|1316|      Annales School|4.0173836|
|1676|Alfonso XII of Spain|4.0173836|
|1363|  André-Marie Ampère|4.0173836|
|2075|  Aircraft hijacking|4.0173836|
|1784|  Athenian democracy|4.0173836|
|1844|          Archimedes|4.0173836|
|2070|Act of Settlement...|4.0173836|
+----+--------------------+---------+
search_article('football').show()
+----+--------------------+---------+
|  ID|               Title|    score|
+----+--------------------+---------+
|2273|            AFC Ajax|54.596165|
|2357|American Football...|46.196754|
|2174|        Arsenal F.C.|29.397936|
|2358|           A.S. Roma| 25.19823|
|2102|   Arizona Cardinals|20.998526|
|2103|     Atlanta Falcons| 16.79882|
| 615|American Football...| 16.79882|
| 925|Alumni Athletic Club|12.599115|
|2289|  AZ (football club)| 4.199705|
|2310|       Arthur Miller| 4.199705|
|1797|                Acre| 4.199705|
|2363|Alessandro Scarlatti| 4.199705|
|2382|               Aalen| 4.199705|
|1016|       Achill Island| 4.199705|
+----+--------------------+---------+
search_article('emirates').show()
+----+------------+--------+
|  ID|       Title|   score|
+----+------------+--------+
|2174|Arsenal F.C.|6.214608|
+----+------------+--------+
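
The scores in the tables above are term frequency times IDF, where Spark ML documents the IDF as log((N + 1) / (df + 1)). The sketch below checks a few of them by hand. The corpus size of 999 documents is an assumption inferred from the scores themselves (it is not stated in the notebook), and the term frequencies are likewise inferred; the document frequencies are the row counts of the result tables.

```python
import math

N_DOCS = 999  # assumption: inferred from the scores, not stated in the notebook

def spark_idf(doc_freq, n_docs=N_DOCS):
    """Smoothed IDF as documented for Spark ML: log((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))

def score(term_freq, doc_freq):
    """TF-IDF weight of one term in one document."""
    return term_freq * spark_idf(doc_freq)

# 'mystery': 3 matching articles, assumed tf = 1 (table shows 5.521461)
print(score(1, 3))
# 'emirates': 1 matching article, assumed tf = 1 (table shows 6.214608)
print(score(1, 1))
# 'comic': 7 matching articles, top row assumed tf = 3 (table shows 14.4849415)
print(score(3, 7))
```

This also explains why rows within a table share values: articles in which the keyword occurs the same number of times get identical scores, since the IDF factor is the same for every row.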
search_article('the').show()
+----+--------------------+---------+
|  ID|               Title|    score|
+----+--------------------+---------+
|1854| Geography of Africa|56.093544|
|2273|            AFC Ajax|43.326492|
|2023|           Aeschylus|41.968296|
|1216|              Athens|30.287798|
| 717|             Alberta|26.213207|
|2358|           A.S. Roma|23.904272|
| 841|      Attila the Hun|23.360992|
|1285|Geography of Alabama|23.089354|
|2338|Rise and Fall of ...|21.323696|
|1440|       Abydos, Egypt|19.150581|
| 904|           Aluminium| 18.87894|
|1905|              Ambush|18.199842|
|1962|  Apparent magnitude|17.928204|
|1557|Agrippina the You...|17.792383|
|1613|  Alexios I Komnenos|17.792383|
|1234|     Acoustic theory|17.520744|
|2064|      Antonio Canova|15.619268|
|1686| Alfonso V of Aragon| 15.07599|
|1451|APL (programming ...| 15.07599|
|2274|Arthur Stanley Ed...| 14.80435|
+----+--------------------+---------+
only showing top 20 rows