frflccg/Report.ipynb

484 lines
18 KiB
Plaintext
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.ccg.lexicon import CCGLexicon, Token, augParseCategory\n",
"from nltk.ccg.chart import CCGChart,CCGLeafEdge,BinaryCombinatorRule,CCGEdge,CCGChartParser\n",
"from nltk.ccg.chart import compute_semantics,printCCGDerivation\n",
"from nltk.ccg.combinator import *\n",
"from nltk.tree import Tree\n",
"from nltk.sem.logic import Expression\n",
"from numbers import Number\n",
"import pandas as pd\n",
"import random\n",
"from functools import reduce"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Creating the Lexicon\n",
"\n",
"The lexicon has first been created in a libreoffice spreadsheet, and is under the file `ccg.ods`.\n",
"For each word of the corpus, we gave one, two or three categories in the columns Cat0, Cat1 and Cat2. We also indicated the «classical french grammar name» of the word in order to help us find the catégories. We also did group some words together (with slashes), for example donne/mange have the same catégories, same for le/la/un/mon/ses.\n",
"\n",
"In order to do the rest of the project, we also added a *weight* column for each word group/category couple, and we also added a *semantics* column for each word/category couple, so we could assignate a semantics to be read by the program afterwards.\n",
"\n",
"This allowed us a clean reading, editing and tuning of every parameter of our lexicon.\n",
"\n",
"The spreadsheet is then directly read into the program"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.a Robustness of the grammar\n",
"Our grammar is really simple, as it has really few catégories. However, every category has been highly tuned for its purpose, so very few agrammatical sentences slip through. However we did not take into account tenses, genre and plurals by choice, so some phrases that are parsed may be agrammatical in that regard - for instance, \"il le attrappe\" or \"est (elle souhaite ses fromage)\" or \"le souris\" are parsed. But for what we consider a successful parse, it is a strict grammar.\n",
"\n",
"## 2.b Ambiguity\n",
"The main cause of ambiguity is that there is very few catégories, therefore there is a lot of derivation trees going to the same goal. For example, because we do not differentiate *(méchant chat) noir* and *méchant (chat noir)* because both correspond to a reduction (pN pN pN -> pN pN -> pN). Because we have so few catégories, we don't have things like «adjectives order» that would fix the order in which those trees are parsed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Implementation of the CKY algorithm\n",
"\n",
"This set of functions compiles into the function `bestTree` that returns the best derivation tree of set of tokens, the lexer we extracted from the spreadsheet, and the set of rules.\n",
"\n",
"We define the weight associated to each reduction rule.\n",
"`rweight(rule)` should return the weight associated to the rule, using its string representation (i.e. the name of the rule)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"valz = {\n",
" '>' : 0.8,\n",
" '<' : 0.8,\n",
" '<B' : 0.7,\n",
" '>B' : 0.7,\n",
" '<Bx' : 0.6,\n",
" '<Sx' : 0.6,\n",
" '<S' : 0.65,\n",
" '>Bx' : 0.6,\n",
" '>Sx' : 0.6,\n",
" '>S' : 0.65\n",
"}\n",
"def rweight(rule):\n",
" s = rule.__str__()\n",
" if s in valz:\n",
" return valz[s]\n",
" else:\n",
" print(\"Unknown rule\",s)\n",
" return 1.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`weightedParse` implements the CKY algorithm, based on the implementation in the nltk library.\n",
"We take the weight from the weighted lexicon for the leafs, and we compute it using the formula for each reduction rule.\n",
"$$ w_{node} = \\phi_r \\times w_{child1} \\times w_{child2}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Implements the CYK algorithm, code partly taken from nltk\n",
"def weightedParse(tokens, lex, rules):\n",
" chart = CCGChart(list(tokens))\n",
" \n",
" # Initialize leaf edges.\n",
" for index in range(chart.num_leaves()):\n",
" for token in lex.categories(chart.leaf(index)):\n",
" new_edge = CCGLeafEdge(index, token, chart.leaf(index))\n",
" new_edge.weight = token.weight()\n",
" chart.insert(new_edge, ())\n",
"\n",
" # Select a span for the new edges\n",
" for span in range(2, chart.num_leaves() + 1):\n",
" for start in range(0, chart.num_leaves() - span + 1):\n",
" \n",
" # edges[s] is the best edge generating the category s\n",
" edges = dict()\n",
" \n",
" # Try all possible pairs of edges that could generate\n",
" # an edge for that span\n",
" for part in range(1, span):\n",
" lstart = start\n",
" mid = start + part\n",
" rend = start + span\n",
" \n",
" # For every pair of edges in (lstart,mid) / (mid,rend).\n",
" # They could be multiple edges if they are two categories on the same span\n",
" for left in chart.select(span=(lstart, mid)):\n",
" for right in chart.select(span=(mid, rend)):\n",
" \n",
" # Generate all possible combinations of the two edges\n",
" for rule in rules:\n",
" \n",
" # Can we apply the rule\n",
" if rule.can_combine(left.categ(), right.categ()):\n",
" \n",
" # If so, we create a (potential) new edge for each new category\n",
" for res in rule.combine(left.categ(), right.categ()):\n",
" \n",
" # res is the new category\n",
" edge = CCGEdge(\n",
" span=(left.start(), right.end()),\n",
" categ=res,\n",
" rule=rule,\n",
" )\n",
" # We compute the weight of the edge\n",
" edge.weight = rweight(rule) * left.weight * right.weight\n",
" # And we log the information of where the triple comes from\n",
" edge.triple = (rule,left,right)\n",
" # We remember the heaviest edge for the specific category\n",
" if not(res in edges and edges[res].weight<=edge.weight):\n",
" edges[res] = edge\n",
" # end for rule loop\n",
" # end for right loop\n",
" # end for left loop\n",
" # end for part loop\n",
" # We add for each category the heaviest edge found\n",
" for cat in edges:\n",
" chart.insert(edges[cat], (edges[cat].triple[1], edges[cat].triple[2]))\n",
" \n",
" # We can return the chart we created\n",
" return chart"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def wp_to_tree(edge, wChart):\n",
" \"\"\"\n",
" This functions takes the biggest edge of a chart and the chart, and send back\n",
" the heaviest tree, that is, the one that generated the weightedParse function.\n",
" \"\"\"\n",
" if isinstance(edge,CCGLeafEdge):\n",
" word = Tree(edge.token(), [wChart._tokens[edge.start()]])\n",
" leaf = Tree((edge.token(), \"Leaf\"), [word])\n",
" return leaf\n",
" else:\n",
" children = [wp_to_tree(t, wChart) for t in (edge.triple[1:])]\n",
" lhs = Token(wChart._tokens[edge.start() : edge.end()],\n",
" edge.lhs(),\n",
" compute_semantics(children, edge))\n",
" return Tree((lhs,edge.triple[0].__str__()), children)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def bestTree(tokens, lex, rules):\n",
" wChart = weightedParse(tokens, lex, rules) # We build the weighgted parse tree using cky\n",
" edge = list(wChart.select(start=0,end=len(tokens)))[0] # We get the biggest edge\n",
" t = wp_to_tree(edge, wChart) # We get the tree that brought us to this edge\n",
" return (t,edge.weight)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Application of the algorithms and the grammar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Weighed Lexicon"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from warnings import warn\n",
"\n",
"class WeighedToken(Token):\n",
" def __init__(self, token, categ, semantics=None, weight = 1.0):\n",
" super().__init__(token, categ, semantics= semantics)\n",
" self._weight = weight\n",
" def weight(self):\n",
" \"\"\"1.0 is considered the default weight for any token\"\"\"\n",
" try:\n",
" return self._weight\n",
" except AttributeError:\n",
" warn(f\"[{self.token} : {str(self)}] : this token has no weight attribute, defaulted to 1.0.\")\n",
" return 1.0\n",
"\n",
"class WeighedLexicon(CCGLexicon):\n",
" def __init__(self, start, primitives, families, entries):\n",
" super().__init__(start, primitives, families, entries)\n",
"\n",
" def weight(self, entry):\n",
" return entry.weight()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading of the spreadsheet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def to_pseudo_entries(table, consider_semantics = False, categories_max_options = 4):\n",
" \"\"\"returns a list of lists in the format ['word', 'category', 'weight', None]\n",
" if consider_semantics == false else ['word', 'category', weight, 'semantic']\n",
" that is left to be converted into tokens by to_wlex_entries\"\"\"\n",
"\n",
" entries = list()\n",
" for line in range(len(table['MOT'])):\n",
" for wdi, word in enumerate(table['MOT'][line].replace(\" \", \"\").split('/')):\n",
" for j in range(categories_max_options):\n",
" if isinstance(table['Cat'+str(j)][line],str):\n",
" category = table['Cat'+str(j)][line]\n",
" weight = float(table['Weights'+str(j)][line]) if isinstance(table['Weights'+str(j)][line], Number) else 1.0\n",
" if consider_semantics:\n",
" semantic = (table['Sem'+str(j)][line].replace('\\\\\\\\', '\\\\').split('/'))[wdi]\n",
" else:\n",
" semantic = None\n",
" entries.append([word, category, weight, semantic])\n",
" return entries\n",
"\n",
"def to_wlex_entries(pseudo_entries, primitives, families, var=None):\n",
" \"\"\"returns the entries to a weighed lexicon from pseudo_entries generated by to_pseudo_entries\"\"\"\n",
" entries = dict()\n",
" for entry in pseudo_entries:\n",
" if entry[0] not in entries:\n",
" entries[entry[0]] = list()\n",
" categ, _ = augParseCategory(entry[1], primitives, families, var)\n",
" token = WeighedToken(token= entry[0],\n",
" categ= categ,\n",
" semantics= None if entry[-1] is None else Expression.fromstring(entry[-1]),\n",
" weight= entry[2])\n",
" entries[entry[0]].append(token)\n",
" return entries\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation of the lexicon"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Catégories primitives et familles\n",
"primitives = ['S', 'N', 'Pp', 'pN']\n",
"V = augParseCategory(\"S\\\\N\", primitives = primitives, families={})\n",
"families = {'V': V}\n",
"\n",
"# On importe notre lexique sous forme de tableur\n",
"table = pd.read_excel(\"ccg.ods\", engine=\"odf\")\n",
"# On le convertit en Lexique pondéré\n",
"pe = to_pseudo_entries(table, consider_semantics = True)\n",
"wEntries = to_wlex_entries(pseudo_entries= pe, primitives= primitives, families= families)\n",
"lex = WeighedLexicon(start= 'S', primitives= primitives, families= families, entries= wEntries)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation of the reduction rules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# On crée le parser, on donne l'ensemble des règles qu'il est sensé connaître\n",
"rulesC = [ForwardApplication,BackwardApplication] \n",
"rulesC += [ForwardComposition,BackwardComposition,BackwardBx]\n",
"rulesC += [ForwardSubstitution,BackwardSx]\n",
"rulesR = [BinaryCombinatorRule(c) for c in rulesC]\n",
"\n",
"parser = CCGChartParser(lex, rulesR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On lit les phrases depuis le fichier `phrases.txt`, et pour chacune, on imprime le nombre de dérivations trouvées, ainsi que le meilleur arbre de dérivation (i.e. de meilleur poids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading test sentences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# On lit les phrases dans le fichier\n",
"f = open('phrases.txt')\n",
"lines = f.readlines()\n",
"f.close()\n",
"\n",
"phrases = [p.lower().strip().split() for p in lines]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the above algorithms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for tokens in phrases:\n",
" \n",
" print(reduce(lambda x,y: x + \" \" + y,tokens, \"\"))\n",
" # On compte les arbres de dérivation trouvés\n",
" try:\n",
" i = len(list(parser.parse(tokens)))\n",
" except:\n",
" print(\"#SOME RANDOM ASSERT ERROR EVEN IF EVERYTHING WORKS FINE#\")\n",
" \n",
" print(\"Found\",i,\"derivations for sentence\",*tokens)\n",
"\n",
" # On affiche la dérivation la meilleure pour l'arbre\n",
" if (i != 0):\n",
" try:\n",
" t,d = bestTree(tokens, lex, rulesC)\n",
" except:\n",
" print(\"#SOME RANDOM ASSERT ERROR EVEN IF EVERYTHING WORKS FINE#\")\n",
" print(\"Best derivation tree has weight\",d)\n",
" print('\\n')\n",
" printCCGDerivation(t)\n",
" print('\\n')\n",
" print(\"#\"*42)\n",
" print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Randomized testing\n",
"\n",
"The code that we have used for testing strictness."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def import_words(table):\n",
" return list({word for line in range(len(table['MOT'])) for word in table['MOT'][line].replace(\" \", \"\").split('/')})\n",
"\n",
"def random_tests():\n",
" Words = import_words(table)\n",
" random_phrases = [reduce(lambda x,y: x + \" \" + y + \" \", random.sample(Words, random.sample(range(2,11), 1)[0]), \"\") for i in range(500)]\n",
"\n",
" parsed = list()\n",
" parses = dict()\n",
" unparsed = list()\n",
" for phr in random_phrases:\n",
" try:\n",
" t,d = bestTree(phr.split(), lex, rulesC)\n",
" parsed.append(phr)\n",
" parses[phr] = t\n",
" except:\n",
" unparsed.append(phr)\n",
"\n",
" print(\"=\"*50)\n",
" print(f\"found the following {len(parsed)} derivations:\")\n",
" for phr in parsed:\n",
" print(phr + \" :\")\n",
" printCCGDerivation(parses[phr])\n",
" print(\"=\"*50)\n",
" print(f'{len(unparsed)} are left unparsed :')\n",
" for phr in unparsed:\n",
" print(phr)\n",
"\n",
"random_tests()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(lex)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}