Category:I703 Python
Revision as of 09:01, 11 March 2016
General
The Python Course is 4 ECTS
Lecturer: Lauri Võsandi
E-mail: lauri [donut] vosandi [plus] i703 [ät] gmail [dotchka] com
- This is not a course for slacking off
- Deduplicate work - use the same stuff for Research Project I (Projekt I) course or combine it with Web Application Programming (Võrgurakendused I).
- I expect you to understand by now:
- Programming concepts: loops, functions, classes, etc.
- Networking fundamentals: UDP/TCP ports, logical/hardware address, hostname, domain
- Get along on the command line: cp, mv, mkdir, cd, ssh user@host, scp file user@host:
- Possible scenarios to pass the course:
- Scratch your own itch, the most preferred option, should keep you motivated and happy
- Create a local UI or agent for your PHP project's API
- Find a (mainly Python-based, open-source) project you want to help and prepare to participate in Google Summer of Code
- Prepare a scenario with some scripts for the Cyberolympics competition
- Pick something below and hope Lauri gets you a keg of beer
- Progress visible in Git at least throughout the second half-semester
- (Learn how to) use Google, I am not your tech support
- Of course I am there if you're stuck with some corner case or have issues understanding some concepts :)
- When asking for help please try to form properly phrased questions
- Help each other, socialize, have a beer event and ask me to join as well ;)
- If you're new to programming make sure you first follow the Python track at Codecademy, then continue with Learn Python the Hard Way. There are also videos about Python in general, pygame for game development and PyGTK for creating GUIs.
- If you need more practice, attend CodeClub at Mektory on Wednesdays at 18:00; they usually have a different exercise every week for beginners
- If it looks like there is not much Python programming in this course then that sounds like a good conclusion - that's how Python is mainly used in real life, to glue different components together so they would bring additional value. Don't be afraid to learn other technologies ;)
Lectures/workshops
We'll have something for the first half of the semester so that you will be able to write a Python script that can parse different kinds of input, process it and output something with added value (blog, reports, etc.):
- Hello world with Python, setting up Git repo
- Working with text files, CSV, messing around with Unicode
- Working with JSON, XML, Markdown files
- Using matplotlib and charting data in general
- Using numpy and scipy
- Interacting with databases
- Building networked applications
- Threads and event loops, running apps under uwsgi, using server-side events
- Regular expressions
- Working with Falcon API framework
- Working with Django web framework, ORM and templating engines
- Network application security
These are the topics to learn if you're afraid to pick anything else below.
Lectures/labs
Lecture/lab #1
In this lecture/lab we are going to see how we can parse Apache web server log files. These log files contain information about each HTTP request that was made against the web server. Get the example input file here and check out what the file format looks like. If you are working remotely on enos.itcollege.ee you can simply refer to /var/log/apache2/access.log
Easily readable version:
fh = open("access.log")
keywords = "Windows", "Linux", "OS X", "Ubuntu", "Googlebot", "bingbot", "Android", "YandexBot", "facebookexternalhit"
d = {} # Curly braces define empty dictionary
total = 0
for line in fh:
    total = total + 1
    try:
        source_timestamp, request, response, referrer, _, agent, _ = line.split("\"")
        method, path, protocol = request.split(" ")
        for keyword in keywords:
            if keyword in agent:
                try:
                    d[keyword] = d[keyword] + 1
                except KeyError:
                    d[keyword] = 1
                break # Stop searching for other keywords
    except ValueError:
        pass # This will do nothing, needed due to syntax
print "Total lines:", total
results = d.items()
results.sort(key = lambda item:item[1], reverse=True)
for keyword, hits in results:
    print keyword, "==>", hits, "(", hits * 100 / total, "%)"
Refined version:
fh = open("access.log")
keywords = "Windows", "Linux", "OS X", "Ubuntu", "Googlebot", "bingbot", "Android", "YandexBot", "facebookexternalhit"
d = {}
for line in fh:
    try:
        source_timestamp, request, response, referrer, _, agent, _ = line.split("\"")
        method, path, protocol = request.split(" ")
        for keyword in keywords:
            if keyword in agent:
                d[keyword] = d.get(keyword, 0) + 1
                break
    except ValueError:
        pass
total = sum(d.values())
print "Total lines with requested keywords:", total
for keyword, hits in sorted(d.items(), key = lambda (keyword,hits):-hits):
    # 100.0 forces floating point division, otherwise Python 2 truncates the percentage
    print "%s => %d (%.02f%%)" % (keyword, hits, hits * 100.0 / total)
Exercises:
- Try to reduce the number of lines:
- Improve the log file parsing with CSV reader or regular expressions.
- Improve the counting with Counter object.
- Add extra functionality:
- What were the top 5 requested URLs?
- Whose URLs are the most popular? Hint: /~username/ at the beginning of the URL denotes a college user account.
- How much traffic is this user causing? Hint: the response size in bytes is in the variable 'response'.
- Use urllib.unquote to normalize paths.
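One possible approach to the exercises above, as a sketch rather than a reference solution: a regular expression for the combined log format together with collections.Counter. The pattern, the function name summarize and the sample lines are assumptions made for illustration; real log files will contain lines that need more forgiving handling.

```python
import re
from collections import Counter

# Assumed pattern for the Apache "combined" log format; adjust it to your logs
LINE_RE = re.compile(
    r'^(?P<source>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

def summarize(lines):
    """Count hits per path and transferred bytes per /~username/ prefix."""
    urls = Counter()
    user_bytes = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # Skip lines that do not look like combined log format
        path = m.group("path")
        urls[path] += 1
        if path.startswith("/~"):
            user = path[2:].split("/")[0]
            if m.group("size") != "-":  # "-" means no response body was sent
                user_bytes[user] += int(m.group("size"))
    return urls, user_bytes

# Made-up sample lines, only for demonstration
sample = [
    '10.0.0.1 - - [10/Mar/2016:09:01:00 +0200] "GET /~alice/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '10.0.0.2 - - [10/Mar/2016:09:02:00 +0200] "GET /~alice/ HTTP/1.1" 200 488 "-" "Googlebot/2.1"',
    '10.0.0.3 - - [10/Mar/2016:09:03:00 +0200] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
urls, user_bytes = summarize(sample)
top_urls = urls.most_common(5)  # List of (path, hits) pairs, most requested first
```

Counter.most_common(5) replaces the manual sorting, and named groups make it harder to mix up fields than positional unpacking of line.split().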
Lecture/lab #2
So far we've dealt with only one file, but usually you're digging through many files and you'd like to automate your work as much as possible. At enos.itcollege.ee you can find all the Apache log files under the directory /var/log/apache2. Download the files to your local machine:
rsync -av username@enos.itcollege.ee:/var/log/apache2/ ~/logs/
Alternatively you can just invoke Python on enos:
ssh username@enos.itcollege.ee
python path/to/script.py
The following snippet simply iterates over the files in the directory and skips the unwanted ones:
import os
# Following is the directory with log files,
# On Windows substitute it where you downloaded the files
root = "/var/log/apache2"
for filename in os.listdir(root):
    if not filename.startswith("access.log"):
        print "Skipping unknown file:", filename
        continue
    if filename.endswith(".gz"):
        print "Skipping compressed file:", filename
        continue
    print "Going to process:", filename
    for line in open(os.path.join(root, filename)):
        pass # Insert magic here
You can use the gzip module to read compressed files denoted with .gz extension:
import gzip
# gzip.open will give you a file object which transparently uncompresses the file as it's read
for line in gzip.open("/var/log/apache2/access.log.1.gz"):
    print line
Combine what you've learned so far to parse all access.log and access.log.*.gz files under /var/log/apache2.
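A minimal sketch of how the two snippets above could be combined, written with Python 3 in mind. The helper names open_log and iter_log_lines are made up for this example, and errors="replace" is an assumption about how to treat bytes that are not valid text.

```python
import gzip
import os

def open_log(path):
    """Open a plain or gzip-compressed log file for reading as text."""
    if path.endswith(".gz"):
        # "rt" makes gzip.open return text lines instead of bytes
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")

def iter_log_lines(root):
    """Yield every line of the access.log* files found directly under root."""
    for filename in sorted(os.listdir(root)):
        if not filename.startswith("access.log"):
            continue  # Skip error logs and other unrelated files
        with open_log(os.path.join(root, filename)) as fh:
            for line in fh:
                yield line
```

With this in place the parsing loop only needs for line in iter_log_lines(root): and no longer cares whether a file was compressed.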
Set up Git, you'll have to do this on every machine you use:
echo | ssh-keygen -C '' -P ''
git config --global user.name "$(getent passwd $USER | cut -d ":" -f 5)"
git config --global user.email $USER@itcollege.ee
git config --global core.editor "gedit -w -s"
Create a repository at GitHub and in your source code tree:
git init
git remote add origin git@github.com:user-name/log-parser.git
git add *.py
git commit -m "Initial commit"
git push -u origin master
Also create a .gitignore file and push the changes. See the example repository here.
Exercises:
- Add extra functionality:
- Create a humanize() function which takes a number of bytes as input and returns a human-readable string (e.g. 8192 bytes becomes 8kB and 5242880 becomes 5MB)
- Use argparse to supply directory path during script invocation and make your program configurable.
Lecture/lab #3
Use the following to parse the command-line arguments:
import argparse
parser = argparse.ArgumentParser(description='Apache2 log parser.')
parser.add_argument('--path',
    help="Path to Apache2 log files", default="/var/log/apache2")
parser.add_argument('--verbose',
    help="Increase verbosity", action="store_true")
args = parser.parse_args()
print "Log files are expected to be in", args.path
print "Am I going to be extra chatty?", args.verbose
Now invoke the program with default arguments as follows:
python path/to/example.py
The program can be invoked with user-supplied parameters as follows:
python path/to/example.py --path ~/logs --verbose
Use the following to humanize file sizes/transferred bytes; try to make it shorter!
def humanize(bytes):
    if bytes < 1024:
        return "%d B" % bytes
    elif bytes < 1024 ** 2:
        return "%.1f kB" % (bytes / 1024.0)
    elif bytes < 1024 ** 3:
        return "%.1f MB" % (bytes / 1024.0 ** 2)
    else:
        return "%.1f GB" % (bytes / 1024.0 ** 3)
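One possible shorter variant, as a sketch: loop over the unit names instead of repeating the comparison for each unit. It is meant to return the same strings as the version above, with GB as the largest unit. The parameter is called size here only to avoid shadowing the built-in name bytes.

```python
def humanize(size):
    """Return a human-readable string for a size given in bytes."""
    if size < 1024:
        return "%d B" % size
    for unit in ("kB", "MB", "GB"):
        size /= 1024.0
        # Stop at the first unit that fits, or at GB which is the largest one
        if size < 1024 or unit == "GB":
            return "%.1f %s" % (size, unit)
```

Adding a larger unit then only means extending the tuple and moving the cap to the new last unit.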
Use datetime to manipulate date/time information, example here:
import os
from datetime import datetime
files = []
for filename in os.listdir("."):
    mode, inode, device, nlink, uid, gid, size, atime, mtime, ctime = os.stat(filename)
    files.append((filename, datetime.fromtimestamp(mtime), size)) # Append 3-tuple to list
files.sort(key = lambda(filename, dt, size):dt)
for filename, dt, size in files:
    print filename, dt, humanize(size) # humanize() is the function defined above
print "Newest file is:", files[-1][0]
print "Oldest file is:", files[0][0]
Exercises:
- Add functionality to our log parser:
- What is the timespan (from-to timestamp) for the results? Use datetime.strptime to parse the timestamps from log files.
- Add support for Common Log Format.
- Which URLs produced the most errors? Hint: Use the HTTP status code to determine if there was an error or not.
- What were the operating systems used to visit the URLs?
- What were the top 5 Firefox versions used to visit the URLs?
- What were the top 5 referrers? Their hostnames?
- Advanced:
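For the timespan exercise above, a minimal sketch of parsing Apache timestamps with datetime.strptime. The function name parse_timestamp and the sample timestamps are made up for illustration. The timezone offset is simply dropped here, which is an assumption that only holds if all entries share the same offset; note also that %b depends on the locale for month names.

```python
from datetime import datetime

def parse_timestamp(stamp):
    """Parse an Apache timestamp such as '10/Mar/2016:09:01:00 +0200'.

    The timezone offset after the space is ignored, so the result is a
    naive datetime in the server's local time.
    """
    return datetime.strptime(stamp.split(" ")[0], "%d/%b/%Y:%H:%M:%S")

# Made-up timestamps as they appear between [ and ] in the log file
stamps = ["10/Mar/2016:09:01:00 +0200", "03/Mar/2016:18:30:15 +0200"]
parsed = [parse_timestamp(s) for s in stamps]
first, last = min(parsed), max(parsed)
timespan = last - first  # A timedelta object
```

Since datetime objects compare chronologically, min() and max() are enough to find the from-to range without sorting the whole list.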
Lecture/lab #4 (3. March)
This time we'll try to make some sense out of IP addresses found in the log file.
In case of a personal Ubuntu machine install additional modules for Python 2.x:
apt-get install python-geoip python-ipaddr python-cssselect
For Python 3.x on Ubuntu:
apt-get install python3-geoip python3-ipaddr
For Mac you can try:
pip install geoip ipaddr lxml cssselect
On Ubuntu you can install GeoIP database as a package, but note that it might be out of date:
sudo apt-get install geoip-database # This places the database under /usr/share/GeoIP/GeoIP.dat
To get up to date database or to download it for Mac:
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gunzip GeoIP.dat.gz
Run example:
import GeoIP
gi = GeoIP.open("/usr/share/GeoIP/GeoIP.dat", GeoIP.GEOIP_MEMORY_CACHE)
print "Gotcha:", gi.country_code_by_addr("194.126.115.18").lower()
Download world map in SVG format:
wget https://upload.wikimedia.org/wikipedia/commons/0/03/BlankMap-World6.svg
SVG is essentially an XML-based language for describing vector graphics, hence you can use standard XML parsing tools to modify such a file. Use lxml to highlight a country in the map and save the modified file:
from lxml import etree
from lxml.cssselect import CSSSelector
document = etree.parse(open('BlankMap-World6.svg'))
sel = CSSSelector("#ee")
for j in sel(document):
    j.set("style", "fill:red")
    # Remove styling from children
    for i in j.iterfind("{http://www.w3.org/2000/svg}path"):
        i.attrib.pop("class", "")
with open("highlighted.svg", "w") as fh:
    fh.write(etree.tostring(document))
Exercises:
- Add GeoIP lookup to your log parser
- Highlight countries on the world map
- Use HSL color codes to make your life easier
- Commit changes to your Git repository, but do NOT include the GeoIP database in your program source
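As a hint for the HSL exercise, a small sketch of turning a hit count into a color string. The function name hsl_for_hits, the green hue of 120 and the lightness range from 90% down to 30% are arbitrary choices for this example.

```python
def hsl_for_hits(hits, max_hits):
    """Map a hit count to a CSS HSL color: light for few hits, dark for many."""
    if max_hits <= 0:
        return "hsl(120, 50%, 90%)"  # Nothing to compare against, use the lightest shade
    share = float(hits) / max_hits
    lightness = 90 - int(round(share * 60))  # 90% (no hits) down to 30% (most hits)
    return "hsl(120, 50%%, %d%%)" % lightness
```

The result can be used in place of the fixed color in the lxml example, for instance j.set("style", "fill:" + hsl_for_hits(hits, max_hits)). Only one number changes per country, which is what makes HSL easier to work with here than RGB codes.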
Lecture/lab #5 (10. March)
In this lab we take a look at how we can use the Jinja templating engine to output HTML.
In case of a personal Ubuntu machine install additional modules for Python 2.x:
apt-get install python-jinja2
For Python 3.x on Ubuntu:
apt-get install python3-jinja2
For Mac you can try:
pip install jinja2
The template placed in report.html next to main.py:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <title>Our awesome report</title>
    <link rel="stylesheet" href="css/style.css" type="text/css"/>
    <script type="text/javascript" src="js/main.js"></script>
  </head>
  <body>
    <h1>Top bandwidth hoggers</h1>
    <ul>
      {% for user, bytes in user_bytes[:5] %}
        <li>{{ user }}: {{ humanize(bytes) }}</li>
      {% endfor %}
    </ul>
    <h1>Visits per country</h1>
    <img src="highlighted.svg"/>
  </body>
</html>
The Python snippet for generating output.html from report.html:
import os
import codecs
from jinja2 import Environment, FileSystemLoader # This is the templating engine we will use

# user_bytes is expected to be a dictionary of transferred bytes per user collected by the log parser
user_bytes = sorted(user_bytes.items(), key = lambda item:item[1], reverse=True)

env = Environment(
    loader=FileSystemLoader(os.path.dirname(os.path.realpath(__file__))),
    trim_blocks=True)

with codecs.open("output.html", "w", encoding="utf-8") as fh:
    fh.write(env.get_template("report.html").render(locals()))
    # locals() is a dict which contains all locally defined variables ;)
os.system("x-www-browser file://" + os.path.realpath("output.html") + " &")
Project ideas
Pastebin clone
Pastebin.com is a popular website for sharing code snippets via randomly generated URLs. Due to security and privacy reasons some teams cannot use third-party operated websites such as Pastebin.com. It would be nice to have an open-source implementation of Pastebin which could be installed on premises.
- Use Falcon to implement the API.
- Use plain text files to store the pastes (data/<uuid prefix>/<uuid>/original_filename.ext).
- Use Pygments for syntax highlight.
- Add CAPTCHA for throttling IP addresses that submit anonymously.
- Document how you can run the app under uWSGI.
- optional: Add Kerberos support for authentication with AD domain computers
- Add configuration file which could be used to toggle features: anonymous submitting allowed, Kerberos enabled, path to directory of pastes etc
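The storage layout mentioned above (data/<uuid prefix>/<uuid>/original_filename.ext) could be sketched roughly as follows. The function name paste_path and the prefix length of two characters are assumptions; the project description does not specify them.

```python
import os
import uuid

def paste_path(root, paste_id, original_filename, prefix_length=2):
    """Build a <root>/<uuid prefix>/<uuid>/<filename> path for a paste."""
    # basename() drops directory components from user-supplied names,
    # so a name like ../../etc/passwd cannot escape the paste directory
    safe_name = os.path.basename(original_filename)
    return os.path.join(root, paste_id[:prefix_length], paste_id, safe_name)

paste_id = uuid.uuid4().hex  # Random identifier, 32 hexadecimal characters
path = paste_path("data", paste_id, "example.py")
```

Splitting by prefix keeps the number of entries per directory manageable once there are many pastes.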
Chat/video conferencing
WebRTC is an exciting technology built into modern web browsers which enables peer-to-peer data transfers between browsers. WebRTC can be used to implement text-based chat, file transfers and video calls. One possible idea here is to implement something usable for a small company and provide integration with an Active Directory or Samba based domain controller.
- easy: Basic user/session management
- easy: Mobile friendly UI
- medium: Phonebook integration via LDAP
- medium: Single sign-on via Kerberos
Example snippet for fetching full user names over LDAP:
import ldap, ldap.sasl
l = ldap.initialize('ldap://intra.itcollege.ee', trace_level=2)
l.set_option(ldap.OPT_REFERRALS, 0)
l.sasl_interactive_bind_s('', ldap.sasl.gssapi())
r = l.search_s('dc=intra,dc=itcollege,dc=ee',ldap.SCOPE_SUBTREE,'(&(objectClass=user)(objectCategory=person))',['cn','mail'])
for dn,entry in r:
    if not dn: continue
    full_name, = entry["cn"]
    print full_name
Enhanced web server index view
It is relatively easy to configure nginx/Apache to show a fancier directory index, which could be used for example to enable multimedia playback for a directory served via the web. There is already some code which can be used as a basis.
Pythonize robots
The current football robot software stack is written in C++ using the Qt framework. With proper layering we could move it to Python while still keeping performance-sensitive stuff in C/C++ libraries such as OpenCV. This way we could more easily get newbies involved in the actual game strategy programming.
At first glance the new engine could include the following (see the preliminary example PyRobovision):
- hardcore: engine based on event loop (epoll)
- done: use OpenCV Python bindings for image recognition. Guide for Windows is here
- hardcore: support loading Python scripts from files to be used for game logic
- done: support streaming MJPEG to the web browser for debugging
- done: support overlay of interesting scene objects in the browser
- hardcore: support websockets to interact with a web browser
- überhardcore: explore PyCUDA if that sounds like a viable approach
- überhardcore: explore machine learning for certain aspects
Some of these things are of course far-fetched. We can simply start with an event loop that forwards frames to a web browser and then improve that step by step. In reality it would be good enough to have something by the end of the semester that could be reused for the next Robotex.
Butterknife
Butterknife is a tool for deploying a Linux-based desktop OS on bare metal. It's pretty much usable, but could use some refactoring and extra features.
- easy: Add Travis CI tests
- easy: Add unittests
- easy: Add automatable nightly builds for templates
- easy: Add init subcommand for setting up Butterknife server
- easy: Set up Butterknife server for robot firmware(s)
- medium: Fix push/pull
- hardcore: Online incremental upgrades and tray icon
- hardcore: Dockerize Butterknife server
Hardcore tasks are for those who *really* want to understand how a Linux-based OS is put together. Every decent hacker has a distribution named after him/her right? ;)
Certidude
Certidude is a tool for managing (VPN) certificates and setting up services (StrongSwan, OpenVPN, Puppet?) to use those certificates. There's a lot of room for experimentation and learning how different software/hardware components and technologies work together.
- done: Fix nchan support
- easy: Fix Travis CI
- easy: Add command-line features
- easy: Add OpenVPN support, goes hand-in-hand with Windows packaging
- easy: Add Puppet support, goes hand-in-hand with autosign for domain computers below
- easy: Add minimal user interface with GTK or Qt bindings
- medium: Certificate signing request retrieval from IMAP mailbox
- medium: Certificate issue via SMTP, goes hand-in-hand with previous task
- medium: Certificate renewal
- medium: Add unittests
- medium: LDAP querying for admin group membership
- medium: Autosign for domain computers (=Kerberos authentication)
- medium: Refactor tagging (?)
- hardcore: Add (service+UI) packaging for Windows as MSI
- hardcore: Add SCEP support
- hardcore: Dockerize Certidude server
The topics discussed in this project have significant overlap with the authentication/authorization and firewalls/VPNs electives next year, so doing this kind of stuff already now makes it easier to comprehend next year ;)
Active Directory web interface
Some stuff was written for managing users in an OpenLDAP database in 2014. It should take reasonable effort to patch the code to work with MS Active Directory and Samba4. Samba Python scripts can be used to talk to the domain controller. Some code for adding users by Estonian ID-code is already there. Should be doable by a capable student or two. This should be easily combinable with Web Application Programming (Võrgurakendused) ;)
- easy: Add Travis CI
- medium: Port to AD/Samba4
- medium: Add group management
- medium: Add Kerberos support for authenticating users
- medium: Check membership of domain admins group via LDAP
- medium: One-time registration link generation, for sending account creation link to a friend
- hardcore: Check delegation instead of group membership
- hardcore: Dockerize Samba4 + web interface
The topics discussed in this project have significant overlap with authentication/authorization elective next year, so doing this kind of stuff already now makes it easier to comprehend next year ;)