CS 4501: Introduction to Computer Vision
Sample Final Project Topics
You are by no means restricted to these final project topics. But you can look at this list to get ideas for projects.
Sample Topics
- Neural network agents for playing Atari games, similar to deep Q-networks (DQN). A warning, though: getting this to work well is apparently not so easy. You may wish to look at the previous final writeup of Zack Verham and Prabhat Rayapati. Another variant on this is the VizDoom project (see also a paper behind the winning submission).
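If you go the DQN route, the two core pieces are a small convolutional Q-network over stacked frames and a Bellman-backup loss. Below is a minimal PyTorch sketch under those assumptions (4 stacked 84x84 grayscale frames, as in the DQN papers); the environment loop, replay buffer, epsilon-greedy exploration, and target-network update schedule are all omitted.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small convolutional Q-network over 4 stacked 84x84 grayscale frames."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 512),
                                  nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        return self.head(self.features(x))

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Bellman-backup loss on a sampled (state, action, reward, next_state, done) batch."""
    s, a, r, s2, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values           # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.smooth_l1_loss(q, target)
```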
- Computer-generated art using CNNs. There are a bunch of startups doing this, such as Prisma. One possibility would be to reproduce the system of Gatys et al. A more ambitious task would be to incorporate some MRF matching terms, such as in the paper by Li and Wand 2016.
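For the Gatys et al. approach, the key ingredients are a content loss on VGG feature maps and a style loss on Gram matrices of those features. A rough PyTorch sketch, assuming you have already extracted feature maps from a pretrained VGG (the feature extraction itself is not shown):

```python
import torch

def gram_matrix(feat):
    # feat: (1, C, H, W) feature map -> (C, C) Gram matrix of channel correlations
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t() / (c * h * w)

def content_loss(gen_feat, content_feat):
    return torch.mean((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    return torch.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)

# The generated image's pixels are then optimized (e.g. with L-BFGS) to minimize
#   alpha * content_loss  +  beta * (style_loss summed over several VGG layers)
```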
- Image classifiers for a topic of your interest: for example, classify species of birds, dishes of food (is this food healthy or unhealthy?), sports games (are the players active, or is it a boring stretch we can skip?), "good" vs. bad photos, or any other topic that interests you. One easy way to create a custom classifier is to train a LeNet-based classifier. Another, generally better, way is to fine-tune an ImageNet-trained classifier such as AlexNet, although this is easier to do with a GPU. One might experiment with both of these approaches. One challenge in this topic will be creating your custom dataset for supervised learning and using data augmentation with it.
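For the fine-tuning route, the usual recipe is to load an ImageNet-pretrained backbone, replace its final classification layer, and train on your own dataset with augmentation. A minimal torchvision sketch, shown here with ResNet-18 rather than AlexNet (torchvision provides both); the `data/train` folder layout is just an assumed example:

```python
import torch.nn as nn
import torchvision
from torchvision import transforms

train_tf = transforms.Compose([                 # simple data augmentation
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# expects one folder per class under data/train (assumed layout)
dataset = torchvision.datasets.ImageFolder("data/train", transform=train_tf)

# ImageNet-pretrained backbone (recent torchvision; older versions use pretrained=True)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))  # new classifier head
# ...then train model.fc (optionally unfreezing later layers) with a standard
# cross-entropy loop over a DataLoader built from `dataset`.
```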
- On a mobile phone, there is a limited computational budget. Therefore, it might be interesting to detect particular classes of object(s) at a low framerate using an ImageNet-pretrained detection network (e.g. YOLO or Tiny-YOLO), and then track the objects across the intervening frames (at a higher framerate) using optical flow, such as in OpenCV. See the previous final project writeup of Yang You et al. A similar approach could be used for semantic segmentation. An easier version of the same project would be to do all of this on a desktop machine, which could still have the benefit of lowering the computational budget or increasing the framerate. A third possibility is a mobile client-server setup, where a Web server runs a complicated deep model that requires a lot of compute, and a mobile client updates detected object positions using optical flow.
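The tracking half of this can be prototyped with sparse Lucas-Kanade optical flow in OpenCV: run the slow detector only every N frames and propagate feature points in between. A rough sketch, where `run_detector()` stands in for whatever detection network you use and the every-30-frames schedule is just an assumption:

```python
import cv2

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if frame_idx % 30 == 0:
        # boxes = run_detector(frame)   # the expensive CNN detector, run only occasionally
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
    elif pts is not None:
        # propagate the tracked points with Lucas-Kanade optical flow
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        pts = new_pts[status.flatten() == 1].reshape(-1, 1, 2)
        # shift each detected box by e.g. the median motion of the points inside it
    prev_gray = gray
    frame_idx += 1
```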
- Image super-resolution. There have been some papers on this, such as Ledig et al. 2016 and Sajjadi et al. 2016 (the latter is probably easier, since it does not require any new layer types). It is probably best to use a GPU for this project, although a lower-quality super-resolution result could still be obtained on a CPU.
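If GPU time is scarce, one simpler baseline (not from the papers above) is an SRCNN-style three-layer network that refines a bicubically upsampled image; the layer sizes below are just one common choice, not a prescription.

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer super-resolution CNN operating on an upsampled low-res image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # x: low-resolution image bicubically upsampled to the target size
        return self.net(x)

# Train with an L1/L2 loss between SRCNN(bicubic_lowres) and the ground-truth image.
```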
- Various generative adversarial networks (GANs) have recently been used to generate novel imagery. One possibility would be to explore one of the applications in Isola et al. 2017, or a new application in their framework. Note that a recent paper also reports that least-squares losses tend to work better for GANs.
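The least-squares variant is easy to drop in: both losses are plain mean-squared errors on the discriminator's raw (un-sigmoided) scores. A sketch in PyTorch:

```python
import torch

def d_loss_lsgan(d_real, d_fake):
    # push the discriminator's real scores toward 1 and fake scores toward 0
    return 0.5 * (torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2))

def g_loss_lsgan(d_fake):
    # the generator tries to make the discriminator output 1 on its fakes
    return 0.5 * torch.mean((d_fake - 1) ** 2)

# For an Isola et al.-style image-to-image setup, add an L1 term
# lambda * |G(input) - target| to the generator loss.
```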
- Human pose estimation, such as in DeepPose. Some datasets are available for this task, such as FLIC, LSP extended, and MPII.
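A simple starting point in the DeepPose spirit is a CNN that directly regresses 2K joint coordinates with an L2 loss; the ResNet-18 backbone and K = 14 joints below are assumptions for illustration (the original paper used a different backbone).

```python
import torch.nn as nn
import torchvision

K = 14  # e.g. the 14 joints annotated in LSP-style datasets (an assumption)

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(backbone.fc.in_features, 2 * K)   # regress (x, y) for K joints

def pose_loss(pred, target):
    # pred, target: (batch, 2K) joint coordinates, e.g. normalized to [0, 1]
    return nn.functional.mse_loss(pred, target)
```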
- Physical simulation. There has been some pilot work on predicting the next state of a physical system from the previous state using a CNN, such as Tompson et al. 2017. One simpler option would be to start with an existing 2D fluid simulator and try to predict the next frame's fluid velocity directly. Another simple option would be to revisit the idea of the video textures project using a CNN approach, where one trains a CNN to output the next frame of an infinite video.
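For the fluid option, a bare-bones formulation is a small fully convolutional network that maps the current velocity field (two channels, u and v) to the next frame's velocity field, trained on pairs dumped from an existing 2D simulator. The architecture below is only an assumed starting point, not what Tompson et al. use:

```python
import torch.nn as nn

# maps a (batch, 2, H, W) velocity field at time t to the field at time t+1
next_state_net = nn.Sequential(
    nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, kernel_size=3, padding=1),
)
# Train with e.g. MSE between next_state_net(velocity_t) and velocity_{t+1},
# where both fields come from running the existing simulator.
```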
- Video games based on optical flow and/or object detection. There are some fun examples on YouTube: [1], [2]. However, these examples only use optical flow. It might also be interesting to use object detectors or semantic segmentation (e.g. SegNet) to create other kinds of interactions. As a simple example, one could hold up pieces of paper with different printed icons to control different actions in the game (recognizing the different icons could be done by e.g. normalized cross-correlation in OpenCV and/or color histogram matching if the icons are sufficiently colorful). See also the PI's Master's project on a similar topic of controlling animation with puppets.
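Recognizing a printed icon can be prototyped with OpenCV template matching; TM_CCOEFF_NORMED is a zero-mean variant of normalized cross-correlation. The file name and threshold below are placeholders to tune for your own icons.

```python
import cv2

icon = cv2.imread("icon.png", cv2.IMREAD_GRAYSCALE)   # the printed icon template (assumed file)

def icon_visible(frame_bgr, threshold=0.8):
    """Return (is the icon present?, top-left location of the best match)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    result = cv2.matchTemplate(gray, icon, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_val > threshold, max_loc
```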
- Games involving interaction with the physical world. One could imagine creating a "touring the campus" game where you have to find a physical location, such as the Rotunda or the stadium, and snap a photo, whereby a computer vision system (e.g. based on SIFT matching) could validate that the photo matches the target you were looking for and point you to the next place to walk to.
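Validating the snapped photo against a stored target photo can be done by counting good SIFT matches; a sketch with OpenCV (cv2.SIFT_create needs a reasonably recent OpenCV, or opencv-contrib in older versions). The match-count threshold is an assumption you would tune.

```python
import cv2

sift = cv2.SIFT_create()
bf = cv2.BFMatcher()

def matches_target(photo_gray, target_gray, min_good=30):
    """True if the snapped photo has enough good SIFT matches against the target photo."""
    kp1, des1 = sift.detectAndCompute(photo_gray, None)
    kp2, des2 = sift.detectAndCompute(target_gray, None)
    pairs = bf.knnMatch(des1, des2, k=2)
    good = [m for m, n in pairs if m.distance < 0.75 * n.distance]  # Lowe's ratio test
    return len(good) >= min_good
```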
- Structure from Motion. There are existing workflows, such as this one based on VisualSFM and MeshLab, that let one reconstruct 3D scenes from multiple photos (warning: the photos need a lot of texture and overlapping content so the matching succeeds... try it first on some small textured objects). The idea is that one takes many photos, and each photo has a 3D camera viewpoint associated with it in the SfM reconstruction. One application could be to simply run this pipeline and then visualize the resulting reconstruction in a viewer such as three.js (which also has some VR options). But that might involve mostly library calls, so to require some programming, one could imagine adding some interactions. One example (similar to the "touring the campus" game in the previous bullet point): if a user takes a photo in the real world, one could move the camera to the most similar viewpoint in the reconstruction (based on counting SIFT feature matches against the database of photos taken for the reconstruction step).
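The "snap to nearest viewpoint" interaction then reduces to counting SIFT matches between the user's photo and every photo used in the reconstruction and picking the best one. Here `count_sift_matches` is a hypothetical helper along the lines of the ratio-test matcher sketched in the previous bullet.

```python
def nearest_viewpoint(query_gray, database_images):
    """database_images: list of (grayscale photo, camera pose from the SfM output).
    Returns the pose of the photo with the most SIFT matches to the query."""
    best_pose, best_count = None, -1
    for img, pose in database_images:
        count = count_sift_matches(query_gray, img)   # hypothetical helper (see above)
        if count > best_count:
            best_pose, best_count = pose, count
    return best_pose
```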
- Security camera apps, such as tracking people or pets and issuing an alert if movement is detected (e.g. using optical flow); one could even attempt to detect/recognize human faces (see eigenfaces, Fisherfaces, or the Viola-Jones face detector). See Nest Cam.
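A simple motion alert can be built on dense (Farneback) optical flow in OpenCV: raise an alert whenever the mean flow magnitude over the frame exceeds a threshold. The threshold below is an arbitrary value you would calibrate for your camera.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
_, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # dense optical flow between the previous and current frame
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    if magnitude.mean() > 1.0:                 # threshold to calibrate per camera
        print("motion detected")               # send an alert, save the frame, etc.
    prev_gray = gray
```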
- Assistive technologies that could help the visually impaired. One inspiration for this could be the OrCam. One simple project idea could be to combine a camera (e.g. a webcam or mobile camera) with a detector that recognizes hand gestures (e.g. see some tutorials about hand recognition with OpenCV) and then use, say, an ImageNet-pretrained CNN model to announce what object was pointed to. (There are existing text-to-speech engines that could be used, such as "say" on Mac, PTTS on Windows, or espeak on Linux.)
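The "say what the camera sees" piece can be prototyped by classifying the current frame with an ImageNet-pretrained model and piping the label to the system text-to-speech. The sketch below skips the hand detection and classifies the whole frame, and assumes a recent torchvision and the macOS `say` command.

```python
import subprocess
import cv2
import torch
import torchvision
from torchvision import transforms

weights = torchvision.models.ResNet18_Weights.IMAGENET1K_V1
model = torchvision.models.resnet18(weights=weights).eval()
labels = weights.meta["categories"]                 # the 1000 ImageNet class names
preprocess = transforms.Compose([
    transforms.ToPILImage(), transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture(0)
ok, frame = cap.read()                              # grab one frame from the webcam
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
with torch.no_grad():
    probs = model(preprocess(rgb).unsqueeze(0)).softmax(dim=1)
label = labels[int(probs.argmax())]
subprocess.run(["say", label])                      # macOS TTS; use espeak/PTTS elsewhere
```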