UbiComp2019_EduSense
A top-conference paper on smart classroom analytics
Abstract
1) High-quality opportunities for professional development of university teachers require classroom data.
2) Currently, there is no effective mechanism to give personalized formative feedback, except manually.
3) This paper presents the culmination of two years of research: EduSense (with visual and audio features).
4) EduSense is the first to unify previously isolated sensing features into a cohesive, real-time, and practically-deployable system.
Key Words
Classroom, Sensing, Teacher, Instructor, Pedagogy, Computer Vision, Audio, Speech Detection, Machine Learning
Introduction
> Increasing student engagement and participation in class has been shown to effectively improve learning outcomes.
> Compared with K-12 teachers, university instructors are generally only domain experts, not skilled at teaching students.
> Regular, routine feedback on teaching is important for instructors to improve their craft; pedagogical skill is not easy to acquire.
> Acquiring regular, accurate data on teaching practice is currently not scalable.
> Today's teaching feedback relies heavily on professional human observers, which is very expensive.
> EduSense captures a wide variety of classroom facets shown to be actionable in the learning science literature, at a scale and temporal fidelity many orders of magnitude beyond what a traditional human observer in a classroom can achieve.
> EduSense captures both audio and video streams using low-cost commodity hardware that views both the instructor and students.
> Detection: hand raises, body pose, body accelerometry, and speech acts. See Table 1 for details.
> EduSense is the first system to fuse the many previously isolated classroom-sensing features into a single system.
> EduSense strives to do two things: 1) provide instructors with pedagogically relevant classroom data they can use to practice and improve, and 2) serve as an extensible, open platform.
Related Systems
> There is an extensive learning science literature on methods to improve instruction through training and feedback.
> [15] [26] [27] [32] [37] [38] [77] [78] (Note: these appear to all be CMU papers)
2.1 Instrumented Classrooms
> Use sensors (e.g., pressure sensors [2][58]) to collect data on students in class, or instrument the physical structure of the classroom.
- adding computing to the tabletop (e.g., buttons, touchscreens, etc.) or with response systems like "clickers" [1][12][20][21][68]
- low-cost printed responses using color markers [25], QR Codes [17] or ARTags [57]
- Affectiva’s wrist-worn Q sensor [62] senses the wearer’s skin conductance, temperature and motion (via accelerometers)
- EngageMeter [32] used electroencephalography headsets to detect shifts in student engagement, alertness, and workload
- Instrument just the teacher, with e.g., microphones [19].
2.2 Non-Invasive Classroom Sensing
> The premise is to maximize utility while using as few invasive devices as possible. Among non-invasive sensors, acoustic and visual sensing are all but essential for classroom perception.
> Speech
- [19] used an omnidirectional room microphone and a head-mounted teacher microphone to automatically segment teacher and student speech events, as well as intervals of silence (such as after teacher questions).
- AwareMe [11], Presentation Sensei [46] and RoboCOP [75] (oral presentation practice systems) compute speech quality metrics, including pitch variety, pauses and fillers, and speaking rate.
> Vision
- Early systems, such as [23], targeted coarse tracking of people in the classroom, in this case using background subtraction and color histograms.
- Movement of students has also been tracked with optical flow algorithms, as demonstrated in [54][63].
- Computer vision has also been applied to automatic detection of hand raises, including classic methods such as skin tone and edge detection [41], as well as newer deep learning techniques [51] (note: a paper from our lab — Lin Jiaojiao's hand-raise detection).
- It can not only find and count students, but also estimate their head orientation, coarsely signaling their area of focus [63][73][80].
- Facial landmarks can offer a wealth of information about students' affective state, such as engagement [76] and frustration [6][31][43], as well as detection of off-task behavior [7].
- The Computer Expression Recognition Toolbox (CERT) [52] is the most widely used in these educational technology applications, though it is limited to videos of single students.
2.3 System Contribution
> As is customary, the paper first critiques the classroom-sensing systems above:
1) They report isolated metrics in separate publications, and none has been tested or validated in real, large-scale classroom settings.
2) Each system pairs one server with one classroom, and so cannot be rolled out at campus scale.
3) Few of these systems target teaching and pedagogy, and so they do not take advantage of recent breakthroughs in computer vision and deep learning for complex classroom scenes.
> Thus, we believe EduSense is unique in putting together disparate advances from several fields into a comprehensive and scalable system, paired with a holistic evaluation combining both controlled studies and months-long, real-world deployments.
EduSense System
Four key layers: Classrooms layer, Processing layer, Datastore layer, Apps layer
3.1 Sensing
> Early system: depth cameras
> Current system: Lorex LNE8950AB cameras offer a 112° field of view and feature an integrated microphone, costing around $150 at single-unit retail prices. They capture 3840x2160 (i.e., 4K) video at 15 FPS with 16 kHz mono audio.
3.2 Compute
> Early system:
* Small Intel NUCs. However, this hardware approach was expensive to scale, deploy and maintain.
* The earlier version of the system was a large, monolithic C++ application. It not only ran into software engineering problems such as dependency conflicts and overload when adding new modules, but remote deployment was also a headache.
* In addition, the C++ code was hard to combine with Python, the most common language for computer vision; even when forced together, the result was time-consuming and highly unstable. Because components were not isolated from one another, the old system was also prone to errors and crashes.
> Current system:
* The new system uses more reliable IP cameras paired with a centrally located campus server, streaming audio and video between them in real time over RTSP.
* The custom GPU-equipped EduSense server has 28 physical cores (56 with SMT), 196 GB of RAM and nine NVIDIA 1080Ti GPUs.
* The new system uses Docker (container-based virtualization) to run each module in isolation; Docker's advantages need no elaboration.
Fig. 3. Processing pipeline. Video and audio from classroom cameras first flows into a scene parsing layer,
before being featurized by a series of specialized modules. See also Figure 1 and Table 1.
3.3 Scene Parsing
Techniques
> Multi-person body keypoint (joint) detection: OpenPose (with tested and tuned OpenPose parameters)
> Difficult environment: high, wall-mounted (i.e., non-frontal) and slightly fish-eyed views.
> Algorithm: additional logic to reduce false-positive bodies (e.g., bodies too large or small); interframe persistent person IDs with hysteresis (tracking), using a combination of Euclidean distance and body inter-keypoint distance matching (a minimal tracking sketch follows below).
> Speech: predict only silence vs. speech (Laput et al. [48]) + an adaptive background noise filter
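The ID-persistence logic above is only described at a high level. Below is a minimal sketch of that kind of interframe tracking, assuming OpenPose BODY_25 keypoints; the cost weighting, thresholds and class names are illustrative assumptions, not EduSense's actual implementation.

```python
# Minimal sketch of interframe person-ID tracking on OpenPose keypoints.
# Illustrative only: cost weights, thresholds and names are assumptions.
import numpy as np

MATCH_THRESHOLD = 150.0   # max cost (pixels) to accept a match (assumed)
MAX_MISSED_FRAMES = 5     # hysteresis: keep an ID alive this many frames (assumed)

def body_cost(kp_a, kp_b):
    """Cost between two bodies: neck distance + mean change in inter-keypoint distances."""
    neck_dist = np.linalg.norm(kp_a[1] - kp_b[1])   # index 1 = neck in OpenPose BODY_25
    da = np.linalg.norm(kp_a[:, None] - kp_a[None, :], axis=-1)
    db = np.linalg.norm(kp_b[:, None] - kp_b[None, :], axis=-1)
    shape_dist = np.abs(da - db).mean()
    return neck_dist + shape_dist

class Tracker:
    def __init__(self):
        self.tracks = {}      # id -> {"kp": keypoint array, "missed": frame count}
        self.next_id = 0

    def update(self, bodies):
        """bodies: list of (K, 2) keypoint arrays for the current frame. Returns assigned IDs."""
        assigned, ids = set(), []
        for kp in bodies:
            costs = {tid: body_cost(kp, t["kp"]) for tid, t in self.tracks.items()
                     if tid not in assigned}
            tid = min(costs, key=costs.get) if costs else None
            if tid is None or costs[tid] >= MATCH_THRESHOLD:
                tid = self.next_id
                self.next_id += 1
            self.tracks[tid] = {"kp": kp, "missed": 0}
            assigned.add(tid)
            ids.append(tid)
        # hysteresis: age out tracks that were not matched for too long
        for tid in list(self.tracks):
            if tid not in assigned:
                self.tracks[tid]["missed"] += 1
                if self.tracks[tid]["missed"] > MAX_MISSED_FRAMES:
                    del self.tracks[tid]
        return ids
```

Greedy nearest-neighbor matching is enough to illustrate the idea; the "missed" counter is what keeps an ID alive across brief detection dropouts.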
Fig. 4. Top row: Example classroom scenes processed by EduSense (image data is not archived; shown here for reference
and with permission). Bottom row: Featurized data, including body and face keypoints, with icons for hand raise, upper
body pose, smile, mouth open, and sit/stand classification.
Fig. 5. Example participant from our controlled study. EduSense recognizes three upper body poses (left three images)
and various hand raises (right four images). Live classification from our upper body pose (orange text) and hand
classifiers (yellow text) are shown.
3.4 Featurization Modules
> See Figures 1 and 3: the featurization modules mainly take the outputs of the detection and recognition algorithms and turn them into classroom-oriented metrics, which also makes them easy to consume and inspect when debugging.
> For details: open source code repository (http://www.EduSense.io).
- Sit vs. Stand Detection: relative geometry of body keypoints (neck (1), hips (2), knees (2), and feet (2)) + MLP classifier
- Hand Raise Detection: eight body keypoints per body (neck (1), chest (1), shoulders (2), elbows (2), and wrists (2)) + MLP classifier
- Upper Body Pose: the same eight body keypoints + multiclass MLP model (predicting arms at rest, arms closed (e.g., crossed), and hands on face; see Figure 5 above)
- Smile Detection: ten mouth landmarks on the outer lip and ten on the inner lip + SVM for binary classification
- Mouth Open Detection: (as a potential, future way to identify speakers) two features from [71] (left and right / mouth_width) + binary SVM
- Head Orientation & Class Gaze: perspective-n-point algorithm [50] + anthropometric face data [53] + OpenCV's calib3d module [8] (see the sketch after this list)
- Body Position & Classroom Topology: using the face keypoints and camera calibration above, estimate each student's position and project it into a synthesized top-down view (note: similar to the student localization in our own system, but coarser here — no row/column detection and no matching to student behaviors)
- Synthetic Accelerometer: simply track the motion of bodies across frames + 3D head position + delta X/Y/Z normalized by the elapsed time
- Student vs. Instructor Speech: a sound and speech detector using 1) the RMS of the student-facing camera's microphone (closest to the instructor), 2) the RMS of the instructor-facing camera's microphone (closest to the students), and the ratio between the two values + random forest classifier (the goal is to distinguish whether the current speech comes from a student or the instructor)
- Speech Act Delimiting: uses the per-frame speech detection results (note: presumably to segment distinct speech episodes?)
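The Head Orientation module names the PnP algorithm, anthropometric face data and OpenCV's calib3d module. Below is a minimal sketch of that pattern using cv2.solvePnP; the 3D model points, landmark choice and pinhole camera approximation are illustrative assumptions, not the values EduSense uses.

```python
# Minimal head-orientation sketch via perspective-n-point (OpenCV calib3d).
# The 3D "anthropometric" model points and landmark choice are assumed values.
import cv2
import numpy as np

# Generic 3D face model (mm): nose tip, chin, eye corners, mouth corners (assumed).
MODEL_POINTS = np.array([
    [  0.0,   0.0,   0.0],   # nose tip
    [  0.0, -63.6, -12.5],   # chin
    [-43.3,  32.7, -26.0],   # left eye outer corner
    [ 43.3,  32.7, -26.0],   # right eye outer corner
    [-28.9, -28.9, -24.1],   # left mouth corner
    [ 28.9, -28.9, -24.1],   # right mouth corner
], dtype=np.float64)

def head_pose(image_points, frame_size):
    """image_points: (6, 2) pixel coords of the matching face landmarks."""
    h, w = frame_size
    focal = w  # rough pinhole approximation when the camera is uncalibrated
    camera_matrix = np.array([[focal, 0, w / 2],
                              [0, focal, h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros(4)  # assume negligible lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, np.float64),
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 matrix
    # Euler-angle extraction is convention-dependent; used here only for coarse binning.
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw = np.degrees(np.arctan2(-rot[2, 0], np.hypot(rot[2, 1], rot[2, 2])))
    return pitch, yaw, tvec
```

The translation vector tvec also gives a coarse 3D head position, which is the kind of quantity the Body Position and Synthetic Accelerometer modules build on.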
Fig. 6. Left: Training data capture rig in an example classroom. Right: Closeup of center mast, with six cameras.
3.5 Training Data Capture
> First, the various metrics need a large amount of labeled training data, which raises two issues:
1) Many annotators must be recruited, e.g., to label hand raises.
2) Diverse data from different viewpoints is needed, so the authors had to set up their own capture hardware and scenes.
3.6 Datastore
1) Non-image classroom data (ASCII JSON): ~250 MB for one class lasting around 80 minutes with 25 students.
2) Infilled data (real-time class video): about 16 GB per class at 15 FPS with 4K frames from both the front and back cameras (see the quick sanity check below).
3) A web interface (Go app) and MongoDB form the backend server, exposed via a REST API over Transport Layer Security (TLS). (A different technical route and set of details from ours.)
4) "We do not save these frames long-term to mitigate obvious privacy concerns." (Data is not retained; it is simply deleted to sidestep privacy issues.)
5) Secure Network Attached Storage (NAS).
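A quick sanity check (mine, not the paper's) on the storage figures above, assuming the ~16 GB covers both 4K/15 FPS streams for one ~80-minute class:

```python
# Back-of-envelope check on the storage figures above (rounded; assumptions noted).
class_seconds = 80 * 60                 # one ~80-minute class
video_bytes = 16e9                      # ~16 GB covering both cameras (assumed split evenly)
per_camera_mbps = video_bytes * 8 / class_seconds / 2 / 1e6
print(f"implied video bitrate: ~{per_camera_mbps:.0f} Mbit/s per camera")  # ~13 Mbit/s
```

An implied ~13 Mbit/s per 4K camera is a plausible compressed-stream bitrate, so the stated per-class volume is internally consistent.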
3.7 Automated Scheduling & Classroom Processing Instances
> Scheduler: SOS JobScheduler (a different technical choice from ours — we use apscheduler, an open-source Python scheduler)
> FFMPEG instances: record the front and back camera streams (again different from ours — we use OpenCV); a hedged recording sketch follows below.
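Below is a hedged sketch of what an FFmpeg recording instance for one RTSP camera stream might look like; the URL, TCP transport and stream-copy options are assumptions, not EduSense's exact invocation.

```python
# Hedged sketch: record an RTSP camera stream to disk with FFmpeg, as the
# pipeline's recording instances do. Options shown are illustrative assumptions.
import subprocess

def record_stream(rtsp_url: str, out_path: str, duration_s: int) -> None:
    cmd = [
        "ffmpeg",
        "-rtsp_transport", "tcp",   # more robust than UDP over a campus network
        "-i", rtsp_url,             # e.g. rtsp://<camera-ip>/stream (hypothetical)
        "-t", str(duration_s),      # record for the scheduled class duration
        "-c", "copy",               # store the camera's H.264 stream without re-encoding
        out_path,
    ]
    subprocess.run(cmd, check=True)

# Example (hypothetical camera address and class length):
# record_stream("rtsp://192.168.1.20/stream1", "front_camera.mp4", 80 * 60)
```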
3.8 High Temporal Resolution Infilling
EduSense has two data-processing modes: a real-time mode (0.5 FPS) and an infilling mode (full 15 FPS video).
> Real-time mode, as the name suggests, produces the various analytics while the class is in progress; current throughput is one frame every two seconds.
> Infilling mode is non-real-time analysis run during or after class; it provides high temporal resolution and complements the real-time pipeline. This finer-grained analysis can also feed end-of-day reports or semester-long analytics. (A sketch of selecting frames to infill follows below.)
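A tiny sketch of the infilling bookkeeping, under the assumption that the live pass processed one out of every 30 frames recorded at 15 FPS; the frame-indexing scheme is mine, not the paper's.

```python
# Tiny sketch of infilling: given a class recorded at 15 FPS but processed live
# at ~0.5 FPS, list the frame indices that still need featurization.
RECORDED_FPS = 15
REALTIME_FPS = 0.5

def frames_to_infill(total_frames: int):
    stride = int(RECORDED_FPS / REALTIME_FPS)   # 30: one live frame per 30 recorded
    live = set(range(0, total_frames, stride))
    return [i for i in range(total_frames) if i not in live]

# For an 80-minute class: 80*60*15 = 72,000 recorded frames, ~2,400 processed live,
# leaving ~69,600 frames for the infilling pass.
print(len(frames_to_infill(80 * 60 * 15)))      # 69600
```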
3.9 Privacy Preservation
> Measures already taken: EduSense does not store classroom video as a matter of course; when infilling is needed, video is held in a temporary cache and deleted once analysis completes; access to classroom data is controlled per user, role and permission to prevent leaks; individual students are tracked, but no private information is used and the tracking IDs assigned in each class are not linked across sessions; video temporarily kept for further development (testing, validation, and expanding the labeled dataset) is deleted promptly after use.
> Measures planned for the future: show only high-level class aggregates.
Fig. 7. Although EduSense is mostly launched as a headless process, we built a utilitarian
graphical user interface for debugging and demonstration.
3.10 Debug and Development Interface
QT5 GUI + RTSP/local filesystem + many widgets
3.11 Open Source and Community Involvement
- hope that others will deploy the system
- serve as a comprehensive springboard
- cultivate a community
Controlled Study
4.1 Overall Procedure
> five exemplary classrooms
> 5 instructors and 25 student participants
> Following a pre-supplied "instruction sheet", participants performed the requested actions in sequence, while the debug system simultaneously recorded the time, type, and image data of each action.
Fig. 10. Histogram showing the percent of different body keypoints found in three of our experimental contexts.
4.2 Body Keypointing
> OpenPose is used for pose estimation, but it is not robust in classroom scenes, so the authors tuned some of its parameters and added pose-based sanity checks to improve stability and accuracy (similar to my own line of thinking for improving OpenPose?).
> The authors give no rigorous evaluation of the improved OpenPose, only keypoint statistics on a small amount of data (is that a sound way to evaluate?).
> As the figure above shows, detection accuracy is reported for nine body keypoint types; upper-body keypoints are clearly more accurate than lower-body ones (but how much data these accuracies were computed on is unknown).
4.3 Phase A: Hand Raises & Upper Body Pose
> The authors define seven upper-body pose classes: arms resting, left hand raised, left hand raised partial, right hand raised, right hand raised partial, arms closed, and hands on face.
> Student participants were asked to perform each of these classes three times during a session, i.e., 21 instances each.
> Instructor participants were asked to perform arms resting and arms closed three times each, at different classroom positions (left front, center front, right front), i.e., 6 instances each.
> We only studied frames where participants' upper bodies were captured (consisting of head, chest, shoulder, elbow, and wrist keypoints; without these eight keypoints, the hand raise classifier returns null).
> The paper reports hand-raise detection accuracy as high as 94.6%, and accuracy on the other three upper-body pose classes of 98.6% (students) and 100% (instructors), but never states the training/test set sizes, and these numbers come from a specially staged experimental setting — how convincing are they?
Fig. 11. The mouth states captured in our controlled study: mouth closed, closed smile, teeth smile, and mouth open.
4.4 Phase B: Mouth State
> The authors define four mouth states: neutral (mouth closed), mouth open (teeth apart, as if talking), closed smile (no teeth showing), and teeth smile (teeth showing).
> Student participants performed each state three times, i.e., 12 instances each.
> Instructor participants performed each state three times at different positions at the front of the classroom, i.e., 12 instances each.
> On top of the face landmark detection above, the authors classify smiles (accuracies of 78.6% and 87.2%) and open mouths (accuracies of 83.6% and 82.1%). Again, no data volumes are given.
> The authors concede that, due to limited resolution, landmarks for faces in the back rows can hardly be detected accurately, and optimistically argue that higher-resolution cameras will solve this. (In our own tests, even a 4K camera still suffers from low effective resolution, and landmarks also struggle with large head angles and occlusion.)
4.5 Phase C: Sit vs. Stand
> This phase distinguishes two postures: standing and sitting.
> As before, student participants performed each posture three times at random points during the session, i.e., 6 instances per participant; instructors remained standing throughout and did not take part in this phase.
> Sit/stand classification accuracy was about 84.4%. (The authors again do not state the test set size, but from the error rates given in this section the total appears to be about 143 instances.)
> Because classification relies only on 2D keypoint detections, the authors note the method is highly sensitive to camera viewpoint. (Naturally — it is still less accurate and less robust than our direct standing detection.)
> Finally, the authors suggest that depth data could improve this in the future. (I would say depth cameras are not necessarily helpful, and depth data is not easy to collect or train on.)
Fig. 12. Example head orientations requested in our study, with detected face landmarks shown.
4.6 Phase D: Head Orientation
> The authors define eight head orientations: three possible pitches ("down" -15°, "straight" 0°, "up" +15°) × three possible yaws ("left" -20°, "straight" 0°, "right" +20°), omitting directly straight ahead (i.e., 0°/0°). (Once again, a detection/estimation problem is recast as classification.)
> To elicit the target head poses, the authors used smartphones running a pose-estimation app plus printed instruction sheets taped to the desks; see the paper for the procedure.
> As before, student participants performed each of the eight head orientations twice, i.e., 16 instances per participant.
> Unfortunately, in many frames collected, ~20% of landmarks were occluded by the smartphones given to participants — an experimental design error in hindsight. (Unsurprisingly, head-pose estimation that relies on face landmarks is unreliable even in a staged setting.)
> "Which should be sufficient for coarse estimation of attention" — after discarding samples with poor landmark detections, only about a quarter of the data remains; calling results on that subset "sufficient" is a stretch, if not outright misleading.
> The authors conclude that the main problem lies in landmark detection, and that once enough landmarks can be detected the head-orientation problem is solved. (I am skeptical of this technical route.)
4.7 Phase E: Speech Procedure
> This phase only detects whether anyone is speaking, covering both instructors and students.
> The protocol had each of the 30 participants speak once, yielding 30 five-second speech clips and 30 five-second non-speech clips, which were then classified. No-speech clips were recognized with 100% accuracy; speech clips had a single error, for 98.3% accuracy.
> My take: this speech metric and pipeline are too simplistic, and the amount of test data is far too small to be convincing.
4.8 Face Landmarks Results
> Face landmark detection uses off-the-shelf algorithms, e.g., [4][13][44]; most likely [13] (CMU's OpenPose) is the one actually used.
> Again in the staged setting, this section reports landmark detection accuracies that carry little weight.
> Poor registration of landmarks was attributed to limited resolution (the low-resolution issue again).
4.9 Classroom Position & Sensing Accuracy vs. Distance
> We manually recorded the distance of all participants from the camera using a surveyors’ rope
> Computer-vision-driven modules are sensitive to image resolution and vary in accuracy as a function of distance from the camera.
> One open question: won't instructor and student detections overlap? In other words, won't each group appear in the other's camera view? If so, the paper does not discuss how to tell them apart.
Fig. 15. Runtime performance of EduSense’s various processing stages
at different loads (i.e., number of students).
4.10 Framerate and Latency
> The evaluation here considers only processing of saved video, not the real-time system.
> Unsurprisingly, body keypointing and face landmarking, the two foundational stages, take most of the time; face landmarking in particular grows with the number of people in the image. (I have doubts here: pose estimation uses the bottom-up OpenPose algorithm, so its runtime should not simply grow linearly with the number of people, but in the figure above the keypointing runtime does not increase at all as the person count goes from 0 to 54, which seems implausible — in my own tests OpenPose's joint-grouping step also consumes noticeable CPU time. The reported OpenPose runtime of only a few tens of milliseconds is also hard to believe; even at half of 1K input resolution it takes about a second in my experience.)
> The runtimes of the other processing stages look unremarkable.
Real-world Classrooms Study
5.1 Deployment and Procedure
> We deployed EduSense in 13 classrooms at our institution and recruited 22 courses for an "in-the-wild" evaluation (with a total student enrollment of 687).
> 360.8 hours of classroom data.
> 438,331 student-facing frames and 733,517 instructor-facing frames were processed live, with a further 18.3M frames infilled after class to bring the entire corpus up to 15 FPS temporal resolution.
> We randomly pulled 100 student-view frames (containing 1,797 student body instances) and 300 instructor-view frames (containing 291 instructor body instances; i.e., nine frames did not contain instructors) from our corpus.
> "This subset is sufficiently large and diverse" (I beg to differ.)
> To provide ground truth labels, two human coders not involved in the project were hired. (Compared with our own annotation effort, EduSense's labeling work is quite thin.)
> It was not possible to accurately label head orientation and classroom position. (Many metrics are only coarse estimates; position in particular could be measured and evaluated more precisely using our row/column representation.)
5.2 Body Keypointing Results
> EduSense found 92.2% of student bodies and 99.6% of instructor bodies. (Even the real-classroom evaluation is confined to a small amount of data, which limits how convincing it is.)
> 59.0% of student and 21.0% of instructor body instances were found to have at least one visibly misaligned keypoint. (Real-world quality may not be that good.)
> "We were surprised that our real-world results were comparable to our controlled study, despite operating in seemingly much more challenging scenes." (The authors' explanation: compared with the deliberately complex poses and head orientations staged in the controlled study, real classes are more chaotic, but students generally face forward and lean on their desks, which is easier to recognize.)
5.3 Face Landmarking Results
> Again, face detection accuracy and the corresponding face-landmark accuracy are reported separately for students and instructors on a subset of the data (no results on a large-scale annotated dataset).
> The authors note that, despite the more complex real-world scenes, face detection remained quite robust. (That is to the credit of the off-the-shelf algorithm — what is the point of claiming it here?)
5.4 Hand Raise Detection & Upper Body Pose Classification
> Hand raises in our real-world dataset were exceedingly rare. (No surprise: with only 22 courses sampled, and in university lecture settings, hand-raise examples were bound to be scarce.)
> Of the 1,797 student body instances, only 6 had a hand raised (less than 0.3% of all body instances). Of those six, EduSense correctly labeled three, incorrectly labeled three, and missed none, for a true-positive accuracy of 50.0%. There were also 58 false-positive hand-raise instances (3.8% of all body instances). (The hand-raise results are dismal.)
> The real-world results for the other poses are not great either, and suffer from the same problem: too little data to be convincing.
5.5 Mouth Smile and Open Detection
> Only 17.1% of student body instances had the requisite mouth landmarks present for EduSense's smile detector to execute (even less usable data). Student smile vs. no-smile classification accuracy was 77.1%.
> Only 21.0% of instructor body instances had the required facial landmarks (again, far fewer test samples). Instructor smile vs. no-smile classification accuracy was 72.6%.
> Mouth open/closed detection accuracy was stronger: 96.5% (students) and 82.3% (instructors). (Note that the vast majority of samples, about 94.8%, were mouth-closed; the authors argue that an open mouth is harder to notice than a smile.)
> Finally, the authors return to the resolution issue: mouth open/closed detection depends strongly on the resolution of the mouth region, and annotators' judgments of an open mouth are also somewhat subjective. Hence this metric is only preliminary.
5.6 Sit vs. Stand Classification
> We found that the vast majority of student lower bodies were occluded, which did not permit our classifier to produce a sit/stand classification, and thus we omit these results. (The real-world evaluation therefore does not include student sit/stand classification.)
> Instructors had detectable lower-body keypoints in only 66.3% of frames; within those, sitting and standing were recognized with 90.5% and 95.2% accuracy respectively. (A small amount of data — how trustworthy is this?)
5.7 Speech/Silence & Student/Instructor Detection
> For speech/silence classification, the authors selected 50 five-second clips with speech and 50 five-second clips without, and measured 82% accuracy.
> For student/instructor detection, they selected 25 ten-second clips of instructor speech and 25 ten-second clips of student speech, and achieved only 60% accuracy at telling the speakers apart. (Expected — close to the 50% chance level.)
> The authors argue that speaker detection at this stage is heavily affected by classroom layout and microphone placement, and that only two audio capture devices is simply not enough; solving this properly would require a more sophisticated approach, i.e., speaker identification.
5.8 Framerate & Latency
> See Figure 15 for the detailed runtime breakdown.
> "We achieve a mean student view processing framerate of between 0.3 and 2.0 FPS." (Is offline video processing really that fast at present?) The instructor pipeline runs 2-3 times faster.
> Based on the runtime analysis, end-to-end latency of the real-time system is 3-5 seconds, with contributions from largest to smallest: IP cameras > backend processing > storing results > transmission (wired network).
> The authors expect that future higher-end IP cameras will reduce this latency and enable large-scale deployment of the real-time system (5G plus high-end embedded camera processing chips?).
End-user Applications
Fig. 16. Preliminary classroom data visualization app.
> Our future goal with EduSense is to power a suite of end-user, data-driven applications.
> How to design the front-end presentation also takes care; the authors propose several possible options (see the list below).
> The authors then reiterate the potential positive effects of EduSense detecting instructor metrics and offering real-time advice (including gaze direction [65], gesticulation through hand movement [81], smiling [65], and moving around the classroom [55][70]).
> A web-based data visualizer (Figure 16): Node.js + ECharts + React (front-end stack).
- Tracking the elapsed time of continuous speech, to help instructors inject lectures with pauses, as well as opportunities for student questions and discussion. (Instructor speech detection + a timer?)
- Automatically generated suggestions to increase movement at the front of the class (instructor trajectory?) and to modify the ratio of facing the board vs. facing the class. (Instructor orientation ratio?)
- A cumulative heatmap of all student hand raises thus far in the lecture, which could facilitate selecting a student who has yet to contribute. (Hand-raise heatmap? See the sketch after this list.)
- A histogram of the instructor's gaze could highlight areas of the classroom receiving less visual attention. (Instructor gaze tracking + statistics?)
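As a minimal sketch of the cumulative hand-raise heatmap idea above (the app is only proposed, not implemented in the paper), one could bin the top-down student positions EduSense already estimates; the grid size, room dimensions and event format below are assumptions.

```python
# Minimal sketch of a cumulative hand-raise heatmap from top-down positions.
# Grid size, room dimensions (meters) and event format are illustrative assumptions.
import numpy as np

GRID = (10, 10)               # coarse top-down bins across the classroom floor (assumed)
ROOM_W, ROOM_D = 10.0, 8.0    # classroom width/depth in meters (assumed)

def hand_raise_heatmap(events):
    """events: iterable of (x_m, y_m) top-down positions where a hand raise was detected."""
    heat = np.zeros(GRID)
    for x, y in events:
        col = min(int(x / ROOM_W * GRID[1]), GRID[1] - 1)
        row = min(int(y / ROOM_D * GRID[0]), GRID[0] - 1)
        heat[row, col] += 1
    return heat

# Example: three raises near the front-left of the room, one at the back-right.
print(hand_raise_heatmap([(1.2, 1.0), (1.3, 1.1), (1.1, 0.9), (9.0, 7.5)]))
```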
Discussion
> "Taken together, our controlled and real classroom studies offer the first comprehensive evaluation of a holistic audio- and computer-vision-driven classroom sensing system, offering new insights into the feasibility of automated class analytics." (A long sentence and a bold claim.)
> From the controlled and real-world studies, the authors offer some deployment advice: avoid overly large classrooms (no more than about 8 m front to back) and mount cameras where they give a good view of the room.
> The authors note that algorithmic errors propagate through the system, and the paper discusses the upper and lower bounds of each module stage by stage.
> They further note that much work remains; improving the system will require ongoing help from the research community and close engagement with end users at universities and high schools.
> "We also envision EduSense as a stepping stone towards the furthering of a university culture that values professional development for teaching." (A fine aspiration, and also our vision for the system we are building.)
Conclusion
- 1. We have presented our work on EduSense, a comprehensive classroom sensing system that produces a wide variety of theoretically-motivated features, using a distributed array of commodity cameras. (Contribution)
- 2. We deployed and tested our system in a controlled study, as well as real classrooms, quantifying the accuracy of key system features in both settings. (Analysis)
- 3. We believe EduSense is an important step towards the vision of automated classroom analytics, which holds the promise of a fidelity, scale and temporal resolution that are impractical with the current practice of in-class observers. (Vision)
- 4. To further our goal of an extensible platform for classroom sensing that others can also build on, EduSense is open sourced and available to the community. (Call to action)